Walking down the hall towards the next set of talks, I was accosted by Miguel de Icaza and friends. He is standing outside the door to this room redirecting people to Nat's talk, promising entertainment and an overall wild time. Since I hadn't made up my mind which talk to go to, I decided to give it a try. I'll be missing the Netatalk discussion by Adrian Sun and "Public Relations" by Tom Geller, but Miguel is providing lots of entertainment here, even before the talk begins.
Nat, meanwhile, is from MIT. He's working on setup now. Miguel seems to be fairly effective. This is the first room for talks you come to, but also the smallest. Many of the talks in this room have been pretty sparsely attended compared to the other rooms. Anyone heading to another talk has to get by Miguel first, which sounds like it is becoming difficult.
"Please go in here. Just do it ...", says Miguel. Wow, the room is filling up now. "Nat! Make some funny faces!" This is called out every few minutes or so, apparently to demonstrate that the talk will be amusing as well as enlightening.. "Nat! Wave to the nice people!" Not Miguel, this time, but the other procuror. Alan Cox has grabbed Miguel and forced him to sit down. The two of them are heading to the front. Apparently the harassment in the hallway had reached too high a level. No! He's escaped! Miguel heads to the back of the room. Well, it is still quiet, so he must have given up the task.
He achieved his goal, though. The room is comfortably full, with maybe ten or twenty open chairs.
Soon I may even find out what GNU Rope is ...
The presentation was done with MagicPoint, they were proud to announce.
What is GNU rope? Ah, this is the "grope" I've heard about.
It was inspired by SGI's cord program. It rearranges functions and optimizes programs so that they load faster, use less memory and are nicer to the cache.
So Miguel said, "We *must* have this program for Linux!" "rope" is a pun on "cord", and combined with GNU it makes a great word.
It produces a 30% reduction in memory use by gcc.
They are going to explain why rearranging the functions speeds things up, how they do it, and how you can use it.
Code is loaded into memory a page at a time. If functions that call each other are arranged so that they load at the same time, paging is reduced. Take functions that have "a high affinity" for each other and keep them on the same page. This reduces memory consumption, improves load times, and improves cache performance.
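To make that concrete, here is a toy sketch (mine, not from the talk; the 4K page size, function sizes, and call trace are all made up) showing how packing hot functions together shrinks the working set:

    # Toy illustration: lay out the same four functions in two orders and
    # count how many 4K pages a hot call sequence touches.
    PAGE_SIZE = 4096

    def page_map(ordering, sizes):
        # assign each function the page its start lands on
        pages, offset = {}, 0
        for fn in ordering:
            pages[fn] = offset // PAGE_SIZE
            offset += sizes[fn]
        return pages

    sizes = {"main": 3000, "parse": 3000, "emit": 3000, "help": 3000}
    trace = ["main", "parse", "emit", "main", "parse", "emit"]  # hot path

    naive  = ["main", "help", "parse", "emit"]  # cold "help" splits the hot path
    groped = ["main", "parse", "emit", "help"]  # hot functions packed together

    for name, order in [("naive", naive), ("groped", groped)]:
        touched = {page_map(order, sizes)[fn] for fn in trace}
        print(name, "working set:", len(touched), "pages")  # 3 pages vs. 2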
Typically "code coverage" is relatively low for big programs. For example, gcc compiling haifa-sched.c, only 976 functions out of 3000 some were actually used.
How does Grope work?
For example: everyone goes to Slashdot, so optimize Netscape for Slashdot. The profiling tools that existed before he started were basically gprof. It has a callgraph, which shows which functions call each other, and a PC histogram, which answers the question of where the program spends its time. He wanted to gather other types of data, namely: how much memory does my program use over time? He also needed a framework for profiling modes, so he created the PROFMODES environment variable so that you can modify the profiling modes at run time.
He showed a graph of memory use by gcc over time. The red line showed the unoptimized (unmolested) gcc and the green one showed the version he "groped". (Richard was happy to see another "naughty" name in the GNU collection.) It showed a 30% reduction in memory use.
So after profiling, you need to determine your ordering. What are your goals? You want to reduce the working set size, the number of pages in memory at any time. How do we do this? We put related functions together. There are special cases where you want to do a really good job. "Dude, to do a really good job, you have to get the UltraSPARC case. If you can do this, that would be *awesome*." If you can take into account the specific behavior of cache on a specific architecture, it will make a tremendous difference.
Your input is your profiling data and executable. Your output is architecture-specific orderings of functions. His options for algorithms included gprof, the Node Grouping Algorithm, and more. The gprof algorithm was written by Jeff Law. It uses normal gprof callgraph data: it figures out what functions haven't been called and sticks them at the end, figures out what is called a lot and sticks that at the beginning, and then puts the remaining functions together as a chain of function calls, A calls B calls C. However, recent papers no longer use call graph data; it is not sufficient for the best optimization. We need more information about how child functions get called relative to each other. Perhaps B and C need to be grouped together, and D and E need to be grouped together.
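A minimal sketch of that gprof-based ordering, as I understood the description (my Python reconstruction, not Jeff Law's actual code; the input shapes are assumptions):

    # call_counts: {function: times called}; calls: {caller: [callees, ...]}
    def gprof_order(call_counts, calls):
        # uncalled functions go at the end, hottest at the beginning
        never = [f for f, n in call_counts.items() if n == 0]
        hot = sorted((f for f, n in call_counts.items() if n > 0),
                     key=lambda f: call_counts[f], reverse=True)
        ordering, seen = [], set(never)
        for fn in hot:
            chain = fn
            while chain is not None and chain not in seen:
                seen.add(chain)
                ordering.append(chain)   # follow the chain: A calls B calls C
                unseen = [c for c in calls.get(chain, []) if c not in seen]
                chain = unseen[0] if unseen else None
        return ordering + never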
So he stored the information in a graph. You take the sequence of function accesses and you section off a window of them. For every window, make arcs between the functions. This generates a sequence graph, showing a relationship between functions that don't call each other directly. A call graph is a degenerate case of a sequence graph with a window of two.
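Here is how building such a sequence graph might look (my own illustration of the windowing idea, not grope's code):

    from collections import Counter
    from itertools import combinations

    def sequence_graph(trace, window):
        # slide a window over the access trace; every pair of functions
        # seen in the same window gets an arc, weighted by co-occurrence
        arcs = Counter()
        for i in range(len(trace) - window + 1):
            for pair in combinations(set(trace[i:i + window]), 2):
                arcs[tuple(sorted(pair))] += 1
        return arcs

    trace = ["A", "B", "C", "B", "D", "E", "D"]
    print(sequence_graph(trace, window=3))   # window=2 ~ plain call adjacency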
So how do you coalesce this into an ordering? The algorithm is described in a paper by Pettis and Hansen from HP, who used it on undirected callgraphs.
Take all the arcs in the graph and sort them by weight. Take the top arc, the one with the highest weight, and contract its two endpoints. Nat then showed a graph to demonstrate this: a and c were joined by the heaviest arc, so they are contracted into node ac, and so on. Eventually you have one node left, acedb, which gives you your ordering. This gives no real notion of distance, no knowledge of how close two functions should be, just "as close as possible". For distance, someone at SGI recommended simulated annealing. It has its origins in metallurgical annealing: you want to strengthen metals and minimize grain boundaries. Brittle metal has clear grain boundaries where the metal can break. At high temperatures, the grains move around; cool it slowly and the grains fall into a minimum, nature's optimization.
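Stepping back to the contraction algorithm, a rough sketch (my reconstruction, not the actual grope code; the arc weights are invented so the example collapses to acedb as on Nat's slide):

    from collections import Counter

    def pettis_hansen(arcs):
        # each node starts as a chain containing just itself
        chains = {n: (n,) for pair in arcs for n in pair}
        weights = Counter()
        for (a, b), w in arcs.items():
            weights[frozenset((a, b))] += w
        while weights:
            pair = max(weights, key=weights.get)           # heaviest arc
            u, v = sorted(pair)
            chains[u + v] = chains.pop(u) + chains.pop(v)  # contract u and v
            del weights[pair]
            redirected = Counter()            # re-aim arcs at the new node
            for p, w in weights.items():
                p = frozenset(u + v if n in (u, v) else n for n in p)
                if len(p) == 2:
                    redirected[p] += w
            weights = redirected
        return [f for chain in chains.values() for f in chain]

    arcs = {("a", "c"): 10, ("a", "b"): 4, ("c", "e"): 8,
            ("e", "d"): 6, ("b", "d"): 2}
    print(pettis_hansen(arcs))   # ['a', 'c', 'e', 'd', 'b']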
So some physicists applied it to other systems. If you have a system you want to optimize: tweak it slightly and ask if it's better; if it is, switch to it; if it isn't, maybe switch to it anyway; then do it again.
First, you need to be able to tell whether or not the system is "better", so you need an energy function: if the new state has lower energy, switch to it; if higher, switch to it on a certain probability basis.
Start at a really high temperature, where you will have a higher probability of moving to a higher energy, then move to lower temperatures, where you will tend to move towards lower energy.
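In code, the loop he described comes out something like this (a generic sketch; step() and energy() stand in for the ordering-specific pieces discussed below, and the temperature schedule is invented):

    import math
    import random

    def anneal(state, energy, step, t_start=10.0, t_end=0.01, cooling=0.95):
        t, e = t_start, energy(state)
        while t > t_end:
            candidate = step(state)        # tweak the system slightly
            e_new = energy(candidate)
            # lower energy: always switch; higher energy: switch with a
            # probability that shrinks as the temperature drops
            if e_new < e or random.random() < math.exp((e - e_new) / t):
                state, e = candidate, e_new
            t *= cooling                   # slowly cool
        return state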
So how do you apply this to computer systems? You need: 1) a random step function, which is very easy to do, though it needs the ability to insert a hole (not done yet), and 2) an energy function.
The energy function is very cool because it allows you to handle all sorts of special cases. So he went back to the sequence graph. It would be best if functions joined by high-affinity arcs end up on the same page. So if the two ends of an arc are on the same page, the energy is 1; if they are on two pages, the energy is 2; if they are scattered all over the place, the energy is very high ... Then you can add tweaks for cache design.
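A sketch of such a page-based energy function (my illustration of the idea, not grope's actual code; it plugs straight into the anneal() loop sketched above):

    PAGE_SIZE = 4096

    def energy(ordering, sizes, arcs):
        # map each function in the proposed layout to a page number
        page, offset = {}, 0
        for fn in ordering:
            page[fn] = offset // PAGE_SIZE
            offset += sizes[fn]
        # each arc costs 1 on a shared page, more as its ends spread out,
        # scaled by the arc's affinity weight
        return sum(w * (abs(page[a] - page[b]) + 1)
                   for (a, b), w in arcs.items())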
He worked on simulated annealing for a while. It works, but it is very slow (much like the metallurgical process). He learned things like: taking only downhill steps actually kills you. His long-term contribution is that he knows a lot about the architecture and can contribute a great energy function someday.
How do you use grope?
A code maintainer can gather profiling data, compute the function orderings and provide information so that the executable is ordered by the linker. The user compiles normally. However, the maintainer has to compile twice.
SGI uses pixie to get around the problem of compiling the program twice: you can just relink to turn on profiling, and the linker will redirect function calls. He hasn't done anything quite so nice yet.
The post-link optimizer operates on functional executables; it needs to optimize them without relinking. It requires a linker modification that causes the linker to emit modification data into the executable, so you don't have to keep all the object files around.
What is the status?
The link-time optimizer is fully working: 30% less memory, better than twice as fast on many machines.
The ordering algorithms need tuning. The post-link optimizer needs debugging. It will be released soon; he has a paper with more detailed information.
Questions
How do you feel about dynamic linking?
Can we optimize dynamic libraries? Yes, we can. Not for specific programs, but for general use; you must gather profiling data from a lot of programs.
How parallelizable is the energy function?
Not very. You want to traverse the landscape in a random manner, and the landscape is huge.
How many arcs in gcc?
Doesn't remember, but he could look it up. He has some graphs, etc.
Is there a mechanism to manually specify special cases?
I was working on stuff like that a week ago.
Have you found many cases where that is needed?
Yes, you may have a case where you know information and you don't want to wait while the simulated annealer figures it out.
You want to optimize differently for different CPUs. Are you going to make it easy to tack on different CPUs, and what needs to be done for this?
Yes, that would go into the energy function.
In spite of the callgraph, you may want to optimize something that doesn't even get called as much. Are there mechanisms for this?
Yes, as mentioned above.
Comment: simulated annealing is amenable to heuristics; you may be able to cut down your annealing time.
Yes, you may be able to optimize by starting with a likely ordering scheme.
Have you considered Tabu search?
Don't know what it is.
It is a further application of heuristics to simulated annealing.
Comment: reordering the kernel is worth trying, but with 4MB pages, you don't get as much benefit from the optimization.
How long does it take to run gprof, and how long for the annealing?
Maybe a minute for gprof.
In terms of parallelizing it more, have you looked at genetic orderings?
He mentioned it, got a disgusted look from ??, so he hasn't looked at it.
Final note: the room did fill up totally, with a few people standing at the back ...