If you haven't got a clue what this is all about, read Sections 1.1, 1.2 and then 6.
If you want to install a binary build and then use it, read Section 4.1 and then Section 2. You'd be well advised to look briefly at Section 6 since that's a quick way to get some idea what you'll get for your efforts.
If you want to compile it from source, read Sections 4.2 and 4.3. Cacheprof is written almost entirely in Haskell, and those sections tell you how to get hold of a compiler for it.
If you don't want to rebuild it all from source, but instead want to reconfigure a binary distribution to simulate a different cache, you don't need a Haskell compiler, just gcc. Read Section 2.4.
Otherwise, you'll have to bite the bullet and decide for yourself
what to read. It's a hard life :-)
Numerous other tools are available to tell you about the miss rates of programs. For the most part, however, they are aimed at people trying to design better cache hardware. For example, many such tools allow you to simulate multiple cache configurations at once, which is helpful if you are trying to compare the merits of different cache designs.
Cacheprof is different: it is targeted at programmers and algorithm designers, with the intention that you understand, and hopefully improve, the cache behaviour of your programs. You "improve" your program by reducing the number of capacity misses and, to a lesser extent, conflict misses. It seems reasonable to assume that improvements on one cache arrangement will carry over to other cache arrangements, even though the absolute miss rates, etc, may differ. Cacheprof allows you to simulate a variety of caches, although this is not its primary intention.
Cacheprof only profiles data reads and writes. There's no attempt
whatsoever to profile instruction cache effects. For many programs,
miss rates for instruction traffic are insignificant compared to those
for data traffic. Most machines have split I and D primary caches,
so I references won't mess up the D-cache, or vice versa.
You compile your programs in the normal way (ie, with gcc/g++/g77), but under the supervision of cacheprof. A cache simulator is linked into the resulting executable.
When the program runs, all (data) references are trapped and sent to the cache simulator. After execution finishes, detailed information is dumped into the file cacheprof.out. An auxiliary program, cacheprint, reads cacheprof.out and prints summaries for the whole program and for each instrumented procedure. It also prints annotated versions of source files specified on the command line.
Section 6 shows an example of use.
The system works well enough to run large programs. I've cacheprof'd the compiler-proper ("cc1") in gcc-2.95.1; that builds and runs without difficulty. All 18 of the benchmarks in the SPEC CPU95 (8 in C, 10 in Fortran77) suite run fine -- that's quite a lot of code. bzip2 of course runs fine.
Because cacheprof instruments assembly code, you should be able to profile C++, Fortran77 or any other language that the GCC suite will compile. I haven't actually tested it on anything except C and Fortran, but I plan to do so soon. The disadvantage of instrumenting assembly code is that you're tied to a particular architecture -- x86 (AT&T syntax) -- although in the longer term I would like to port cacheprof to RISC architectures. Oh, and to the processor formerly known as Merced.
Profiled code is 5 to 6 times larger than normal and runs 20 to 30 times slower than normal, since every memory reference is trapped and sent to the cache simulator. Cacheprof tries very hard to detect and instrument every single memory access, right down to the implicit stack reads and writes created by CALL and RET instructions. REP-prefix instructions are too hard to instrument directly, so they are first translated to equivalent sequences of simpler instructions; eventual profiling results (of course) pertain to the original version, not the simplified one.
Some folks just want to know instruction, read and write counts, and do not care about cache misses. For these folks I have provided a less expensive form of profiling, which I call "level 1" profiling. Level 1 profiled binaries are "only" twice as big, and 10-15 times slower than normal. The default profiling level is level 2, which gives the full cache simulation discussed above. You have to ask specially for level 1 profiling. You can't mix level-1 and level-2 profiling; if you do, the resulting executable will print an error message and refuse to do anything else.
You can link unprofiled code into your executables. Counts pertaining to it will not show up in the results, but everything should work fine. Indeed, since you have to link at least the C library into pretty much everything, you're obliged to link some unprofiled code into each executable. For more accurate results, it would be best to build cacheprof'd versions of the C library and its friends (libm et al), but I haven't tried that. Yet.
The default cache arrangement is a 32 kbyte cache, with 32 byte lines and 2-way set associativity. I chose that since that's (allegedly) how the level 1 D-cache of my AMD K6-2 is arranged. Cacheprof can simulate a cache with any combination of the following parameters: size 1k, 2k, 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M, 2M, 4M, 8M, line size of 8, 16, 32, 64 or 128 bytes, and associativities of 1 (direct-mapped), 2, 4 and 8. That covers 99% of the caches you'll encounter in practice. For associative caches, an LRU replacement policy is simulated.
To change the cache arrangement, edit the Makefile and do 'make cacheclean ; make'. You'll then need to re-link your programs.
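For orientation: a cache of size S bytes, line size L and associativity A has S/(L*A) sets. Here is a tiny illustration -- mine, not part of cacheprof -- of how an address decomposes under the default 32k, 32-byte-line, 2-way arrangement:

   /* Illustration only, not part of cacheprof: how an address maps onto
      the default 32 kbyte, 32-byte-line, 2-way cache.  Number of sets =
      32768 / (32 * 2) = 512, so 5 offset bits and 9 index bits. */
   #include <stdio.h>

   int main ( void )
   {
      unsigned long addr   = 0x8049f3cUL;         /* an arbitrary example address */
      unsigned long offset = addr & 31;           /* byte within the 32-byte line */
      unsigned long set    = (addr >> 5) & 511;   /* which of the 512 sets        */
      unsigned long tag    = addr >> 14;          /* compared against both ways   */
      printf ( "offset=%lu set=%lu tag=0x%lx\n", offset, set, tag );
      return 0;
   }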
The situation with cache misses is more complicated, since the cache effectively records the recent memory history of the entire system. Imagine a program which rapidly alternates between two procedures, P1 and P2. Each procedure fills the cache up with its own stuff every time it gets called. That means that both P1 and P2 will have high miss rates. If the program is recompiled so that P1 is instrumented but P2 is not, P1 may now be charged with a substantially lower miss rate, because the simulator never sees P2 throwing P1's stuff out of the cache, and therefore never sees P1 dragging it all back in.
Moral: cache miss behaviour is a global property, and it is misleading to think of P1 or P2 as simply "causing cache misses" in isolation; rather, we should regard them as mutually conflicting. You have been warned :-)
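To make the P1/P2 story concrete, here is a minimal sketch (again mine, not from the cacheprof sources) of a program with exactly that structure; each procedure walks a buffer as large as the default simulated cache:

   /* Illustration only.  Each buffer is as large as the default 32k
      simulated cache, so every call to p1 evicts whatever p2 loaded, and
      vice versa: instrument both and each is charged a miss for every
      cache line it touches, on every call.  Leave p2 uninstrumented and,
      after the first iteration, buf1 appears to stay resident, so p1 is
      charged almost nothing. */
   #include <stdio.h>

   #define CACHE_BYTES 32768

   static char buf1[CACHE_BYTES];
   static char buf2[CACHE_BYTES];

   static long p1 ( void )
   {
      long sum = 0;
      int  i;
      for (i = 0; i < CACHE_BYTES; i++) sum += buf1[i];
      return sum;
   }

   static long p2 ( void )
   {
      long sum = 0;
      int  i;
      for (i = 0; i < CACHE_BYTES; i++) sum += buf2[i];
      return sum;
   }

   int main ( void )
   {
      long total = 0;
      int  n;
      for (n = 0; n < 1000; n++)
         total += p1() + p2();
      printf ( "%ld\n", total );
      return 0;
   }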
Executive summary: omission of parts of the program from instrumentation decreases instruction and reference counts in expected ways, but may have non-obvious effects on miss rates.
I reckon that you are fairly safe, all in all, if all your code is instrumented,
and your program spends 99% of its time in it. If it spends a lot
of time in the C library or in system calls, you're on dodgy ground.
Cacheprof passes the -g flag to gcc/g++/g77, even if you didn't ask for it, so that it can figure out which parts of the assembly code relate to which parts of the original source program. If you specify -O or -O2 as well, you get the same kind of slightly illogical source mapping as you do when single-stepping code compiled -O -g. As with debugging, if you need exact line-number correspondence, turn off -O. Since miss counts are largely an algorithmic property, compiling without -O dramatically increases the instruction, read and write counts, but has almost no effect on the miss counts.
Cacheprof instruments all reads and writes, even those setting up and clearing stack frames. This fools GDB, which seems to detect procedure boundaries at least partly by looking for prologue and epilogue sequences it knows about. The results of GDBing a cacheprof'd binary are ... well, complete chaos.
Cacheprof tries to preserve the behaviour of correct programs. If your code reads uninitialised memory, writes beyond the end of arrays or otherwise shows signs of moral and spiritual decay, it may well behave differently when instrumented. I spent hours searching for a (non-)bug in cacheprof for this very reason. Remedy (and good general advice): check for bad behaviour using Purify (real-money software, alas), or Tristan Gingold's GPLd Checker (www.gnu.org); the latter seems to work on x86-linux with gcc-2.95.1, at least.
I don't guarantee to cover all the x86 opcodes that GCC will ever spew out. If you're unlucky enough to discover one, cacheprof (cacheann, actually) will die thusly, usually on some way-out mutant floating-point opcode:
(stdin):3529: syntax error on `fild -2(%ebp)'
or it might die complaining of an "unhandled instruction set artefact". If so, do mail me (jseward@acm.org) at least the message, and preferably the entire assembly input, so I can fix it. Alternatively, fix it yourself: the above message means that the fild opcode is not implemented. Add the opcode to the Opcode data definition in Arch_x86.hs, give it an entry in the big table which follows, and then do 'make cleanish && make' to rebuild. "Unhandled instruction set artefact" and other complaints are only a little harder to fix yourself.
The driver might well complain about "can't map flags" or, worse, misbehave, if you pass obscure flags to your compiler. Please mail me if that happens.
I'm sure there are other limitations, but I can't remember them right
now. Today is 29 November 1999; just enough time left to slip in
a subtle Y2K bug of some kind :-)
Suppose that, at present, your program is compiled and linked like this:

   gcc -Wall -O -I/path/to/includes -c foo.c
   gcc -Wall -O -I/path/to/includes -c bar.c
   gcc -o xyzzy foo.o bar.o -L/path/to/libs -ldizzylizzy
Change your Makefile (or whatever) so that this happens instead:
   cacheprof gcc -Wall -O -I/path/to/includes -c foo.c
   cacheprof gcc -Wall -O -I/path/to/includes -c bar.c
   cacheprof gcc -o xyzzy foo.o bar.o -L/path/to/libs -ldizzylizzy
This example links in the library libdizzylizzy, which you should of course have previously built with cacheprof, although the program will still work if you haven't. See Section 1.5 for a discussion of what happens if you don't profile some parts of your program. Cacheprof hasn't been tested for compiling position-independent code, so you'd be best off compiling libdizzylizzy statically, ie to libdizzylizzy.a.
You then run your program in the usual way. It should produce the message
==cacheprof== level-2 instrumented program: startup
and, on completion, a summary, like this:
==cacheprof== level-2 instrumented program: run complete
==cacheprof== 1,259,420,409 insns
==cacheprof== 209,716,905 refs (703 rd + 209,716,202 wr)
==cacheprof== 117,965,302 misses (400 rd + 117,964,902 wr)
==cacheprof== 93.05 seconds, 13.53 MIPS
Detailed information from the run is dumped into the file cacheprof.out. It's a text file; you can read it if you really like, although you won't be much the wiser. (Note: the miss rate for this example program is atypically large, around 50%. More reasonable workloads are closer to the 5% mark).
Run cacheprint in the directory where cacheprof.out exists to see whole-program and per-procedure summaries. To see annotated source code, pass the names of the source files in question to cacheprint, for example: cacheprint foo.c bar.c. Cacheprint doesn't have any other options -- it's simple to use. Source lines which generate memory accesses are annotated with two numbers: the number of references and the number of misses.
Cacheprint will attempt to guess where procedures begin in source files, and print a summary for the procedure at that point. Its guessing heuristic is pretty ropey and sometimes gets it a few lines too early. If you have inlined functions in your program, the results will be utterly confusing (but still correct, I hope).
Cacheprof.out is created afresh after every run. Profiled binaries
also append a single-line summary to cacheprof.out.summary. This
is an experimental hack facilitating accumulation of costs over multiple
runs.
Flags intended for cacheprof itself go before the name of the compiler, for example:

   cacheprof --oink --moo gcc -Wall -O -I/path/to/includes -c foo.c
Allowable flags are:
Gprof-style profiling can't be combined with cacheprof-ing. If you specify -p or -pg, cacheprof will barf. Also, cacheprof needs to be able to sneakily grab the assembly output of your compiler, so -pipe isn't allowed either.
You can't debug cacheprof'd code, because the instrumentation confuses GDB. Cacheprof doesn't disallow the -g flag, but it will ignore it.
For all other flags, cacheprof tries to guess which particular phase
to pass the flag to. If it can't, it gives the flag to all phases,
and prints a warning message of the form "can't map flag ...". If
this happens frequently to you, edit the flag_mapping_table in CacheProf.hs
and recompile -- it isn't hard to figure out what to say.
Traditional gcc, g++ and g77, as we know and love them, translate input files through a sequence of phases, stopping at a point which you can select with a flag. For example, -E means "stop after preprocessing", -S stops after generating assembly, and -c produces object code.
Cacheprof makes gcc/g++/g77 behave as if there were a new stage, which lives between the normal assembly (-S) and object code (-c) phases: annotated assembly. The relevant flag is -SA, and the relevant file extension is .sA. You can do all the expected things, for example:
   cacheprof gcc -SA foo.c         -- produces foo.sA
   cacheprof gcc -c foo.sA bar.s   -- produces foo.o and bar.o
What an amazing system :-)
To generate a simulator for a different cache, run cachegen, giving it the cache size in bytes, the line size in bytes, and the associativity, for example:

   cachegen 32768 32 2
and this writes, to stdout, a file which implements the specified cache, using an LRU replacement algorithm. To reconfigure cacheprof to use this new cache, first do 'make cleancache', then copy cachegen's output into a file called cachesim_doref.c, then do 'make cachesim.o', and finally re-link your program. You don't need a Haskell compiler to do any of this.
Alternatively, edit the params in the Makefile, and do 'make cleancache ; make cachesim.o'.
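If you're curious what the generated simulator does, its heart is conceptually along these lines. This is a hand-written sketch of a 32k, 32-byte-line, 2-way LRU cache, not actual cachegen output; the real function also charges misses to a cost center (see below) rather than a global counter, and its name and signature will differ:

   /* Sketch of a reference function for a 32k, 32-byte-line, 2-way LRU
      cache: 32768 / (32 * 2) = 512 sets.  (A real simulator would also
      initialise the tags to an impossible value; zero-initialised tags
      are good enough for a sketch.) */
   #include <stdio.h>

   #define LINE_BITS 5
   #define SETS      512

   typedef struct { unsigned long tag[2]; } Set;   /* tag[0] is the MRU way */

   static Set           cache[SETS];
   static unsigned long total_misses = 0;

   static void cache_ref_sketch ( unsigned long addr )
   {
      unsigned long lineaddr = addr >> LINE_BITS;
      Set*          s        = &cache[lineaddr % SETS];

      if (s->tag[0] == lineaddr) return;        /* hit in the MRU way          */
      if (s->tag[1] == lineaddr) {              /* hit in the LRU way: promote */
         s->tag[1] = s->tag[0];
         s->tag[0] = lineaddr;
         return;
      }
      total_misses++;                           /* miss: evict the LRU way     */
      s->tag[1] = s->tag[0];
      s->tag[0] = lineaddr;
   }

   int main ( void )
   {
      unsigned long a;
      for (a = 0x10000; a < 0x20000; a += 4)    /* walk 64k of 4-byte words... */
         cache_ref_sketch(a);
      printf ( "misses = %lu\n", total_misses ); /* ...one miss per 32-byte line,
                                                    2048 in all                */
      return 0;
   }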
Cacheprof takes care to ask the compiler(s) proper to emit debugging information into the assembly, even if you didn't ask for it. Cacheann uses this info to figure out, for each instruction which does a read or write, which source file, line and procedure to "send the bill" to. The debugging info is removed from the final annotated assembly, since, as previously discussed, GDB can't debug annotated code. If, for some reason, debugging info isn't available, cacheann can usually guess the current file and function name from labels and other artefacts in the assembly source.
At the final linkage stage, cacheprof links in cachesim.o, which contains startup and shutdown routines, and the cache simulator proper.
Each instruction in the instrumented assembly which reads or writes memory has, associated with it, a "cost center" (try --ddump-ccs to see them). Just before each memory-touching instruction, cacheann inserts a call to an intermediate piece of link code, in cacheprof_hooks2_x86.s, passing as parameters the address of the memory to be read/written and the address of the cost center associated with this instruction. The link code saves the machine state on the C stack and calls into the cache simulator proper, again passing the address and cost center. The simulator updates the simulated cache. If a miss results, the miss count in the supplied cost center is incremented.
Control then returns to the link code, which restores the machine state and returns to the program proper. It's important to understand that the simulator does not, and cannot, simulate the data contents of the cache, so it can't supply the result of a memory read. That is done simply by executing the instruction after returning from the link code.
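In C terms, the bookkeeping amounts to roughly the following sketch. It is an illustration only: the real cost-center layout and names are different, the simulator is stubbed out here, and in cacheprof the state saving and the call happen in assembly, not in C:

   /* Sketch of a cost center, not the real data structure.  One is
      statically allocated for each memory-touching instruction in an
      instrumented object file; file/function/line identification lives
      alongside, for cacheprint. */
   #include <stdio.h>

   typedef struct {
      unsigned long refs;      /* times this instruction touched memory */
      unsigned long misses;    /* how many of those touches missed      */
   } CostCenter;

   /* Stand-in for the simulator proper (really in cachesim.o): it would
      update the simulated cache and return nonzero on a miss. */
   static int cachesim_stub ( void* addr ) { (void)addr; return 0; }

   /* Conceptually, what happens just before each load or store: the link
      code passes the effective address and the instruction's cost center. */
   static void hook ( void* addr, CostCenter* cc )
   {
      cc->refs++;
      if (cachesim_stub(addr))
         cc->misses++;
   }

   int main ( void )
   {
      static CostCenter cc_for_some_store;     /* hypothetical example     */
      int x = 42;
      hook ( &x, &cc_for_some_store );         /* as if instrumenting the
                                                  store performed by "x = 42" */
      printf ( "refs=%lu misses=%lu\n",
               cc_for_some_store.refs, cc_for_some_store.misses );
      return 0;
   }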
When execution finishes, all cost centers and associated file, function and line-number information is dumped into cacheprof.out, in one great big mess. Cacheprint reads this and merges together information from all instrumented object files, so that it has a picture of the program's behaviour as a whole.
The mechanism is critically dependent on the fact that a call up to
the cache simulator is entirely transparent to the original program, providing
that the link code carefully saves and restores enough of the machine state.
On x86 we can just about get away with this because there are so few integer
registers, and none of the FP registers need be saved. For almost
all other architectures, a save/restore at each memory reference will be
prohibitively expensive, so in the long term it is probably better for
the straight-line code to dump (address, cost-center) pairs into a smallish
trace buffer and only call out-of-line to the simulator when the trace
buffer is full. This is all standard stuff in cache-simulation circles.
Unfortunately this will probably increase the size of instrumented code
still further.
You should be able to compile cacheprof with any compiler which implements the Haskell98 language standard correctly. The available candidates are: the Glasgow Haskell Compiler (ghc), version 4.04pl1 (or later; perhaps 4.06 by the time you read this), the Chalmers Haskell Compiler (hbc), version 0.9999.5b, and Niklas Rojemo's / Malcolm Wallace's nhc98 compiler, version 1.0pre13 or later.
I recommend ghc for production use; it produces fast code and has an unobtrusive run-time system. hbc also produces fast code, but unfortunately seems to have a bug which causes cacheann to sometimes loop, so I can't recommend it, alas. nhc98 produces smaller and slower binaries, but is very portable and easy to install. I suggest you use it to get started if you have difficulties installing ghc.
I should perhaps mention that my day job is as a ghc developer. If you encounter difficulties with ghc, mail me or some other member of our Merry Crew.
If you plan to do any serious tinkering with Cacheprof, you really ought to install the excellent interactive Haskell interpreter called "Hugs98". A common way to develop Haskell is to work inside Hugs, compiling to a standalone binary with one of the above compilers only when a production version is needed. Hugs98 is very portable, easy to install, and runs on most platforms.
All four systems are open-source of one kind or another (ghc, hbc and
hugs have BSD-style licenses), and all have binary distributions available
for at least x86-linux.
Either you know Haskell, in which case you won't want to translate into any other language, or you don't know Haskell, in which case by the time you find out enough about it, you'll find it a lot quicker to make the mods you need in Haskell rather than translate to C++/whatever.
Seriously, cacheprint has to deal with huge volumes of data and might benefit from a good C implementation. For the other two (cacheann, cacheprof), forget it. Of course, if you have some Worthy Proposal, like building this functionality directly into gcc, I would lose my Bad Attitude and instead have a Very Good Attitude Indeed.
#include <stdio.h>

#define SIZE 1024

int arr[SIZE][SIZE];

void cmTraverse ( void )
{
   int i, j;
   for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
         arr[i][j] = (i + j);
}

void rmTraverse ( void )
{
   int i, j;
   for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
         arr[j][i] = (i + j);
}

int main ( int argc, char** argv )
{
   printf ( "hello, world!\n" );
   cmTraverse();
   rmTraverse();
   return 0;
}
The program traverses a big square array in both row-major and column-major fashion. cmTraverse and rmTraverse look almost identical, but their cache friendliness is dramatically different. We now compile and run the program in the normal way, except that the compile command is presented to cacheprof:
sewardj@muraroa:~$ cacheprof gcc -O -Wall -o example example.c
sewardj@muraroa:~$ ./example
==cacheprof== level-2 instrumented program: startup
hello, world!
==cacheprof== level-2 instrumented program: run complete
==cacheprof== 12,594,205 insns
==cacheprof== 2,097,171 refs (9 rd + 2,097,162 wr)
==cacheprof== 1,179,654 misses (3 rd + 1,179,651 wr)
==cacheprof== 0.93 seconds, 13.54 MIPS
The program prints some handy summary stats, and dumps detailed info into cacheprof.out. Running 'cacheprint example.c' then gives the following (you may need to make your window wider, and get out your magnifying glass):
12,594,205 instructions, 2,097,171 references, 1,179,654 misses.

  SIZE       RD-&-WR       %       MISSES       %     RATE
  ---------------------------------------------------------------
  TOTAL:   2,097,171   100.%    1,179,654   100.%    56.2%
  1 byte           0    0.0%            0    0.0%     NaN%
  2 byte           0    0.0%            0    0.0%     NaN%
  4 byte   2,097,171   100.%    1,179,654   100.%    56.2%
  8 byte           0    0.0%            0    0.0%     NaN%

  SIZE       READS         %       MISSES       %     RATE
  ---------------------------------------------------------------
  TOTAL:           9   100.%            3   100.%    33.3%
  1 byte           0    0.0%            0    0.0%     NaN%
  2 byte           0    0.0%            0    0.0%     NaN%
  4 byte           9   100.%            3   100.%    33.3%
  8 byte           0    0.0%            0    0.0%     NaN%

  SIZE       WRITES        %       MISSES       %     RATE
  ---------------------------------------------------------------
  TOTAL:   2,097,162   100.%    1,179,651   100.%    56.2%
  1 byte           0    0.0%            0    0.0%     NaN%
  2 byte           0    0.0%            0    0.0%     NaN%
  4 byte   2,097,162   100.%    1,179,651   100.%    56.2%
  8 byte           0    0.0%            0    0.0%     NaN%
1 source files, 3 function names, and 21 program points are mentioned in `cacheprof.out'.

  FUNCTION NAME            RD-&-WR      %   cum-%      MISSES      %   cum-%    RATE
  --------------------------------------------------------------------------------------
  rmTraverse(example.c)  1,048,583  49.9%   49.9%   1,048,579  88.8%   88.8%   99.9%
  cmTraverse(example.c)  1,048,581  49.9%   99.9%     131,074  11.1%   99.9%   12.5%
  main(example.c)                7   0.0%   100.%           1   0.0%   100.%   14.2%

The remaining 0 functions contributed less than 0.01% of the reads-&-writes, and less than 0.01% of the misses.
--------------------------------------------------------------------------------------------
--- Annotated source: example.c ------------------------------------------------------------
--------------------------------------------------------------------------------------------

                            #include <stdio.h>
                            #define SIZE 1024

                            int arr[SIZE][SIZE];

                            void cmTraverse ( void )
          2           0     {
                               int i, j;
                               for (i = 0; i < SIZE; i++)
                                  for (j = 0; j < SIZE; j++)
  1,048,576     131,073              arr[i][j] = (i + j);
          3           1     }

                            void rmTraverse ( void )
          3           1     {
                               int i, j;
                               for (i = 0; i < SIZE; i++)
                                  for (j = 0; j < SIZE; j++)
  1,048,576   1,048,576              arr[j][i] = (i + j);
          4           2     }

                            int main ( int argc, char** argv )
          1           1     {
          2           0        printf ( "hello, world!\n" );
          1           0        cmTraverse();
          1           0        rmTraverse();
                               return 0;
          2           0     }
The sordid outcome of this investigation is that the row-major traversal is far worse than the column-major traversal (I think I have the row- and column- descriptions the wrong way round, but it doesn't matter). The particular array sizes were chosen so as to illustrate the worst-possible behaviour for the simulated cache, which is the L1 data cache of an AMD K6-2. The miss rates differ by a factor of 8, which is no surprise: the array elements are 4-byte ints and the simulated lines are 32 bytes, so the sequential traversal misses on only about one write in eight (12.5%), whereas the strided traversal, jumping 4096 bytes per write, misses on essentially every one. Running the (uninstrumented) program 100 times on a real K6, the standard gprof profiling tool indicates the following:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 68.64     11.93    11.93      100   119.30   119.30  rmTraverse
 31.36     17.38     5.45      100    54.50    54.50  cmTraverse
  0.00     17.38     0.00        1     0.00 17380.00  main
ie the real timing ratio is about 2.2:1 -- not as dramatic as the 8:1 difference in miss rates, probably since (1) executing instructions also takes time (alas :-), and (2) both procedures also miss badly on my weedy (Socket-7-typical) 512k second-level cache (arr is 4 megabytes in size). Perhaps I should have simulated that instead. Ah well.
Finally, timing the uninstrumented program shows it runs in about 0.17
second, and since we know it runs for 12.594 million instructions, the
execution rate is about 72 MIPS, which is pretty appalling for a 350MHz
K6-2. That just confirms that this program does very badly in the
cache -- by comparison, most "normal" programs run at between 150 and 250
MIPS on this machine. (note. MIPS == Meaningless Indication
of Processor Speed. Never forget that).
I started to look around for open-source tools which I could use to quantify memory/cache effects in bzip2, and was totally amazed not to find any. Having read many papers about sorting algorithms (the slow part of bzip2) I was also surprised to discover that the (academic) algorithm community places a lot of emphasis on developing algorithms with lower instruction counts, but almost no emphasis on designing more cache-efficient algorithms. I find this attitude quite illogical, given the relative costs of instructions vs cache misses on contemporary architectures.
For a long time I mused on how to build a cache profiler, but did nothing.
In early summer '99, I knocked up a new back end to the lcc C compiler. The back end emitted C (bizarrely), but instrumented so that memory references were passed to a cache simulator. It was a terrible hack, but it worked well enough to tell me interesting things about bzip2. For me, it "proved the concept", but wasn't a satisfactory solution. It was tied to C, to the specific C compiler, and didn't catch all references. I wasn't sure how to proceed.
Later in the summer, my excellent colleague Simon Marlow suggested that
it would be much simpler to parse and annotate assembly code. This
was the critical idea which made cacheprof practical. Many weekends
and late nights later, this is the result.