Cacheprof, version 0.1 (snapshot 991209)

jseward@acm.org
http://www.cacheprof.org
Copyright (c) 1999 Julian Seward
An open-source tool for investigating cache effects in programs
Cacheprof is licensed under the GNU General Public License, version 2


Contents of this manual

0  How to avoid reading this manual

1  Introduction

1.1  What it does
1.2  How you use it
1.3  Stability and performance
1.4  Are the results meaningful?
1.5  Other limitations of the implementation
1.6  The license

2  How to use it

2.1  Basics
2.2  Options for cacheprof
2.3  The annotated assembly (.sA, -SA) stage
2.4  Manufacturing new caches with cachegen

3  How it works

4  How to install it

4.1  Installing pre-built binaries
4.2  Building from source
4.3  Haskell resources
4.4  How to contribute

5  If you have problems

6  An example

7  How cacheprof came to be



0  How to avoid reading this manual

Since it is not in human nature to RTFM, here are some short-circuit hints.

If you haven't got a clue what this is all about, read Sections 1.1, 1.2 and then 6.

If you want to install a binary build and then use it, read Section 4.1 and then Section 2.  You'd be well advised to look briefly at Section 6 since that's a quick way to get some idea what you'll get for your efforts.

If you want to compile it from source, read Sections 4.2 and 4.3.  Cacheprof is written almost entirely in Haskell, and those sections tell you how to get hold of a compiler for it.

If you don't want to rebuild it all from source, but instead want to reconfigure a binary distribution to simulate a different cache, you don't need a Haskell compiler, just gcc.  Read Section 2.4.

Otherwise, you'll have to bite the bullet and decide for yourself what to read.  It's a hard life :-)
 

1  Introduction

1.1  What it does

Cacheprof is a tool to help you find out where and why your program causes cache misses.  Cacheprof will annotate source code with profiling results on a line-by-line basis, so you can find out precisely which lines of code are causing misses.  This is tremendously useful in understanding the cache friendliness or otherwise of programs and algorithms.  Each source line can be annotated with reference and miss counts.  Counts are also presented on a per-function basis.  For the program as a whole, cacheprof records the number of reads, writes, read misses and write misses, further categorised by whether they are 1, 2, 4 or 8 byte references.  Finally, cacheprof also counts the number of instructions executed.  All counters are 64-bit, so you don't need to worry about overflows.

Numerous other tools are available to tell you about miss rates of programs.  For the most part, however, they are oriented at people trying to design better cache hardware.  For example, many such tools allow you to simulate multiple cache configurations at once, which is helpful if you are trying to compare the merits of different cache designs.

Cacheprof is different: it is targeted at programmers and algorithm designers, with the intention that you understand, and hopefully improve, the cache behaviour of your programs.  You "improve" your program by reducing the number of capacity and, to a lesser extent, conflict misses.  It seems reasonable to assume that improvements on one cache arrangement will carry over to other cache arrangements, even though the absolute miss rates, etc, may differ.  Cacheprof allows you to simulate a variety of caches, although this is not its primary intention.

Cacheprof only profiles data reads and writes.  There's no attempt whatsoever to profile instruction cache effects.  For many programs, miss rates for instruction traffic are insignificant compared to those for data traffic.  Most machines have split I and D primary caches, so I references won't mess up the D-cache, or vice versa.
 

1.2  How you use it

Cacheprof is designed to be as non-intrusive as possible.  It works with the standard GNU toolchain, on x86 platforms.  You don't need to rebuild or reconfigure any of the tools, but you will need to recompile programs in order to profile them.

You compile your programs in the normal way (ie, with gcc/g++/g77), but under the supervision of cacheprof.  A cache simulator is linked into the resulting executable.

When the program runs, all (data) references are trapped and sent to the cache simulator.  After execution finishes, detailed information is dumped into the file cacheprof.out.  An auxiliary program, cacheprint, reads cacheprof.out and prints summaries for the whole program and for each instrumented procedure.  It also prints annotated versions of source files specified on the command line.

Section 6 shows an example of use.
 

1.3  Stability and performance

Cacheprof is not a toy.  I developed it because I was frustrated at the lack of cache profiling tools when developing the data compressor bzip2.  In vain did I cast around the Internet looking for open-source cache profilers!  Eventually I gave up in disgust and made my own.  Section 7 contains further historical details.

The system works well enough to run large programs.  I've cacheprof'd the compiler proper ("cc1") in gcc-2.95.1; that builds and runs without difficulty.  All 18 benchmarks in the SPEC CPU95 suite (8 in C, 10 in Fortran77) run fine -- that's quite a lot of code.  bzip2 of course runs fine.

Because cacheprof instruments assembly code, you should be able to profile C++, Fortran77 or any other language that the GCC suite will compile.  I haven't actually tested it on anything except C and Fortran, but I plan to do so soon.  The disadvantage of instrumenting assembly code is that you're tied to a particular architecture -- x86 (AT&T syntax) -- although in the longer term I would like to port cacheprof to RISC architectures.  Oh, and to the processor formerly known as Merced.

Profiled code is 5 to 6 times larger than normal, and runs 20 to 30 times slower than normal, since every memory reference is trapped and sent to the cache simulator.  Cacheprof tries very hard to detect and instrument every single memory access, right down to the implicit stack reads and writes created by CALL and RET instructions.  REP-prefix instructions are too hard to instrument directly, so they are first translated to equivalent sequences of simpler instructions; eventual profiling results (of course) pertain to the original version, not the simplified one.
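
By way of illustration, a REP-prefixed store behaves like the following C loop.  This is an analogy only, not the actual rewriting cacheann performs:

   /* What `rep stosl' does, expressed in C: store the 4-byte value
      in EAX at the address in EDI, ECX times, advancing EDI.  Each
      iteration is one instrumentable 4-byte write. */
   void rep_stosl ( unsigned int* edi, unsigned int eax,
                    unsigned int ecx )
   {
      while (ecx-- != 0)
         *edi++ = eax;
   }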

Some folks just want to know instruction, read and write counts, and do not care about cache misses.  For these folks I have provided a less expensive form of profiling, which I call "level 1" profiling.  Level 1 profiled binaries are "only" twice as big, and 10-15 times slower than normal.  The default profiling level is level 2, which gives the full cache simulation discussed above.  You have to ask specially for level 1 profiling.  You can't mix level-1 and level-2 profiling; if you do, the resulting executable will print an error message and refuse to do anything else.

You can link unprofiled code into your executables.  Counts pertaining to it will not show up in the results, but everything should work fine.  Indeed, since you have to link at least the C library into pretty much everything, you're obliged to link some unprofiled code into each executable.  For more accurate results, it would be best to build cacheprof'd versions of the C library and its friends (libm et al), but I haven't tried that.  Yet.

The default cache arrangement is a 32 kbyte cache, with 32 byte lines and 2-way set associativity.  I chose that since that's (allegedly) how the level 1 D-cache of my AMD K6-2 is arranged.  Cacheprof can simulate a cache with any combination of the following parameters: size 1k, 2k, 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M, 2M, 4M, 8M, line size of 8, 16, 32, 64 or 128 bytes, and associativities of 1 (direct-mapped), 2, 4 and 8.  That covers 99% of the caches you'll encounter in practice.  For associative caches, an LRU replacement policy is simulated.
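
The parameters combine as follows: the number of sets is size / (line size * associativity), so the default cache has 32768 / (32 * 2) = 512 sets, each holding two 32-byte lines.  Here is a minimal sketch of the address-to-set mapping (assumed arithmetic, not cacheprof's actual code):

   /* Which set a byte address falls in, for a cache with the
      given line size and number of sets. */
   unsigned int set_of ( unsigned int addr,
                         unsigned int line_size,
                         unsigned int n_sets )
   {
      return (addr / line_size) % n_sets;
   }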

To change the cache arrangement, edit the Makefile and do 'make cleancache ; make'.  You'll then need to re-link your programs.
 

1.4  Are the results meaningful?

All very nice, but do the numbers mean anything?  If you think they are an accurate representation of what's really going on, you are sadly deluded.  The address sequence presented to the cache simulator differs from that encountered by the real hardware for at least the following reasons, and probably more: uninstrumented code generates no references at all, instruction fetches are never simulated, and the simulator sees nothing of what the kernel or other processes do to the cache.  For instruction, read and write counts, the numbers are accurate for the parts of the program which were instrumented; all other parts contribute nothing.

The situation with cache misses is more complicated, since the cache effectively records the recent memory history of the entire system.  Imagine a program which rapidly alternates between two procedures, P1 and P2.  Each procedure fills the cache up with its own stuff every time it gets called.  That means that both P1 and P2 will have high miss rates.  If the program is recompiled so that P1 is instrumented but P2 is not, P1 may now be charged with a substantially lower miss rate, because the simulator never sees P2 throwing P1's stuff out of the cache, and therefore never sees P1 dragging it all back in.

Moral: cache miss behaviour is a global property, and it is misleading to think of P1 or P2 as simply `causing cache misses' in isolation; rather, we should regard them as mutually conflicting.  You have been warned :-)

Executive summary: omission of parts of the program from instrumentation decreases instruction and reference counts in expected ways, but may have non-obvious effects on miss rates.

I reckon that you are fairly safe, all in all, if all your code is instrumented, and your program spends 99% of its time in it.  If it spends a lot of time in the C library or in system calls, you're on dodgy ground.
 

1.5  Other limitations of the implementation

I haven't tried it on position-independent code (-fpic, -fPIC), which means that I don't know whether or not it works on ELF shared libraries (.so's).  In principle this should be possible, but since I haven't actually done so, it's unlikely to work as-is.

Cacheprof passes the -g flag to gcc/g++/g77, even if you didn't ask for it, so that it can figure out which parts of the assembly code relate to which parts of the original source program.  If you specify -O or -O2 as well, you get the same kind of slightly illogical mapping as you do when single-stepping code compiled -O -g.  As with debugging, if you need exact line number correspondence, turn off -O.  Miss counts are an algorithmic property, so operating without -O dramatically increases the instruction, read and write counts, but has almost no effect on the miss counts.

Cacheprof instruments all reads and writes, even those setting up and clearing stack frames.  This fools GDB, which seems to detect procedure boundaries at least partly by looking for prologue and epilogue sequences it knows about.  The results of GDBing a cacheprof'd binary are ... well, complete chaos.

Cacheprof tries to preserve the behaviour of correct programs.  If your code reads uninitialised memory, writes beyond the end of arrays or otherwise shows signs of moral and spiritual decay, it may well behave differently when instrumented.  I spent hours searching for a (non-)bug in cacheprof for this very reason.  Remedy (and good general advice): check for bad behaviour using Purify (real-money software, alas), or Tristan Gingold's GPLd Checker (www.gnu.org); the latter seems to work on x86-linux with gcc-2.95.1, at least.

I don't guarantee to cover all the x86 opcodes that GCC will ever spew out.  If you're unlucky enough to discover one, cacheprof (cacheann, actually) will die thusly, usually on some way-out mutant floating-point opcode:

    (stdin):3529: syntax error on `fild -2(%ebp)'

or it might die complaining of an "unhandled instruction set artefact".  If so, do mail me (jseward@acm.org) at least the message, and preferably the entire assembly input, so I can fix it.  Alternatively, fix it yourself: the above message means that the fild opcode is not implemented.  Add the opcode to the Opcode data defn in Arch_x86.hs, give it an entry in the big table which follows, and then do 'make cleanish && make' to rebuild.  "Unhandled instruction set artefact" and other complaints are only a little harder to fix yourself.

The driver might well complain about "can't map flags", or, worse, act wrongly, if you pass obscure flags to your compiler.  Please mail me if that happens.

I'm sure there are other limitations, but I can't remember them right now.  Today is 29 November 1999; just enough time left to slip in a subtle Y2K bug of some kind :-)
 

1.6  The license

Cacheprof is licensed under the GNU General Public License, version 2.  Read the file LICENSE in the source distribution for details.
 
 

2  How to use it

2.1  Basics

Compile as much of your program as possible (preferably all of it) in whatever way you did before, with one alteration: place the command 'cacheprof' in front of all invocations of gcc, g++ or g77.  For example, if you previously did this:

   gcc -Wall -O -I/path/to/includes -c foo.c
   gcc -Wall -O -I/path/to/includes -c bar.c
   gcc -o xyzzy foo.o bar.o -L/path/to/libs -ldizzylizzy

change your Makefile (or whatever) so that this happens:

   cacheprof gcc -Wall -O -I/path/to/includes -c foo.c
   cacheprof gcc -Wall -O -I/path/to/includes -c bar.c
   cacheprof gcc -o xyzzy foo.o bar.o -L/path/to/libs -ldizzylizzy

This example links in the library libdizzylizzy, and you should of course have previously built that with cacheprof, although the program will still work if you haven't.  See Section 1.5 for a discussion of what happens if you don't profile some parts of your program.  Cacheprof hasn't been tested for compiling position-independent code, so you'd be best off compiling libdizzylizzy statically, ie to libdizzylizzy.a.

You then run your program in the usual way.  It should produce the message

   ==cacheprof== level-2 instrumented program: startup

and, on completion, a summary, like this:

   ==cacheprof== level-2 instrumented program: run complete
   ==cacheprof== 1,259,420,409 insns
   ==cacheprof==   209,716,905 refs   (          703 rd +   209,716,202 wr)
   ==cacheprof==   117,965,302 misses (          400 rd +   117,964,902 wr)
   ==cacheprof== 93.05 seconds, 13.53 MIPS

Detailed information from the run is dumped into the file cacheprof.out.  It's a text file; you can read it if you really like, although you won't be much the wiser.  (Note: the miss rate for this example program is atypically large, around 50%.  More reasonable workloads are closer to the 5% mark).

Run cacheprint in the directory where cacheprof.out exists to see whole-program and per-procedure summaries.  To see annotated source code, pass the names of the source files in question to cacheprint, for example: cacheprint foo.c bar.c.  Cacheprint doesn't have any other options -- it's simple to use.  Source lines which generate memory accesses are annotated with two numbers: the number of references and the number of misses.

Cacheprint will attempt to guess where procedures begin in source files, and print a summary for the procedure at that point.  Its guessing heuristic is pretty ropey and sometimes gets it a few lines too early.  If you have inlined functions in your program, the results will be utterly confusing (but still correct, I hope).

Cacheprof.out is created afresh after every run.  Profiled binaries also append a single-line summary to cacheprof.out.summary.  This is an experimental hack facilitating accumulation of costs over multiple runs.
 

2.2  Options for cacheprof

You can pass args to cacheprof before the compile-command, like this:

  cacheprof --oink --moo gcc -Wall -O -I/path/to/includes -c foo.c

Allowable flags are:

Cacheprof looks at the flags and args you pass to your compiler to figure out what to do.  This has some consequences:

Gprof-style profiling can't be combined with cacheprof-ing.  If you specify -p or -pg, cacheprof will barf.  Also, cacheprof needs to be able to sneakily grab the assembly output of your compiler, so -pipe isn't allowed either.

You can't debug cacheprof'd code, because the instrumentation confuses GDB.  Cacheprof doesn't disallow the -g flag, but it will ignore it.

For all other flags, cacheprof tries to guess which particular phase to pass the flag to.  If it can't, it gives the flag to all phases, and prints a warning message of the form "can't map flag ...".  If this happens frequently to you, edit the flag_mapping_table in CacheProf.hs and recompile -- it isn't hard to figure out what to say.
 

2.3  The annotated assembly (.sA, -SA) stage

This is useful background info for hackers.

Traditional gcc, g++, g77, as we know and love them, will translate input files, stopping at a point often indicated by a flag.  For example, -E means "stop after preprocessing", -S stops at assembly, and -c produces object code.

Cacheprof makes gcc/g++/g77 behave as if there were a new stage, which lives between the normal assembly (-S) and object code (-c) phases: annotated assembly.  The relevant flag is -SA, and the relevant file extension is .sA.  You can do all the expected things, for example:

   cacheprof gcc -SA foo.c         -- produces foo.sA
   cacheprof gcc -c foo.sA bar.s   -- produces foo.o and bar.o

What an amazing system :-)
 

2.4  Manufacturing new caches with cachegen

You can create a cache simulator for a variety of caches using cachegen.  Pass it, as parameters, the cache size in bytes, the line size in bytes, and the associativity, for example:

   cachegen 32768 32 2

and this writes, to stdout, a file which implements the specified cache, using an LRU replacement algorithm.  To reconfigure cacheprof to use this new cache, first do 'make cleancache', then copy cachegen's output into a file called cachesim_doref.c, then do 'make cachesim.o', and finally re-link your program.  You don't need a Haskell compiler to do any of this.
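
For example, to switch to a 16 kbyte, 4-way cache with 32 byte lines (assuming you are in the cacheprof build directory):

   make cleancache
   cachegen 16384 32 4 > cachesim_doref.c
   make cachesim.o

and then re-link your program against the new cachesim.o.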

Alternatively, edit the params in the Makefile, and do 'make cleancache ; make cachesim.o'.
 

3  How it works

Cacheprof inspects your compile command.  For each input which needs to be translated across the boundary from assembly code to object code, cacheprof arranges to intercept the assembly output, by breaking the original compile command up into pieces.  The assembly code is fed to cacheann, which parses and analyses the assembly, and spews out an instrumented version of it.  Cacheprof picks this up and steers it through subsequent assembly and linkage phases as necessary.
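
In effect, a command like 'cacheprof gcc -c foo.c' is broken up into steps along these lines (the exact internal invocations may differ; treat this as a sketch):

   gcc -g -S foo.c -o foo.s     -- compile to assembly, with debug info
   cacheann < foo.s > foo.sA    -- parse, analyse, instrument
   as foo.sA -o foo.o           -- assemble the instrumented code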

Cacheprof takes care to ask the compiler(s) proper to emit debugging information into the assembly, even if you didn't ask for it.  Cacheann uses this info to figure out, for each instruction which does a read or write, which source file, line and procedure to "send the bill" to.  The debugging info is removed from the final annotated assembly, since, as previously discussed, GDB can't debug annotated code.  If, for some reason, debugging info isn't available, cacheann can usually guess the current file and function name from labels and other artefacts in the assembly source.

At the final linkage stage, cacheprof links in cachesim.o, which contains startup and shutdown routines, and the cache simulator proper.

Each instruction which reads or writes memory in the instrumented assembly has, associated with it, a "cost center" (try --ddump-ccs to see them).  Just before each memory-touching instruction, cacheann inserts a call to an intermediate piece of link code, in cacheprof_hooks2_x86.s, passing as parameters the address of the memory to be read/written, and the address of the cost center associated with this instruction.  The link code saves the machine state on the C stack, and calls into the cache simulator proper, again passing the address and cost center.  The simulator updates the simulated cache.  If a miss results, the miss count in the supplied cost center is incremented.

Control then returns to the link code, which restores the machine state and returns to the program proper.  It's important to understand that the simulator does not and cannot actually simulate the contents of the cache, so it can't actually supply the result for memory reads.  That is done by simply executing the instruction after returning from the link code.
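
To make the mechanics concrete, here is a minimal sketch of a simulator of the kind described.  The structure and names are assumptions (loosely modelled on the cachesim_doref.c of Section 2.4), not cacheprof's actual code; in particular, it ignores the size of each reference:

   /* Sketch of a set-associative LRU simulator.  Only line tags are
      tracked, never data contents, exactly as noted above. */
   #define CACHE_SIZE (32*1024)                      /* 32 kbytes  */
   #define LINE_SIZE  32                             /* bytes/line */
   #define ASSOC      2                              /* 2-way      */
   #define N_SETS     (CACHE_SIZE / (LINE_SIZE * ASSOC))

   typedef struct {
      unsigned long long refs;      /* 64-bit counters, as in 1.1 */
      unsigned long long misses;
   } CostCenter;

   static unsigned int tags [N_SETS][ASSOC];  /* tags[s][0] is MRU */
   static int          valid[N_SETS][ASSOC];

   void cachesim_doref ( unsigned int addr, CostCenter* cc )
   {
      unsigned int tag = addr / LINE_SIZE;
      unsigned int set = tag % N_SETS;
      int i;
      cc->refs++;
      for (i = 0; i < ASSOC; i++) {
         if (valid[set][i] && tags[set][i] == tag) {
            /* hit: shuffle this line up to the MRU position */
            for (; i > 0; i--) tags[set][i] = tags[set][i-1];
            tags[set][0] = tag;
            return;
         }
      }
      /* miss: charge the cost center, evict the LRU line and
         install the new line in the MRU position */
      cc->misses++;
      for (i = ASSOC-1; i > 0; i--) {
         tags [set][i] = tags [set][i-1];
         valid[set][i] = valid[set][i-1];
      }
      tags [set][0] = tag;
      valid[set][0] = 1;
   }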

When execution finishes, all cost centers and associated file, function and line-number information is dumped into cacheprof.out, in one great big mess.  Cacheprint reads this and merges together information from all instrumented object files, so that it has a picture of the program's behaviour as a whole.

The mechanism is critically dependent on the fact that a call up to the cache simulator is entirely transparent to the original program, provided that the link code carefully saves and restores enough of the machine state.  On x86 we can just about get away with this because there are so few integer registers, and none of the FP registers need be saved.  For almost all other architectures, a save/restore at each memory reference would be prohibitively expensive, so in the long term it is probably better for the straight-line code to dump (address, cost-center) pairs into a smallish trace buffer and only call out-of-line to the simulator when the trace buffer is full.  This is all standard stuff in cache-simulation circles.  Unfortunately this will probably increase the size of instrumented code still further.
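
A sketch of that trace-buffer scheme, reusing the hypothetical CostCenter and cachesim_doref from the sketch above (the buffer capacity is an arbitrary choice):

   #define TRACE_CAP 4096

   typedef struct { unsigned int addr; CostCenter* cc; } TraceEntry;

   static TraceEntry trace_buf[TRACE_CAP];
   static int        trace_used = 0;

   /* The straight-line code would dump (address, cost-center)
      pairs here, calling out to the simulator only when the
      buffer fills, amortising the state save/restore. */
   void trace_ref ( unsigned int addr, CostCenter* cc )
   {
      int i;
      trace_buf[trace_used].addr = addr;
      trace_buf[trace_used].cc   = cc;
      trace_used++;
      if (trace_used == TRACE_CAP) {
         for (i = 0; i < TRACE_CAP; i++)
            cachesim_doref ( trace_buf[i].addr, trace_buf[i].cc );
         trace_used = 0;
      }
   }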
 

4  How to install it

4.1  Installing pre-built binaries

Very easy.  Get hold of the relevant tarball, create an installation directory and untar in there -- at the moment I don't create the directory for you.  Set the environment variable CACHEPROFDIR to be that directory.  Then add $CACHEPROFDIR to your path.  The programs cacheprof, cacheprint, cachegen and cacheann should then be available.  You shouldn't run cacheann directly.
 

4.2  Building from source

Compiling from source is very easy, once you have a Haskell compiler installed.  Haskell compilers are freely available; see the next section for details.

You should be able to compile cacheprof with any compiler which implements the Haskell98 language standard correctly.  The available candidates are: the Glasgow Haskell Compiler (ghc), version 4.04pl1 (or later; perhaps 4.06 by the time you read this), the Chalmers Haskell Compiler (hbc), version 0.9999.5b, and Niklas Rojemo's / Malcolm Wallace's nhc98 compiler, version 1.0pre13 or later.

I recommend you use ghc for production use; it produces fast code and has an unobtrusive run-time system.  hbc also produces fast code, but unfortunately seems to have a bug which causes cacheann to sometimes loop, so I can't recommend this, alas.  nhc98 produces smaller and slower binaries, but is very portable and easy to install.  I suggest you use it to get started if you have difficulties installing ghc.

I should perhaps mention that my day job is as a ghc developer.  If you encounter difficulties with ghc, mail me or some other member of our Merry Crew.

If you plan to do any serious tinkering with Cacheprof, you really ought to install the excellent interactive Haskell interpreter called "Hugs98".  A common way to develop Haskell is to work inside Hugs, compiling to a standalone binary with one of the above compilers only when a production version is needed.  Hugs98 is easy to install and runs on most platforms.

All four systems are open-source of one kind or another (ghc, hbc and hugs have BSD-style licenses), and all have binary distributions available for at least x86-linux.
 

4.3  Haskell resources

Everything you could possibly ask for -- an introduction to the language, documentation, libraries, and links to the compilers -- is conveniently located at http://haskell.org.
 

4.4  How to contribute

Cacheprof is very much a work in progress, and contributions are welcomed.  There are many useful things you can do.  Here's the contents of my TODO list:

Validation

The single most important task, since the system is worthless if its numbers are bogus.

Functionality

Portability

Non-projects

Translating cacheprof out of Haskell into C++ or some other language.  Either you know Haskell, in which case you won't want to translate it into any other language, or you don't know Haskell, in which case, by the time you find out enough about it, you'll find it a lot quicker to make the mods you need in Haskell than to translate to C++/whatever.
Seriously, though: cacheprint has to deal with huge volumes of data and might benefit from a good C implementation.  For the other two (cacheann, cacheprof), forget it.

Of course, if you have some Worthy Proposal like building this functionality directly into gcc, I would lose my Bad Attitude and instead have a Very Good Attitude Indeed.
 
 

5  If you have problems

Mail me (jseward@acm.org).
 

6  An example

Here's a short example, using the following program (called example.c):

   #include <stdio.h>
   #define SIZE 1024
   int arr[SIZE][SIZE];

   void cmTraverse ( void ) {
      int i, j;
      for (i = 0; i < SIZE; i++)
         for (j = 0; j < SIZE; j++)
            arr[i][j] = (i + j);
   }

   void rmTraverse ( void ) {
      int i, j;
      for (i = 0; i < SIZE; i++)
         for (j = 0; j < SIZE; j++)
            arr[j][i] = (i + j);
   }

   int main ( int argc, char** argv ) {
      printf ( "hello, world!\n" );
      cmTraverse();
      rmTraverse();
      return 0;
   }

The program traverses a big square array in both row-major and column-major fashion.  cmTraverse and rmTraverse look almost identical, but their cache friendliness is dramatically different.  We now compile and run the program in the normal way, except that the compile command is presented to cacheprof:

   sewardj@muraroa:~$ cacheprof gcc -O -Wall -o example example.c

   sewardj@muraroa:~$ ./example
   ==cacheprof== level-2 instrumented program: startup
   hello, world!
   ==cacheprof== level-2 instrumented program: run complete
   ==cacheprof==    12,594,205 insns
   ==cacheprof==     2,097,171 refs   (            9 rd +     2,097,162 wr)
   ==cacheprof==     1,179,654 misses (            3 rd +     1,179,651 wr)
   ==cacheprof== 0.93 seconds, 13.54 MIPS

The program prints some handy summary stats, and dumps detailed info into cacheprof.out.  Running 'cacheprint example.c' then gives the following (you may need to make your window wider, and get out your magnifying glass):

   12,594,205 instructions,  2,097,171 references,  1,179,654 misses.

     SIZE          RD-&-WR      %          MISSES      %      RATE
   ---------------------------------------------------------------
   TOTAL:        2,097,171  100.%       1,179,654  100.%     56.2%
   1 byte                0   0.0%               0   0.0%      NaN%
   2 byte                0   0.0%               0   0.0%      NaN%
   4 byte        2,097,171  100.%       1,179,654  100.%     56.2%
   8 byte                0   0.0%               0   0.0%      NaN%

     SIZE            READS      %          MISSES      %      RATE
   ---------------------------------------------------------------
   TOTAL:                9  100.%               3  100.%     33.3%
   1 byte                0   0.0%               0   0.0%      NaN%
   2 byte                0   0.0%               0   0.0%      NaN%
   4 byte                9  100.%               3  100.%     33.3%
   8 byte                0   0.0%               0   0.0%      NaN%

     SIZE           WRITES      %          MISSES      %      RATE
   ---------------------------------------------------------------
   TOTAL:        2,097,162  100.%       1,179,651  100.%     56.2%
   1 byte                0   0.0%               0   0.0%      NaN%
   2 byte                0   0.0%               0   0.0%      NaN%
   4 byte        2,097,162  100.%       1,179,651  100.%     56.2%
   8 byte                0   0.0%               0   0.0%      NaN%

   1 source files, 3 function names, and 21 program points
   are mentioned in `cacheprof.out'.
 

   FUNCTION NAME               RD-&-WR      %  cum-%       MISSES      %  cum-%      RATE
   --------------------------------------------------------------------------------------
   rmTraverse(example.c)     1,048,583  49.9%  49.9%    1,048,579  88.8%  88.8%     99.9%
   cmTraverse(example.c)     1,048,581  49.9%  99.9%      131,074  11.1%  99.9%     12.5%
   main(example.c)                   7   0.0%  100.%            1   0.0%  100.%     14.2%

   The remaining 0 functions contributed less than 0.01% of the reads-&-writes,
   and less than 0.01% of the misses.
 

--------------------------------------------------------------------------------------------
--- Annotated source: example.c ------------------------------------------------------------
--------------------------------------------------------------------------------------------
 

                        #include <stdio.h>

                        #define SIZE 1024
                        int arr[SIZE][SIZE];

                        void cmTraverse ( void )
          2          0  {
                           int i, j;
                           for (i = 0; i < SIZE; i++)
                              for (j = 0; j < SIZE; j++)
  1,048,576    131,073           arr[i][j] = (i + j);
          3          1  }

                        void rmTraverse ( void )
          3          1  {
                           int i, j;
                           for (i = 0; i < SIZE; i++)
                              for (j = 0; j < SIZE; j++)
  1,048,576  1,048,576           arr[j][i] = (i + j);
          4          2  }

                        int main ( int argc, char** argv )
          1          1  {
          2          0     printf ( "hello, world!\n" );
          1          0     cmTraverse();
          1          0     rmTraverse();
                           return 0;
          2          0  }

The sordid outcome of this investigation is that one of the two traversals is far worse than the other: rmTraverse, which strides through the array 4096 bytes at a time, misses on virtually every write, whereas cmTraverse, which visits consecutive locations, misses only once per 32-byte line (I think I have the row- and column- descriptions the wrong way round in the function names, but it doesn't matter).  The particular array sizes were chosen so as to illustrate the worst-possible behaviour for the simulated cache, which is the L1 data cache of an AMD K6-2.  The miss rates differ by a factor of 8.  Running the (uninstrumented) program 100 times on a real K6, the standard gprof profiling tool indicates that ...

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 68.64     11.93    11.93      100   119.30   119.30  rmTraverse
 31.36     17.38     5.45      100    54.50    54.50  cmTraverse
  0.00     17.38     0.00        1     0.00 17380.00  main

ie the real timing ratio is about 2.2:1 -- not as dramatic as the 8:1 difference in miss rates, probably since (1) executing instructions also takes time (alas :-), and (2) both procedures also miss badly on my weedy (Socket-7-typical) 512k second-level cache (arr is 4 megabytes in size).  Perhaps I should have simulated that instead.  Ah well.

Finally, timing the uninstrumented program shows it runs in about 0.17 seconds, and since we know it executes 12.594 million instructions, the execution rate is about 72 MIPS, which is pretty appalling for a 350MHz K6-2.  That just confirms that this program does very badly in the cache -- by comparison, most "normal" programs run at between 150 and 250 MIPS on this machine.  (Note: MIPS == Meaningless Indication of Processor Speed.  Never forget that.)
 

7  How cacheprof came to be

After the initial release of bzip2-0.1pl2 in August 1997, I spent a lot of time trying to improve bzip2's dismal compression speed, without much success.  I fairly rapidly came to the conclusion that memory effects had a large bearing on performance.  Any decent book on computer architecture will tell you that, but discovering that it afflicts one's own programs really brought it home to me.

I started to look around for open-source tools which I could use to quantify memory/cache effects in bzip2, and was totally amazed not to find any.  Having read many papers about sorting algorithms (the slow part of bzip2) I was also surprised to discover that the (academic) algorithm community places a lot of emphasis on developing algorithms with lower instruction counts, but almost no emphasis on designing more cache-efficient algorithms.  I find this attitude quite illogical, given the relative costs of instructions vs cache misses on contemporary architectures.

For a long time I mused on how to build a cache profiler, but did nothing.

In early summer '99, I knocked up a new back end to the lcc C compiler.  The back end emitted C (bizarrely), but instrumented so that memory references were passed to a cache simulator.  It was a terrible hack, but it worked well enough to tell me interesting things about bzip2.  For me, it "proved the concept", but wasn't a satisfactory solution.  It was tied to C, to the specific C compiler, and didn't catch all references.  I wasn't sure how to proceed.

Later in the summer, my excellent colleague Simon Marlow suggested that it would be much simpler to parse and annotate assembly code.  This was the critical idea which made cacheprof practical.  Many weekends and late nights later, this is the result.