Valgrind 3.21 was released last week. The release announcement has the full details of the changes.

My contribution

My contribution to this release is that I rewrote the cg_annotate, cg_diff, and cg_merge programs in Python, and greatly improved the output produced by cg_annotate.

The rewritten programs

When you run a program under Cachegrind it produces an output file containing profiling data. These three programs do post-processing.

  • cg_annotate processes a Cachegrind output file and produces human-readable output, including source code annotated with event counts. I wrote the original version in 2002 in Perl. At that time, Perl was a decent choice for a program that reads in some structured text input, does some processing, and then produces text output. In 2023? Not so much.

  • cg_diff diffs two Cachegrind output files, producing a third Cachegrind output file you can then run cg_annotate on. It’s great for comparing the performance of two slightly different versions of a program. I wrote the original version in 2010. It’s also in Perl, and it’s very much a cut-down-and-modified version of cg_annotate.

  • cg_merge combines two or more Cachegrind output files, producing another Cachegrind output file you can then run cg_annotate on. Julian Seward wrote it in 2007 in C. I believe he wrote it in C because he didn’t know Perl.
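
To give a flavour of the kind of processing involved, here is a minimal sketch (my own illustration, not the real code) of reading a Cachegrind output file in Python. It only handles the “events:”, “fl=”, “fn=”, and numeric count lines and ignores the other record types; the real programs handle the full format and do far more with the results.

    from collections import defaultdict

    def read_cachegrind_output(
        path: str,
    ) -> tuple[list[str], dict[tuple[str, str], list[int]]]:
        # Accumulate event counts per (source file, function) pair.
        # Simplified: real output files contain other record types too.
        counts: defaultdict[tuple[str, str], list[int]] = defaultdict(list)
        events: list[str] = []   # event names, e.g. ["Ir", "I1mr", "ILmr", ...]
        fl = fn = "???"
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith("events:"):
                    events = line.split()[1:]
                elif line.startswith("fl="):
                    fl = line[3:]            # current source file
                elif line.startswith("fn="):
                    fn = line[3:]            # current function
                elif line and line[0].isdigit():
                    # "<line number> <count> <count> ...": add the counts to
                    # the running total for the current file/function.
                    nums = [int(n) for n in line.split()]
                    cc = counts[(fl, fn)]
                    for i, n in enumerate(nums[1:]):
                        if i < len(cc):
                            cc[i] += n
                        else:
                            cc.append(n)
        return events, dict(counts)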

Problems with the old versions

Recently I had some ideas for how to make cg_annotate’s output more useful. But they would have required some major changes to the code. I didn’t want to do that, for a few reasons.

First, I now find Perl to be an unpleasant language, and my knowledge of it has withered from lack of use. Reading the existing code was painful — particularly when trying to remember what all those sigils ($, @, %, \) mean in different contexts — and the thought of major modifications was even worse. Plus, Perl is clearly a dying language.

Second, the code wasn’t well structured. There were lots of global variables, and the code had a linear “do this, then that, then that” style with very few data or control abstractions. I am a better programmer today than I was in 2002.

So I decided to bite the bullet and rewrite these three programs in Python. (Why not Rust? Because Valgrind already depends on Python, but doesn’t depend on Rust, and I didn’t want to add a Rust dependency.) It turned out really well. I am no Python expert, but I find it a much nicer language to use than Perl.

The rewriting process

I started by transliterating the Perl into Python, and getting it to produce the exact same output. I then gradually refactored it, improving the structure and getting rid of most of the global variables. After that, I was able to introduce the functionality changes I wanted without much difficulty.

I then followed a similar process with cg_diff and cg_merge. Although the old cg_merge was written in C, it has a lot of overlap with cg_diff so the Python version was easy to get working.

I was initially worried that the Python versions would be slower than the Perl and C versions. But even the naive initial Python versions of cg_annotate and cg_diff were about 1.5x faster than the Perl versions, and after some profiling and optimisation they are now something like 3-4x faster. (And even faster if you use pypy.) The initial version of cg_merge was about 3x slower than the C version. The final version is certainly faster, though I haven’t measured it. I’m not worried about its performance because cg_merge is the least used of the three programs.

The Python version of cg_merge is about 5x less code than the C version! (To be fair, a large chunk of the C code was an implementation of a balanced binary tree copied from somewhere else.)

I’m generally happy with the final performance, but still a little frustrated about one thing. The main data structure used by these programs is a cost centre. Each cost centre contains a list[int], holding one or more event counts for a file, function, or line. I originally implemented this as a class called Cc. I implemented the __iadd__ and __isub__ methods so I could use += and -= to add/subtract one Cc to/from another. Later I tried changing Cc to just a type alias for list[int], which required changing those methods to top-level functions add_cc_to_cc and sub_cc_from_cc. This gave me the single biggest speed-up of any change I made, reducing runtime by something like 30%! In this case I decided the performance improvement was worth making the code a little uglier, but it’s a shame that a fundamental language operation like an object method call is that much slower than a top-level function call.
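
To make that trade-off concrete, here is a rough sketch of the two approaches. The details are reconstructed from memory rather than copied from the real code, and the alias is renamed so that both versions can sit in one file.

    # Original approach: a small class with operator overloading.
    class Cc:
        def __init__(self, events: list[int]) -> None:
            self.events = events

        def __iadd__(self, other: "Cc") -> "Cc":
            # Assumes both Ccs hold the same number of event counts.
            for i, n in enumerate(other.events):
                self.events[i] += n
            return self

        def __isub__(self, other: "Cc") -> "Cc":
            for i, n in enumerate(other.events):
                self.events[i] -= n
            return self

    # Faster approach: a plain type alias plus top-level functions. (In the
    # real code the alias simply replaced Cc; it is renamed here only to
    # avoid clashing with the class above.)
    CcAlias = list[int]

    def add_cc_to_cc(dst: CcAlias, src: CcAlias) -> None:
        for i, n in enumerate(src):
            dst[i] += n

    def sub_cc_from_cc(dst: CcAlias, src: CcAlias) -> None:
        for i, n in enumerate(src):
            dst[i] -= n

    # The class version reads more nicely...
    a = Cc([10, 2, 1])
    a += Cc([5, 1, 0])

    # ...but the plain-list version avoids the method dispatch and the extra
    # object wrapping each list, which is presumably where most of the
    # speed-up came from.
    b = [10, 2, 1]
    add_cc_to_cc(b, [5, 1, 0])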

After all this I realised that there wasn’t any need for these three programs to be separate, because there is plenty of overlap between them. So I added diff and merge capabilities to cg_annotate. I also marked cg_diff and cg_merge as deprecated, though they still exist for now.

Python tools

Along the way I was delighted to learn about some of the great Python tooling that is available.

  • I use isort and black to auto-format the code. I like auto-formatting. It saves time and spares me from constant decision-making about code style, conserving mental energy for more important aspects of the code.
  • I use mypy and pyright to type-check the code. I use two different type checkers because each one catches some things that the other misses. I love static typing. It saves so much time, particularly when refactoring code. (“All writing is rewriting” applies to code, too.) It’s great to see how well static typing works in Python nowadays. (There’s a small illustrative example after this list.)
  • I use ruff and pylint for linting. These occasionally find real problems. pylint required configuration to disable certain lints that I don’t like.
  • I use cProfile (with snakeviz) and Scalene for profiling. Both are useful, and it’s particularly nice that Scalene has both line-level profiling and memory profiling. Scalene also has AI-powered optimisation suggestions, but they didn’t help me. As I mentioned above, cg_annotate deals with many lists of integers, but they’re all very short. (The most common length is one.) The AI mostly kept suggesting “use NumPy here to parallelize this operation”. Some of these suggestions were plausible but wouldn’t have helped performance, and some were in places where NumPy use didn’t even make sense. The robots haven’t replaced me yet.
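
As a tiny illustration of the kind of mistake the type checkers catch (this is a made-up example, not code from cg_annotate): if a refactor leaves a call site passing the wrong type, both mypy and pyright flag it before the program ever runs.

    # Hypothetical helper: render an event count with its share of the total.
    def format_count(count: int, total: int) -> str:
        return f"{count:,} ({100 * count / total:.1f}%)"

    print(format_count(148_506_803, 4_000_000_000))   # "148,506,803 (3.7%)"

    # A stale call site that passes a pre-formatted string instead of an int
    # would be rejected by both checkers:
    #     format_count(148_506_803, "4,000,000,000")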

One thing I haven’t mentioned is packaging. All three programs are implemented in a single file and only rely on the Python Standard Library. This means I can use cp as the package manager, which avoids a whole lot of potential headaches!

Better cg_annotate output

The end result of all this is that cg_annotate has better output, particularly for programs that use inlining a lot. Consider this small example from profiling the Rust compiler:

> 273,215,023  (6.8%, 18.5%)  <rustc_middle::ty::fast_reject::DeepRejectCtxt>::substs_refs_may_unify:
  148,506,803  (3.7%)           /home/njn/dev/rust0/compiler/rustc_middle/src/ty/fast_reject.rs
   42,562,331  (1.1%)           /home/njn/dev/rust0/library/core/src/iter/adapters/zip.rs
   23,802,697  (0.6%)           /home/njn/dev/rust0/library/core/src/iter/traits/iterator.rs
   21,218,148  (0.5%)           /home/njn/dev/rust0/compiler/rustc_middle/src/ty/subst.rs
   15,909,582  (0.4%)           /home/njn/dev/rust0/library/core/src/cmp.rs
   10,609,074  (0.3%)           /home/njn/dev/rust0/library/core/src/iter/adapters/copied.rs
   10,606,388  (0.3%)           /home/njn/dev/rust0/compiler/rustc_middle/src/ty/list.rs

This is the entry for a single compiled function called substs_refs_may_unify. It accounts for 6.8% of the instructions executed in this run of the compiler. Because of inlining, that single compiled function includes code from at least seven different Rust source functions: some in the compiler itself, and some in the standard library. With the new output this is easy to see. With the old output these seven entries were scattered throughout the output. It was easy to overlook the connection between them, and thus underestimate the importance of substs_refs_may_unify.

It’s the kind of change that might seem underwhelming if you don’t use Cachegrind, but it makes a real difference if you use Cachegrind a lot on large programs. I have found it very useful so far and I’m looking forward to using it extensively in the future.

Notes on Valgrind development

This was the first major change I’ve made to Valgrind in some time. Having become used to Rust compiler development, there are a number of things that now frustrate me about Valgrind development.

First, Valgrind’s development is scattered across a number of systems.

Second, Valgrind development processes are quite old-fashioned.

  • There are no explicit governance mechanisms or decision-making processes.
  • The workflow for new contributors is primitive, and involves either sending a patch to the mailing list or submitting a patch to a bug report in Bugzilla. Either way, the patch must be merged by someone with commit privileges.
  • People with commit privileges can commit anything at any time.
    • There are no pre-commit code review requirements.
    • There is no mandatory pre-commit testing. (Pre-commit try pushes are newly possible. Post-commit CI test runs also occur.) Commits that break things are common.
  • Some platforms consistently fail CI testing.
  • Code style is inconsistent and actively bad in many ways (3-space indents!), with no auto-formatting (other than for the new Python programs mentioned above).
  • Bug reports receive some level of attention, but less than would be ideal.
  • Documentation is done with DocBook, which is powerful but requires writing XML. (It’s no fun writing <computeroutput>text</computeroutput> instead of `text` like you would in Markdown.)
  • There is no code of conduct.

Much of this is due to age. When Valgrind was created 20 years ago a lot of today’s niceties — things you get with little or no effort when you start a new project on GitHub — didn’t exist. The above items are not of equal importance, but I think several of them pose a significant risk to Valgrind’s long-term health. Which ones could be improved?

Third, although Valgrind is a generic framework for creating dynamic analysis tools, new tools are quite hard to write and distribute.

  • Although Valgrind’s core provides a lot of functionality, writing a new tool of any complexity is still somewhat painful because tools are written in C. Many things that are easy in other languages are hard work in C, due to the lack of protection against bugs and the impoverished standard library. (A tiny example: DHAT produces JSON output, which requires an implementation of JSON string escaping, and string processing is no fun at all in C.) Also, Valgrind’s IR is very clean for one written in C, but you could do much better in a language with algebraic data types. I’ve had several ideas for experimental profiling tools over the years that I haven’t implemented because the effort required to get even a prototype working was too high.
  • Tools that are within Valgrind’s repository are first-class citizens and are distributed with Valgrind. Beyond that, there is no real distribution mechanism. The most practical way to distribute an external tool is as a patch, with instructions saying “apply this patch to the Valgrind source code and then build it yourself”.

Here’s an interesting question: what changes to Valgrind would be required to allow a new tool to be written in Rust and distributed separately from the core Valgrind distribution?