Welcome to the fifteenth post in my long-running “How to speed up the Rust compiler” series, and the first in 2023.
Let’s start with some improvements made by others. The information about metrics at the top of my last post still applies.
#107449: In this PR, @saethlin enabled an existing MIR optimization pass called CopyProp. This gave a mean wall-time reduction of 0.54% across all benchmark results, and 0.83% across all optimized primary benchmark results. This was one of those changes that was fairly simple in the end, but took a lot of digging to uncover. (The linked Mastodon thread has information about various other MIR improvements that @saethlin made.)
#108815: In this PR, @the8472 sped up obligation processing, for wall-time reductions of 10-12% on keccak and 1-3% on some other benchmarks. Impressively, this was achieved by adding a really fast path on top of code that already had a heavily optimized fast path.
#103331: In this PR I changed the LLVM IR generated by the rustc front-end for debug builds to use br instead of a switch with two targets, when possible. This gave surprisingly good compile time reductions, with a mean reduction in cycles of 1.16% across all debug primary benchmark results, and maximum reductions of over 5%. The reason for the improvement is that switch instructions aren't handled by LLVM's FastISel instruction selector, so this change reduces the number of times that LLVM must fall back to the slower, more general SelectionDAG instruction selector.
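To make the transformation concrete, here is a sketch of the kind of two-way branch involved. The Rust function is ordinary; the LLVM IR shown in the comments is illustrative of the before/after shapes rather than rustc's exact output.

```rust
// In a debug build, a two-way branch like the `match` below could be lowered
// to a `switch` with two targets, e.g. (illustrative IR, not exact output):
//
//     switch i32 %x, label %bb_nonzero [ i32 0, label %bb_zero ]
//
// FastISel cannot handle `switch`, so LLVM would fall back to SelectionDAG.
// The equivalent conditional branch avoids that fallback:
//
//     %c = icmp eq i32 %x, 0
//     br i1 %c, label %bb_zero, label %bb_nonzero
//
fn classify(x: u32) -> &'static str {
    match x {
        0 => "zero",
        _ => "nonzero",
    }
}

fn main() {
    assert_eq!(classify(0), "zero");
    assert_eq!(classify(7), "nonzero");
}
```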
#103138: In this PR I changed the LLVM IR generated by the rustc front-end to use bigger basic blocks. The front-end was generating many small basic blocks that would unconditionally branch to another basic block. For example, if we have block A and block B, and the only way to reach B is via an unconditional branch at the end of A, then the two blocks can be merged. This change reduced the number of br (unconditional branch) instructions in the generated code by about 40%. Disappointingly, this had little effect on compile times, but it did cause a mean reduction in binary size of 0.37% across all debug primary benchmark results. It also makes the generated LLVM IR easier to read and debug.
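The merging criterion can be sketched on a toy control-flow graph. The types and function below are hypothetical illustrations, not rustc's actual IR types: a block B is folded into A whenever A ends in an unconditional branch to B and B has exactly one predecessor.

```rust
use std::collections::HashMap;

// Toy CFG: each block holds some instructions and a terminator.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Term {
    Goto(usize), // unconditional branch to another block
    Return,
}

#[derive(Clone, Debug)]
struct Block {
    insts: Vec<String>,
    term: Term,
}

// Merge B into A whenever A ends in Goto(B) and B has exactly one
// predecessor. Merged-away blocks become `None`.
fn merge_blocks(blocks: &mut Vec<Option<Block>>) {
    // Count predecessors of each block.
    let mut preds: HashMap<usize, usize> = HashMap::new();
    for b in blocks.iter().flatten() {
        if let Term::Goto(t) = b.term {
            *preds.entry(t).or_insert(0) += 1;
        }
    }
    for a in 0..blocks.len() {
        // Keep merging so chains A -> B -> C collapse into one block.
        loop {
            let target = match &blocks[a] {
                Some(blk) => match blk.term {
                    Term::Goto(t) if t != a && preds.get(&t) == Some(&1) => t,
                    _ => break,
                },
                None => break,
            };
            let b = blocks[target].take().expect("merge target exists");
            let a_blk = blocks[a].as_mut().expect("source block exists");
            a_blk.insts.extend(b.insts);
            a_blk.term = b.term;
        }
    }
}

fn main() {
    let mut blocks = vec![
        Some(Block { insts: vec!["x".into()], term: Term::Goto(1) }),
        Some(Block { insts: vec!["y".into()], term: Term::Return }),
    ];
    merge_blocks(&mut blocks);
    let b0 = blocks[0].as_ref().unwrap();
    assert_eq!(b0.insts, vec!["x".to_string(), "y".to_string()]);
    assert_eq!(b0.term, Term::Return);
    assert!(blocks[1].is_none());
}
```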
In my last post I mentioned various attempts to shrink AST/HIR/THIR nodes, and how they were less effective than I’d hoped. Since then I got a couple more wins from additional AST shrinkage.
#101562: In this PR I shrank the size of AST expression nodes from 104 bytes to 72 bytes. Expressions are by far the most common type of AST node, and the size reduction was large (30.8%), so this is something of a limit test for the improvements possible in practice from this kind of change. On primary benchmarks this gave several sub-3% wall-time wins. On secondary benchmarks, which include a number of stress tests with a lot of code, it gave up to 6% wall-time wins.
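One common technique for this kind of size reduction is boxing an enum's largest variant, which shrinks every variant to roughly pointer size. This is a generic sketch of the idea, not the specific changes made in #101562:

```rust
use std::mem::size_of;

// An enum is as large as its biggest variant (plus a tag), so one oversized
// variant inflates every value of the type.
#[allow(dead_code)]
enum Big {
    Small(u8),
    Large([u64; 8]), // 64-byte payload paid by every variant
}

// Boxing the large payload shrinks the enum: the variant now holds one word,
// and the payload lives on the heap.
#[allow(dead_code)]
enum Shrunk {
    Small(u8),
    Large(Box<[u64; 8]>),
}

fn main() {
    assert!(size_of::<Big>() > size_of::<Shrunk>());
    // Tag plus one pointer-sized field: at most two words.
    assert!(size_of::<Shrunk>() <= 2 * size_of::<usize>());
}
```

The trade-off is an extra allocation and pointer chase for the boxed variant, which is why such changes need benchmarking rather than being applied blindly.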
#104754: In this PR I changed a lot of Vec occurrences in AST nodes to ThinVec, which stores the length and capacity of non-empty vectors on the heap next to the vector's elements. This means that the size of a ThinVec struct is just one word, compared to three for a Vec. This makes it a good choice for vectors that are often empty, and it can also be used to shrink the largest variants of an enum, if those variants contain a Vec. This change reduced the size of many AST nodes, such as Item: 184 to 136 bytes; Fn: 184 to 152 bytes; Pat: 88 to 72 bytes; Generics: 72 to 40 bytes; Block: 48 to 32 bytes. This gave sub-3% icount reductions across many benchmarks.
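The size difference is easy to verify. ThinVecSketch below is a hypothetical stand-in for the real thin_vec::ThinVec, included only to show the one-word layout; it is not usable as a vector.

```rust
use std::mem::size_of;

// Heap header that a thin vector stores in front of its elements.
#[allow(dead_code)]
struct Header {
    len: usize,
    cap: usize,
}

// A thin vector is a single (possibly null) pointer to a `Header` followed
// by the elements. `Option<NonNull<_>>` stays one word thanks to the
// null-pointer niche. This is a layout sketch, not thin_vec's actual code.
#[allow(dead_code)]
struct ThinVecSketch<T> {
    ptr: Option<std::ptr::NonNull<Header>>,
    _marker: std::marker::PhantomData<T>,
}

fn main() {
    // Vec is three words: pointer + length + capacity.
    assert_eq!(size_of::<Vec<u32>>(), 3 * size_of::<usize>());
    // The thin representation is a single word.
    assert_eq!(size_of::<ThinVecSketch<u32>>(), size_of::<usize>());
}
```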
The results of these two PRs were good, but both of them were a pain, requiring multiple iterations to get clear performance wins. #101562 took over two months between the PR being opened and it being merged. #104754 took almost three months.
Finally, I tried two more related changes that didn't result in speed-ups. In #108387 I further shrank AST expressions from 72 to 64 bytes. I hoped this would be a win because 64 bytes is the typical size of a cache line, but it made no measurable difference to speed while making the AST code a little more complex. In #108332 I tried changing the length and capacity fields of ThinVec to u32, but that also had negligible effects.
The compiler has numerous lints that check for suspect code patterns. These require traversing the AST and HIR, which takes time. I thought at first that each lint was traversing the AST/HIR separately, and that time could be saved by merging them into fewer traversals. It turned out this had already been done, though the macro-heavy structure of the code made it non-obvious.
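The merged-traversal idea can be sketched on a toy AST. All the names here (Node, EarlyLint, DoubleNeg, run_lints) are made up for illustration; rustc's real lint infrastructure is macro-generated and far more elaborate.

```rust
// Toy AST for expressions.
enum Node {
    Lit(i64),
    Neg(Box<Node>),
    Add(Box<Node>, Box<Node>),
}

// Each lint implements a per-node check.
trait EarlyLint {
    fn check(&mut self, node: &Node);
}

// Example lint: count double negations like --x.
struct DoubleNeg {
    hits: usize,
}

impl EarlyLint for DoubleNeg {
    fn check(&mut self, node: &Node) {
        if let Node::Neg(inner) = node {
            if matches!(**inner, Node::Neg(_)) {
                self.hits += 1;
            }
        }
    }
}

// One walk over the tree invokes every registered lint at each node,
// instead of each lint traversing the tree separately.
fn run_lints(node: &Node, lints: &mut [&mut dyn EarlyLint]) {
    for lint in lints.iter_mut() {
        lint.check(node);
    }
    match node {
        Node::Lit(_) => {}
        Node::Neg(a) => run_lints(a, lints),
        Node::Add(a, b) => {
            run_lints(a, lints);
            run_lints(b, lints);
        }
    }
}

fn main() {
    // (--1) + 2
    let ast = Node::Add(
        Box::new(Node::Neg(Box::new(Node::Neg(Box::new(Node::Lit(1)))))),
        Box::new(Node::Lit(2)),
    );
    let mut dn = DoubleNeg { hits: 0 };
    let mut lints: [&mut dyn EarlyLint; 1] = [&mut dn];
    run_lints(&ast, &mut lints);
    assert_eq!(dn.hits, 1);
}
```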
I also saw some opportunities for code simplifications, which I did in #104863. Unfortunately this caused small but widespread regressions in lint performance. I tried to fix these regressions in #105291 and #105416 with minimal success. I finally realized that I had misunderstood one part of how lint traversals were structured, and then I fixed the regression properly in #105485, which required reverting all of #105291 and part of #104863.
In the end, there were some code simplifications and no performance improvements, but it is an instructive story. The lessons are that (a) even the most experienced of us still make mistakes, and (b) macro-heavy code can be hard to understand.
Miscellaneous performance changes
#105121: The compiler has a debugging option called -Zdump-mir. The code implementing it was causing some allocations even when it was disabled. In #105083 I avoided these allocations, but then @oli-obk reworked my code into a more comprehensive version in this PR, for sub-2% icount reductions across a few benchmarks.
#105436: In this PR I inlined a hot function, which gave a small win on one benchmark.
#107627: In this PR I micro-optimized the hot fold_ty methods in three different folding passes, one of which was TypeFreshener, for sub-2% icount wins across a couple of benchmarks.
#107717: In this PR I tweaked the implementation of TyKind::eq, for sub-2% icount wins across a couple of benchmarks.
#108692: In this PR I removed a redundant computation of a temporary scope during THIR building, for yet more sub-2% icount reductions across numerous benchmarks.
#108794: The compiler has code to compute a “crate hash” for each crate, which involves hashing the internal representation of the crate’s code. This crate hash is used in a few scenarios, such as during incremental compilation and when generating metadata for library crates. It is expensive to compute and there are some scenarios where it isn’t needed, such as non-incremental builds of binary crates. In this PR I avoided computing the crate hash when possible, for sub-3% icount reductions across many primary benchmarks, and some larger reductions across secondary benchmarks, including one of over 11%.
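The shape of that optimization can be sketched as follows. The types and methods (CrateInfo, needs_crate_hash, crate_hash) are hypothetical stand-ins for rustc's internals, included only to show the "compute only when a consumer needs it" pattern:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for the compiler's per-crate state.
struct CrateInfo {
    source: String,       // stand-in for the crate's internal representation
    is_incremental: bool, // incremental compilation enabled?
    emits_metadata: bool, // true for library crates
}

impl CrateInfo {
    // The hash is only consumed by incremental compilation and by
    // library-crate metadata generation.
    fn needs_crate_hash(&self) -> bool {
        self.is_incremental || self.emits_metadata
    }

    // Returns Some(hash) only when it will actually be used; a
    // non-incremental binary build skips the expensive computation entirely.
    fn crate_hash(&self) -> Option<u64> {
        if !self.needs_crate_hash() {
            return None;
        }
        let mut h = DefaultHasher::new();
        self.source.hash(&mut h);
        Some(h.finish())
    }
}

fn main() {
    let bin = CrateInfo {
        source: "fn main() {}".to_string(),
        is_incremental: false,
        emits_metadata: false,
    };
    assert!(bin.crate_hash().is_none()); // skipped

    let lib = CrateInfo {
        source: "pub fn f() {}".to_string(),
        is_incremental: false,
        emits_metadata: true,
    };
    assert!(lib.crate_hash().is_some()); // still computed when needed
}
```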
I use Cachegrind a lot when profiling the compiler. Cachegrind has a script, cg_annotate, that reads a profile and produces human-readable output. Annoyed by some shortcomings in this output, I tried some hacky pre-processing of Cachegrind profiles to improve the output. This change led directly to the final two PRs in the previous section.
So I wanted to make this change a permanent part of cg_annotate, and I have other ideas to improve the output further. However, there was a major problem: cg_annotate was written in Perl. This was a reasonable choice when I originally wrote it in 2002, but it's not good in 2023. Also, the code structure was poor, with lots of global variables and some very long functions. As a result, cg_annotate was difficult to modify, despite not being that large.
The solution? I rewrote it in Python. (Why not Rust? Valgrind already depends on Python, but doesn’t depend on Rust, and I didn’t want to add a Rust dependency.) The Python version is a little shorter and runs twice as fast. More importantly, it is also much easier to modify, due to (a) Python being nicer than Perl (especially the support for static types) and (b) better code structure (I’m a better programmer today than I was in 2002).
I have since made some small functionality changes to the new cg_annotate, and I will make larger ones soon. This will hopefully lead to more rustc speed improvements in the future.
Other non-performance things
Since my last post I also did a few significant things unrelated to compiler performance.
#104429: In this PR I made a small language change that affects how derive is implemented for packed structs. The new approach is simpler, works with generic packed structs, and works with packed structs that impl Copy without explicitly deriving it. The new approach also fails to compile some obscure cases that the old approach accepted, and a crater run showed a handful of crates were affected by this. This breakage was deemed acceptable by the lang team because this change ties in with the removal of some language unsoundness caused by unaligned references. In particular, this change unbreaks compilation of version 0.2 of the popular winapi crate, which the unaligned references work would have otherwise broken.
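The underlying issue is that fields of a packed struct may be stored unaligned, so taking references to them is unsound; derived impls must read fields by value instead, which requires the fields (or the struct) to be Copy. A minimal illustration, with a made-up struct:

```rust
// `repr(packed)` removes inter-field padding, so `b` below sits at offset 1
// and may be unaligned. Taking `&p.b` would be rejected by the compiler;
// reading the field by value copies it out safely, which is what derived
// impls for packed Copy structs rely on.
#[derive(Clone, Copy)]
#[repr(C, packed)]
struct P {
    a: u8,
    b: u32, // stored unaligned because of repr(packed)
}

fn main() {
    let p = P { a: 1, b: 2 };
    let b = p.b; // by-value read: fine even though `b` is unaligned
    assert_eq!(b, 2);
    // No padding: 1 + 4 bytes.
    assert_eq!(std::mem::size_of::<P>(), 5);
}
```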
#101841: In this PR I finally removed -Zsave-analysis, an experimental compiler feature long slated for removal. This took a while because we decided to give everyone an additional three months' notice to prepare for this change.
I also had six weeks of vacation over Christmas, which was my first long stretch of time off in two years, hooray!
For the period 2022-10-25 to 2023-03-22 we had some good results on wall-time measurements.
- There were 523 results measured across 41 benchmarks.
- Of the 247 changes that met the significance threshold, 188 were improvements, and 59 were regressions. The mean change across these results was a reduction of 6.80%.
- If we consider all results, 332 were improved, 191 were worse, and the mean result was a reduction of 2.98%.
As always, these measurements are done on Linux.
This is a smaller improvement than we saw in my last blog post. However, compilers tend to get slower over time if careful effort isn’t made, so I still consider this a positive outcome. Thank you to everyone who contributed!