Welcome to the fifteenth post in my long-running “How to speed up the Rust compiler” series, and the first in 2023.
Let’s start with some improvements made by others. The information about metrics at the top of my last post still applies.
#107449: In this PR, @saethlin enabled an existing MIR optimization pass called CopyProp. This gave a mean wall-time reduction of 0.54% across all benchmark results, and 0.83% across all optimized primary benchmark results. This was one of those changes that was fairly simple in the end, but took a lot of digging to uncover. (The linked Mastodon thread has information about various other MIR improvements that @saethlin made.)
#108815: In this PR, @the8472 sped up obligation processing, for wall-time reductions of 10-12% on keccak and 1-3% on some other benchmarks. Impressively, this was achieved by adding a really fast path on top of code that already had a heavily optimized fast path.
#103331: In this PR I changed the LLVM IR generated by the rustc front-end for debug builds to use br instead of a switch with two targets, when possible. This gave surprisingly good compile time reductions, with a mean reduction in cycles of 1.16% across all debug primary benchmark results, and maximum reductions of over 5%. The reason for the improvement is that switch instructions aren't handled by LLVM's FastISel instruction selector, so this change reduces the number of times that LLVM must fall back to the slower, more general SelectionDAG instruction selector.
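To make the transformation concrete, here is a sketch of the kind of two-way branch involved. The Rust function is ordinary; the LLVM IR shown in the comments is illustrative of the before/after shapes rather than rustc's exact output.

```rust
// In a debug build, a two-way branch like the `match` below could be lowered
// to a `switch` with two targets, e.g. (illustrative IR, not exact output):
//
//     switch i32 %x, label %bb_nonzero [ i32 0, label %bb_zero ]
//
// FastISel cannot handle `switch`, so LLVM would fall back to SelectionDAG.
// The equivalent conditional branch avoids that fallback:
//
//     %c = icmp eq i32 %x, 0
//     br i1 %c, label %bb_zero, label %bb_nonzero
//
fn classify(x: u32) -> &'static str {
    match x {
        0 => "zero",
        _ => "nonzero",
    }
}

fn main() {
    assert_eq!(classify(0), "zero");
    assert_eq!(classify(7), "nonzero");
}
```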
#103138: In this PR I changed the LLVM IR generated by the rustc front-end to use bigger basic blocks. The front-end was generating many small basic blocks that would unconditionally branch to another basic block. For example, if we have block A and block B, and the only way to reach B is via an unconditional branch at the end of A, then the two blocks can be merged. This change reduced the number of br (unconditional branch) instructions in the generated code by about 40%. Disappointingly, this had little effect on compile times, but it did cause a mean reduction in binary size of 0.37% across all debug primary benchmark results. It also makes the generated LLVM IR easier to read and debug.
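The merging criterion can be sketched on a toy control-flow graph. The types and function below are hypothetical illustrations, not rustc's actual IR types: a block B is folded into A whenever A ends in an unconditional branch to B and B has exactly one predecessor.

```rust
use std::collections::HashMap;

// Toy CFG: each block holds some instructions and a terminator.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Term {
    Goto(usize), // unconditional branch to another block
    Return,
}

#[derive(Clone, Debug)]
struct Block {
    insts: Vec<String>,
    term: Term,
}

// Merge B into A whenever A ends in Goto(B) and B has exactly one
// predecessor. Merged-away blocks become `None`.
fn merge_blocks(blocks: &mut Vec<Option<Block>>) {
    // Count predecessors of each block.
    let mut preds: HashMap<usize, usize> = HashMap::new();
    for b in blocks.iter().flatten() {
        if let Term::Goto(t) = b.term {
            *preds.entry(t).or_insert(0) += 1;
        }
    }
    for a in 0..blocks.len() {
        // Keep merging so chains A -> B -> C collapse into one block.
        loop {
            let target = match &blocks[a] {
                Some(blk) => match blk.term {
                    Term::Goto(t) if t != a && preds.get(&t) == Some(&1) => t,
                    _ => break,
                },
                None => break,
            };
            let b = blocks[target].take().expect("merge target exists");
            let a_blk = blocks[a].as_mut().expect("source block exists");
            a_blk.insts.extend(b.insts);
            a_blk.term = b.term;
        }
    }
}

fn main() {
    let mut blocks = vec![
        Some(Block { insts: vec!["x".into()], term: Term::Goto(1) }),
        Some(Block { insts: vec!["y".into()], term: Term::Return }),
    ];
    merge_blocks(&mut blocks);
    let b0 = blocks[0].as_ref().unwrap();
    assert_eq!(b0.insts, vec!["x".to_string(), "y".to_string()]);
    assert_eq!(b0.term, Term::Return);
    assert!(blocks[1].is_none());
}
```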
In my last post I mentioned various attempts to shrink AST/HIR/THIR nodes, and how they were less effective than I’d hoped. Since then I got a couple more wins from additional AST shrinkage.
#101562: In this PR I shrank the size of AST expression nodes from 104 bytes to 72 bytes. Expressions are by far the most common type of AST node, and the size reduction was large (30.8%), so this is something of a limit test for the improvements possible in practice from this kind of change. On primary benchmarks this gave several sub-3% wall-time wins. On secondary benchmarks, which include a number of stress tests with a lot of code, it gave up to 6% wall-time wins.
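One common technique for this kind of size reduction is boxing an enum's largest variant, which shrinks every variant to roughly pointer size. This is a generic sketch of the idea, not the specific changes made in #101562:

```rust
use std::mem::size_of;

// An enum is as large as its biggest variant (plus a tag), so one oversized
// variant inflates every value of the type.
#[allow(dead_code)]
enum Big {
    Small(u8),
    Large([u64; 8]), // 64-byte payload paid by every variant
}

// Boxing the large payload shrinks the enum: the variant now holds one word,
// and the payload lives on the heap.
#[allow(dead_code)]
enum Shrunk {
    Small(u8),
    Large(Box<[u64; 8]>),
}

fn main() {
    assert!(size_of::<Big>() > size_of::<Shrunk>());
    // Tag plus one pointer-sized field: at most two words.
    assert!(size_of::<Shrunk>() <= 2 * size_of::<usize>());
}
```

The trade-off is an extra allocation and pointer chase for the boxed variant, which is why such changes need benchmarking rather than being applied blindly.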
#104754: In this PR I changed a lot of Vec occurrences in AST nodes to ThinVec, which stores the length and capacity of non-empty vectors on the heap next to the vector's elements. This means that the size of a ThinVec struct is just one word, compared to three for a Vec. This makes it a good choice for vectors that are often empty, and it can also be used to shrink the largest variants of an enum, if those variants contain a Vec. This change reduced the size of many AST nodes, such as Item: 184 to 136 bytes; Fn: 184 to 152 bytes; Pat: 88 to 72 bytes; Generics: 72 to 40 bytes; Block: 48 to 32 bytes. This gave sub-3% icount reductions across many benchmarks.
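The size difference is easy to verify. ThinVecSketch below is a hypothetical stand-in for the real thin_vec::ThinVec, included only to show the one-word layout; it is not usable as a vector.

```rust
use std::mem::size_of;

// Heap header that a thin vector stores in front of its elements.
#[allow(dead_code)]
struct Header {
    len: usize,
    cap: usize,
}

// A thin vector is a single (possibly null) pointer to a `Header` followed
// by the elements. `Option<NonNull<_>>` stays one word thanks to the
// null-pointer niche. This is a layout sketch, not thin_vec's actual code.
#[allow(dead_code)]
struct ThinVecSketch<T> {
    ptr: Option<std::ptr::NonNull<Header>>,
    _marker: std::marker::PhantomData<T>,
}

fn main() {
    // Vec is three words: pointer + length + capacity.
    assert_eq!(size_of::<Vec<u32>>(), 3 * size_of::<usize>());
    // The thin representation is a single word.
    assert_eq!(size_of::<ThinVecSketch<u32>>(), size_of::<usize>());
}
```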
The results of these two PRs were good, but both of them were a pain, requiring multiple iterations to get clear performance wins. #101562 took over two months between the PR being opened and it being merged. #104754 took almost three months.
Finally, I tried two more related changes that didn't result in speed-ups. In #108387 I further shrank AST expressions from 72 to 64 bytes. I hoped this would be a win because 64 bytes is the typical size of a cache line, but it made no measurable difference to speed while making the AST code a little more complex. In #108332 I tried changing the length and capacity fields of ThinVec to u32, but that also had negligible effects.
The compiler has numerous lints that check for suspect code patterns. These require traversing the AST and HIR, which takes time. I thought at first that each lint was traversing the AST/HIR separately, and that time could be saved by merging them into fewer traversals. It turned out this had already been done, though the macro-heavy structure of the code made it non-obvious.
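The merged-traversal idea can be sketched on a toy AST. All the names here (Node, EarlyLint, DoubleNeg, run_lints) are made up for illustration; rustc's real lint infrastructure is macro-generated and far more elaborate.

```rust
// Toy AST for expressions.
enum Node {
    Lit(i64),
    Neg(Box<Node>),
    Add(Box<Node>, Box<Node>),
}

// Each lint implements a per-node check.
trait EarlyLint {
    fn check(&mut self, node: &Node);
}

// Example lint: count double negations like --x.
struct DoubleNeg {
    hits: usize,
}

impl EarlyLint for DoubleNeg {
    fn check(&mut self, node: &Node) {
        if let Node::Neg(inner) = node {
            if matches!(**inner, Node::Neg(_)) {
                self.hits += 1;
            }
        }
    }
}

// One walk over the tree invokes every registered lint at each node,
// instead of each lint traversing the tree separately.
fn run_lints(node: &Node, lints: &mut [&mut dyn EarlyLint]) {
    for lint in lints.iter_mut() {
        lint.check(node);
    }
    match node {
        Node::Lit(_) => {}
        Node::Neg(a) => run_lints(a, lints),
        Node::Add(a, b) => {
            run_lints(a, lints);
            run_lints(b, lints);
        }
    }
}

fn main() {
    // (--1) + 2
    let ast = Node::Add(
        Box::new(Node::Neg(Box::new(Node::Neg(Box::new(Node::Lit(1)))))),
        Box::new(Node::Lit(2)),
    );
    let mut dn = DoubleNeg { hits: 0 };
    let mut lints: [&mut dyn EarlyLint; 1] = [&mut dn];
    run_lints(&ast, &mut lints);
    assert_eq!(dn.hits, 1);
}
```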
I also saw some opportunities for code simplifications, which I did in #104863. Unfortunately this caused small but widespread regressions in lint performance. I tried to fix these regressions in #105291 and #105416 with minimal success. I finally realized that I had misunderstood one part of how lint traversals were structured, and then I fixed the regression properly in #105485, which required reverting all of #105291 and part of #104863.
In the end, there were some code simplifications and no performance improvements, but it is an instructive story. The lessons are that (a) even the most experienced of us still make mistakes, and (b) macro-heavy code can be hard to understand.
Miscellaneous performance changes
#105121: The compiler has a debugging option called -Zdump-mir. The code implementing it was causing some allocations even when it was disabled. In #105083 I avoided these allocations, but then @oli-obk reworked my code into a more comprehensive version in this PR, for sub-2% icount reductions across a few benchmarks.
#105436: In this PR I inlined a hot function, which gave a small win on one benchmark.
#107627: In this PR I micro-optimized the hot fold_ty methods in three different folding passes, one of which was TypeFreshener, for sub-2% icount wins across a couple of benchmarks.
#107717: In this PR I tweaked the implementation of TyKind::eq, for sub-2% icount wins across a couple of benchmarks.
#108692: In this PR I removed a redundant computation of a temporary scope during THIR building, for yet more sub-2% icount reductions across numerous benchmarks.
#108794: The compiler has code to compute a “crate hash” for each crate, which involves hashing the internal representation of the crate’s code. This crate hash is used in a few scenarios, such as during incremental compilation and when generating metadata for library crates. It is expensive to compute and there are some scenarios where it isn’t needed, such as non-incremental builds of binary crates. In this PR I avoided computing the crate hash when possible, for sub-3% icount reductions across many primary benchmarks, and some larger reductions across secondary benchmarks, including one of over 11%.
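The shape of that optimization can be sketched as follows. The types and methods (CrateInfo, needs_crate_hash, crate_hash) are hypothetical stand-ins for rustc's internals, included only to show the "compute only when a consumer needs it" pattern:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for the compiler's per-crate state.
struct CrateInfo {
    source: String,       // stand-in for the crate's internal representation
    is_incremental: bool, // incremental compilation enabled?
    emits_metadata: bool, // true for library crates
}

impl CrateInfo {
    // The hash is only consumed by incremental compilation and by
    // library-crate metadata generation.
    fn needs_crate_hash(&self) -> bool {
        self.is_incremental || self.emits_metadata
    }

    // Returns Some(hash) only when it will actually be used; a
    // non-incremental binary build skips the expensive computation entirely.
    fn crate_hash(&self) -> Option<u64> {
        if !self.needs_crate_hash() {
            return None;
        }
        let mut h = DefaultHasher::new();
        self.source.hash(&mut h);
        Some(h.finish())
    }
}

fn main() {
    let bin = CrateInfo {
        source: "fn main() {}".to_string(),
        is_incremental: false,
        emits_metadata: false,
    };
    assert!(bin.crate_hash().is_none()); // skipped

    let lib = CrateInfo {
        source: "pub fn f() {}".to_string(),
        is_incremental: false,
        emits_metadata: true,
    };
    assert!(lib.crate_hash().is_some()); // still computed when needed
}
```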
I use Cachegrind a lot when profiling the compiler. Cachegrind has a script, cg_annotate, that reads a profile and produces human-readable output. Annoyed by some shortcomings in this output, I tried some hacky pre-processing of Cachegrind profiles to improve the output. This change led directly to the final two PRs in the previous section.
So I wanted to make this change a permanent part of cg_annotate, and I have other ideas to improve the output further. However, there was a major problem: cg_annotate was written in Perl. This was a reasonable choice when I originally wrote it in 2002, but it's not good in 2023. Also, the code structure was poor, with lots of global variables and some very long functions. As a result, cg_annotate was difficult to modify, despite not being that large.
The solution? I rewrote it in Python. (Why not Rust? Valgrind already depends on Python, but doesn’t depend on Rust, and I didn’t want to add a Rust dependency.) The Python version is a little shorter and runs twice as fast. More importantly, it is also much easier to modify, due to (a) Python being nicer than Perl (especially the support for static types) and (b) better code structure (I’m a better programmer today than I was in 2002).
I have since made some small functionality changes to the new cg_annotate, and I will make larger ones soon. This will hopefully lead to more rustc speed improvements in the future.
Other non-performance things
Since my last post I also did a few significant things unrelated to compiler performance.
#104429: In this PR I made a small language change that affects how derive is implemented for packed structs. The new approach is simpler, works with generic packed structs, and works with packed structs that impl Copy without explicitly deriving it. The new approach also fails to compile some obscure cases that the old approach accepted, and a crater run showed a handful of crates were affected by this. This breakage was deemed acceptable by the lang team because this change ties in with the removal of some language unsoundness caused by unaligned references. In particular, this change unbreaks compilation of version 0.2 of the popular winapi crate, which the unaligned references work would have otherwise broken.
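The underlying issue is that fields of a packed struct may be stored unaligned, so taking references to them is unsound; derived impls must read fields by value instead, which requires the fields (or the struct) to be Copy. A minimal illustration, with a made-up struct:

```rust
// `repr(packed)` removes inter-field padding, so `b` below sits at offset 1
// and may be unaligned. Taking `&p.b` would be rejected by the compiler;
// reading the field by value copies it out safely, which is what derived
// impls for packed Copy structs rely on.
#[derive(Clone, Copy)]
#[repr(C, packed)]
struct P {
    a: u8,
    b: u32, // stored unaligned because of repr(packed)
}

fn main() {
    let p = P { a: 1, b: 2 };
    let b = p.b; // by-value read: fine even though `b` is unaligned
    assert_eq!(b, 2);
    // No padding: 1 + 4 bytes.
    assert_eq!(std::mem::size_of::<P>(), 5);
}
```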
#101841: In this PR I finally removed -Zsave-analysis, an experimental compiler feature long slated for removal. This took a while because we decided to give everyone an additional three months' notice to prepare for this change.
I also had six weeks of vacation over Christmas, which was my first long stretch of time off in two years, hooray!
For the period 2022-10-25 to 2023-03-22 we had some good results on wall-time measurements.
- There were 523 results measured across 41 benchmarks.
- Of the 247 changes that met the significance threshold, 188 were improvements, and 59 were regressions. The mean change across these results was a reduction of 6.80%.
- If we consider all results, 332 were improved, 191 were worse, and the mean result was a reduction of 2.98%.
As always, these measurements are done on Linux.
This is a smaller improvement than we saw in my last blog post. However, compilers tend to get slower over time if careful effort isn’t made, so I still consider this a positive outcome. Thank you to everyone who contributed!