Welcome to the fifteenth post in my long-running “How to speed up the Rust compiler” series, and the first in 2023.
Let’s start with some improvements made by others. The information about metrics at the top of my last post still applies.
#107449: In this PR, @saethlin enabled an existing MIR optimization pass called CopyProp. This gave a mean wall-time reduction of 0.54% across all benchmark results, and 0.83% across all optimized primary benchmark results. This was one of those changes that was fairly simple in the end, but took a lot of digging to uncover. (The linked Mastodon thread has information about various other MIR improvements that @saethlin made.)
#108815: In this PR, @the8472 sped up obligation processing, for wall-time reductions of 10-12% on keccak and 1-3% on some other benchmarks. Impressively, this was achieved by adding a really fast path on top of code that already had a heavily optimized fast path.
#103331: In this PR I changed the LLVM IR generated by the rustc front-end for debug builds to use br instead of a switch with two targets, when possible. This gave surprisingly good compile time reductions, with a mean reduction in cycles of 1.16% across all debug primary benchmark results, and maximum reductions of over 5%. The reason for the improvement is that switch instructions aren't handled by LLVM's FastISel instruction selector, so this change reduces the number of times that LLVM must fall back to the slower, more general SelectionDAG instruction selector.
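To make the transformation concrete, here is a sketch of the kind of two-way branch involved. The Rust function is ordinary; the LLVM IR shown in the comments is illustrative of the before/after shapes rather than rustc's exact output.

```rust
// In a debug build, a two-way branch like the `match` below could be lowered
// to a `switch` with two targets, e.g. (illustrative IR, not exact output):
//
//     switch i32 %x, label %bb_nonzero [ i32 0, label %bb_zero ]
//
// FastISel cannot handle `switch`, so LLVM would fall back to SelectionDAG.
// The equivalent conditional branch avoids that fallback:
//
//     %c = icmp eq i32 %x, 0
//     br i1 %c, label %bb_zero, label %bb_nonzero
//
fn classify(x: u32) -> &'static str {
    match x {
        0 => "zero",
        _ => "nonzero",
    }
}

fn main() {
    assert_eq!(classify(0), "zero");
    assert_eq!(classify(7), "nonzero");
}
```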
#103138: In this PR I changed the LLVM IR generated by the rustc front-end to use bigger basic blocks. The front-end was generating many small basic blocks that would unconditionally branch to another basic block. For example, if we have block A and block B, and the only way to reach B is via an unconditional branch at the end of A, then the two blocks can be merged. This change reduced the number of br (unconditional branch) instructions in the generated code by about 40%. Disappointingly, this had little effect on compile times, but it did cause a mean reduction in binary size of 0.37% across all debug primary benchmark results. It also makes the generated LLVM IR easier to read and debug.
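The merging criterion can be sketched on a toy control-flow graph. The types and function below are hypothetical illustrations, not rustc's actual IR types: a block B is folded into A whenever A ends in an unconditional branch to B and B has exactly one predecessor.

```rust
use std::collections::HashMap;

// Toy CFG: each block holds some instructions and a terminator.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Term {
    Goto(usize), // unconditional branch to another block
    Return,
}

#[derive(Clone, Debug)]
struct Block {
    insts: Vec<String>,
    term: Term,
}

// Merge B into A whenever A ends in Goto(B) and B has exactly one
// predecessor. Merged-away blocks become `None`.
fn merge_blocks(blocks: &mut Vec<Option<Block>>) {
    // Count predecessors of each block.
    let mut preds: HashMap<usize, usize> = HashMap::new();
    for b in blocks.iter().flatten() {
        if let Term::Goto(t) = b.term {
            *preds.entry(t).or_insert(0) += 1;
        }
    }
    for a in 0..blocks.len() {
        // Keep merging so chains A -> B -> C collapse into one block.
        loop {
            let target = match &blocks[a] {
                Some(blk) => match blk.term {
                    Term::Goto(t) if t != a && preds.get(&t) == Some(&1) => t,
                    _ => break,
                },
                None => break,
            };
            let b = blocks[target].take().expect("merge target exists");
            let a_blk = blocks[a].as_mut().expect("source block exists");
            a_blk.insts.extend(b.insts);
            a_blk.term = b.term;
        }
    }
}

fn main() {
    let mut blocks = vec![
        Some(Block { insts: vec!["x".into()], term: Term::Goto(1) }),
        Some(Block { insts: vec!["y".into()], term: Term::Return }),
    ];
    merge_blocks(&mut blocks);
    let b0 = blocks[0].as_ref().unwrap();
    assert_eq!(b0.insts, vec!["x".to_string(), "y".to_string()]);
    assert_eq!(b0.term, Term::Return);
    assert!(blocks[1].is_none());
}
```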
In my last post I mentioned various attempts to shrink AST/HIR/THIR nodes, and how they were less effective than I’d hoped. Since then I got a couple more wins from additional AST shrinkage.
#101562: In this PR I shrank the size of AST expression nodes from 104 bytes to 72 bytes. Expressions are by far the most common type of AST node, and the size reduction was large (30.8%), so this is something of a limit test for the improvements possible in practice from this kind of change. On primary benchmarks this gave several sub-3% wall-time wins. On secondary benchmarks, which include a number of stress tests with a lot of code, it gave up to 6% wall-time wins.
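One common technique for this kind of size reduction is boxing an enum's largest variant, which shrinks every variant to roughly pointer size. This is a generic sketch of the idea, not the specific changes made in #101562:

```rust
use std::mem::size_of;

// An enum is as large as its biggest variant (plus a tag), so one oversized
// variant inflates every value of the type.
#[allow(dead_code)]
enum Big {
    Small(u8),
    Large([u64; 8]), // 64-byte payload paid by every variant
}

// Boxing the large payload shrinks the enum: the variant now holds one word,
// and the payload lives on the heap.
#[allow(dead_code)]
enum Shrunk {
    Small(u8),
    Large(Box<[u64; 8]>),
}

fn main() {
    assert!(size_of::<Big>() > size_of::<Shrunk>());
    // Tag plus one pointer-sized field: at most two words.
    assert!(size_of::<Shrunk>() <= 2 * size_of::<usize>());
}
```

The trade-off is an extra allocation and pointer chase for the boxed variant, which is why such changes need benchmarking rather than being applied blindly.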
#104754: In this PR I changed a lot of Vec occurrences in AST nodes to ThinVec, which stores the length and capacity of non-empty vectors on the heap next to the vector's elements. This means that the size of a ThinVec struct is just one word, compared to three for a Vec. This makes it a good choice for vectors that are often empty, and it can also be used to shrink the largest variants of an enum, if those variants contain a Vec. This change reduced the size of many AST nodes, such as Item: 184 to 136 bytes; Fn: 184 to 152 bytes; Pat: 88 to 72 bytes; Generics: 72 to 40 bytes; Block: 48 to 32 bytes. This gave sub-3% icount reductions across many benchmarks.
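The size difference is easy to verify. ThinVecSketch below is a hypothetical stand-in for the real thin_vec::ThinVec, included only to show the one-word layout; it is not usable as a vector.

```rust
use std::mem::size_of;

// Heap header that a thin vector stores in front of its elements.
#[allow(dead_code)]
struct Header {
    len: usize,
    cap: usize,
}

// A thin vector is a single (possibly null) pointer to a `Header` followed
// by the elements. `Option<NonNull<_>>` stays one word thanks to the
// null-pointer niche. This is a layout sketch, not thin_vec's actual code.
#[allow(dead_code)]
struct ThinVecSketch<T> {
    ptr: Option<std::ptr::NonNull<Header>>,
    _marker: std::marker::PhantomData<T>,
}

fn main() {
    // Vec is three words: pointer + length + capacity.
    assert_eq!(size_of::<Vec<u32>>(), 3 * size_of::<usize>());
    // The thin representation is a single word.
    assert_eq!(size_of::<ThinVecSketch<u32>>(), size_of::<usize>());
}
```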
The results of these two PRs were good, but both of them were a pain, requiring multiple iterations to get clear performance wins. #101562 took over two months between the PR being opened and it being merged. #104754 took almost three months.
Finally, I tried two more related changes that didn't result in speed-ups. In #108387 I further shrank AST expressions from 72 to 64 bytes. I hoped this would be a win because 64 bytes is the typical size of a cache line, but it made no measurable difference to speed while making the AST code a little more complex. In #108332 I tried changing the length and capacity fields of ThinVec to u32, but that also had negligible effects.
The compiler has numerous lints that check for suspect code patterns. These require traversing the AST and HIR, which takes time. I thought at first that each lint was traversing the AST/HIR separately, and that time could be saved by merging them into fewer traversals. It turned out this had already been done, though the macro-heavy structure of the code made it non-obvious.
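The merged-traversal idea can be sketched on a toy AST. All the names here (Node, EarlyLint, DoubleNeg, run_lints) are made up for illustration; rustc's real lint infrastructure is macro-generated and far more elaborate.

```rust
// Toy AST for expressions.
enum Node {
    Lit(i64),
    Neg(Box<Node>),
    Add(Box<Node>, Box<Node>),
}

// Each lint implements a per-node check.
trait EarlyLint {
    fn check(&mut self, node: &Node);
}

// Example lint: count double negations like --x.
struct DoubleNeg {
    hits: usize,
}

impl EarlyLint for DoubleNeg {
    fn check(&mut self, node: &Node) {
        if let Node::Neg(inner) = node {
            if matches!(**inner, Node::Neg(_)) {
                self.hits += 1;
            }
        }
    }
}

// One walk over the tree invokes every registered lint at each node,
// instead of each lint traversing the tree separately.
fn run_lints(node: &Node, lints: &mut [&mut dyn EarlyLint]) {
    for lint in lints.iter_mut() {
        lint.check(node);
    }
    match node {
        Node::Lit(_) => {}
        Node::Neg(a) => run_lints(a, lints),
        Node::Add(a, b) => {
            run_lints(a, lints);
            run_lints(b, lints);
        }
    }
}

fn main() {
    // (--1) + 2
    let ast = Node::Add(
        Box::new(Node::Neg(Box::new(Node::Neg(Box::new(Node::Lit(1)))))),
        Box::new(Node::Lit(2)),
    );
    let mut dn = DoubleNeg { hits: 0 };
    let mut lints: [&mut dyn EarlyLint; 1] = [&mut dn];
    run_lints(&ast, &mut lints);
    assert_eq!(dn.hits, 1);
}
```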
I also saw some opportunities for code simplifications, which I did in #104863. Unfortunately this caused small but widespread regressions in lint performance. I tried to fix these regressions in #105291 and #105416 with minimal success. I finally realized that I had misunderstood one part of how lint traversals were structured, and then I fixed the regression properly in #105485, which required reverting all of #105291 and part of #104863.
In the end, there were some code simplifications and no performance improvements, but it is an instructive story. The lessons are that (a) even the most experienced of us still make mistakes, and (b) macro-heavy code can be hard to understand.
Miscellaneous performance changes
#105121: The compiler has a debugging option called -Zdump-mir. The code implementing it was causing some allocations even when it was disabled. In #105083 I avoided these allocations, but then @oli-obk reworked my code into a more comprehensive version in this PR, for sub-2% icount reductions across a few benchmarks.
#105436: In this PR I inlined a hot function, which gave a small win on one benchmark.
#107627: In this PR I micro-optimized the hot fold_ty methods in three different folding passes, one of which was TypeFreshener, for sub-2% icount wins across a couple of benchmarks.
#107717: In this PR I tweaked the implementation of TyKind::eq, for sub-2% icount wins across a couple of benchmarks.
#108692: In this PR I removed a redundant computation of a temporary scope during THIR building, for yet more sub-2% icount reductions across numerous benchmarks.
#108794: The compiler has code to compute a “crate hash” for each crate, which involves hashing the internal representation of the crate’s code. This crate hash is used in a few scenarios, such as during incremental compilation and when generating metadata for library crates. It is expensive to compute and there are some scenarios where it isn’t needed, such as non-incremental builds of binary crates. In this PR I avoided computing the crate hash when possible, for sub-3% icount reductions across many primary benchmarks, and some larger reductions across secondary benchmarks, including one of over 11%.
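The shape of that optimization can be sketched as follows. The types and methods (CrateInfo, needs_crate_hash, crate_hash) are hypothetical stand-ins for rustc's internals, included only to show the "compute only when a consumer needs it" pattern:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for the compiler's per-crate state.
struct CrateInfo {
    source: String,       // stand-in for the crate's internal representation
    is_incremental: bool, // incremental compilation enabled?
    emits_metadata: bool, // true for library crates
}

impl CrateInfo {
    // The hash is only consumed by incremental compilation and by
    // library-crate metadata generation.
    fn needs_crate_hash(&self) -> bool {
        self.is_incremental || self.emits_metadata
    }

    // Returns Some(hash) only when it will actually be used; a
    // non-incremental binary build skips the expensive computation entirely.
    fn crate_hash(&self) -> Option<u64> {
        if !self.needs_crate_hash() {
            return None;
        }
        let mut h = DefaultHasher::new();
        self.source.hash(&mut h);
        Some(h.finish())
    }
}

fn main() {
    let bin = CrateInfo {
        source: "fn main() {}".to_string(),
        is_incremental: false,
        emits_metadata: false,
    };
    assert!(bin.crate_hash().is_none()); // skipped

    let lib = CrateInfo {
        source: "pub fn f() {}".to_string(),
        is_incremental: false,
        emits_metadata: true,
    };
    assert!(lib.crate_hash().is_some()); // still computed when needed
}
```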
I use Cachegrind a lot when profiling the compiler. Cachegrind has a script, cg_annotate, that reads a profile and produces human-readable output. Annoyed by some shortcomings in this output, I tried some hacky pre-processing of Cachegrind profiles to improve the output. This change led directly to the final two PRs in the previous section.
So I wanted to make this change a permanent part of cg_annotate, and I have other ideas to improve the output further. However, there was a major problem: cg_annotate was written in Perl. This was a reasonable choice when I originally wrote it in 2002, but it's not good in 2023. Also, the code structure was poor, with lots of global variables and some very long functions. As a result, cg_annotate was difficult to modify, despite not being that large.
The solution? I rewrote it in Python. (Why not Rust? Valgrind already depends on Python, but doesn’t depend on Rust, and I didn’t want to add a Rust dependency.) The Python version is a little shorter and runs twice as fast. More importantly, it is also much easier to modify, due to (a) Python being nicer than Perl (especially the support for static types) and (b) better code structure (I’m a better programmer today than I was in 2002).
I have since made some small functionality changes to the new cg_annotate, and I will make larger ones soon. This will hopefully lead to more rustc speed improvements in the future.
Other non-performance things
Since my last post I also did a few significant things unrelated to compiler performance.
#104429: In this PR I made a small language change that affects how derive is implemented for packed structs. The new approach is simpler, works with generic packed structs, and works with packed structs that impl Copy without explicitly deriving it. The new approach also fails to compile some obscure cases that the old approach accepted, and a crater run showed a handful of crates were affected by this. This breakage was deemed acceptable by the lang team because this change ties in with the removal of some language unsoundness caused by unaligned references. In particular, this change unbreaks compilation of version 0.2 of the popular winapi crate, which the unaligned references work would have otherwise broken.
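The underlying issue is that fields of a packed struct may be stored unaligned, so taking references to them is unsound; derived impls must read fields by value instead, which requires the fields (or the struct) to be Copy. A minimal illustration, with a made-up struct:

```rust
// `repr(packed)` removes inter-field padding, so `b` below sits at offset 1
// and may be unaligned. Taking `&p.b` would be rejected by the compiler;
// reading the field by value copies it out safely, which is what derived
// impls for packed Copy structs rely on.
#[derive(Clone, Copy)]
#[repr(C, packed)]
struct P {
    a: u8,
    b: u32, // stored unaligned because of repr(packed)
}

fn main() {
    let p = P { a: 1, b: 2 };
    let b = p.b; // by-value read: fine even though `b` is unaligned
    assert_eq!(b, 2);
    // No padding: 1 + 4 bytes.
    assert_eq!(std::mem::size_of::<P>(), 5);
}
```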
#101841: In this PR I finally removed -Zsave-analysis, an experimental compiler feature long slated for removal. This took a while because we decided to give everyone an additional three months' notice to prepare for this change.
I also had six weeks of vacation over Christmas, which was my first long stretch of time off in two years, hooray!
For the period 2022-10-25 to 2023-03-22 we had some good results on wall-time measurements.
- There were 523 results measured across 41 benchmarks.
- Of the 247 changes that met the significance threshold, 188 were improvements, and 59 were regressions. The mean change across these results was a reduction of 6.80%.
- If we consider all results, 332 were improved, 191 were worse, and the mean result was a reduction of 2.98%.
As always, these measurements are done on Linux.
This is a smaller improvement than we saw in my last blog post. However, compilers tend to get slower over time if careful effort isn’t made, so I still consider this a positive outcome. Thank you to everyone who contributed!