It has been more than six months since my last post on the Rust compiler’s performance. In that time I lost one job and gained another. I have less time to work directly on the Rust compiler than I used to, but I am still doing some stuff, while also working on other interesting things.

Compiler improvements

#142095: The compiler has a data structure called VecCache which is a key-value store used with keys that are densely-numbered IDs, such as CrateNum or LocalDefId. It’s a segmented vector with increasingly large buckets added as it grows. In this PR Josh Triplett optimized the common case when the key is in the first segment, which holds 4096 entries. This gave icount reductions across many benchmark runs, beyond 4% in the best cases.
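To make the shape of that optimization concrete, here is a minimal, single-threaded sketch of a segmented vector with a fast path for the first segment. It is not the compiler's actual VecCache code: the name SegmentedCache, the locate helper, and the exact doubling layout after the first 4096 entries are assumptions made for illustration, and the sketch ignores details a real implementation has to handle.

```rust
/// Number of low bits covered by the first segment (2^12 = 4096 entries,
/// matching the figure quoted above).
const FIRST_SEGMENT_BITS: u32 = 12;
const FIRST_SEGMENT_LEN: usize = 1 << FIRST_SEGMENT_BITS;

/// Hypothetical, simplified segmented vector keyed by dense integer IDs.
/// Assumed layout: segment 0 holds keys 0..4096, segment 1 holds 4096..8192,
/// segment 2 holds 8192..16384, and so on, doubling each time.
pub struct SegmentedCache<V> {
    segments: Vec<Vec<Option<V>>>,
}

impl<V> SegmentedCache<V> {
    pub fn new() -> Self {
        Self { segments: Vec::new() }
    }

    /// Maps a key to (segment index, offset within segment, segment length).
    #[inline]
    fn locate(key: u32) -> (usize, usize, usize) {
        let key = key as usize;
        // Fast path: small keys live in the first segment, so a single
        // comparison replaces the bit arithmetic below.
        if key < FIRST_SEGMENT_LEN {
            return (0, key, FIRST_SEGMENT_LEN);
        }
        // Slow path: the position of the highest set bit selects the segment.
        let high_bit = usize::BITS - 1 - key.leading_zeros(); // always >= 12 here
        let segment = (high_bit - FIRST_SEGMENT_BITS + 1) as usize;
        let len = 1usize << high_bit; // first key of the segment, and also its length
        (segment, key - len, len)
    }

    pub fn insert(&mut self, key: u32, value: V) {
        let (segment, offset, len) = Self::locate(key);
        if self.segments.len() <= segment {
            self.segments.resize_with(segment + 1, Vec::new);
        }
        let slots = &mut self.segments[segment];
        if slots.is_empty() {
            // Allocate a segment lazily, the first time a key lands in it.
            slots.resize_with(len, || None);
        }
        slots[offset] = Some(value);
    }

    pub fn get(&self, key: u32) -> Option<&V> {
        let (segment, offset, _) = Self::locate(key);
        self.segments.get(segment)?.get(offset)?.as_ref()
    }
}
```

Because IDs are handed out densely from zero, most lookups in a typical compilation hit keys below 4096, so the early return is the branch that matters.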

#148040: In this PR Ben Kimock added a fast path for lowering trivial consts. This reduced compile times for the libc crate by 5-15%! It’s unusual to see a change that affects a single real-world crate so much, across all compilation scenarios: debug and release, incremental and non-incremental. This is a great result. At the time of writing, libc is the #12 most popular crate on crates.io as measured by “recent downloads”, and #7 as measured by “all-time downloads”. This change also reduced icounts for a few other benchmarks by up to 10%.

#147293: In the query system there was a value computed on a hot path that was only used within a debug! call. In this PR I avoided doing that computation unless necessary, which gave icount reductions across many benchmark results, more than 3% in the best case. This was such a classic micro-optimization that I added it as an example to the Logging and Debugging chapter of The Rust Performance Book.
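The general pattern looks roughly like the following sketch. It is not the code from the PR: DEBUG_ENABLED, debug_enabled, describe, and the two lookup functions are hypothetical stand-ins that use only the standard library, and the call counter exists purely so the difference can be observed.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Hypothetical stand-in for a logging framework's "is debug logging on?" check.
static DEBUG_ENABLED: AtomicBool = AtomicBool::new(false);
// Counts how often the expensive computation runs, for demonstration only.
static EXPENSIVE_CALLS: AtomicUsize = AtomicUsize::new(0);

fn debug_enabled() -> bool {
    DEBUG_ENABLED.load(Ordering::Relaxed)
}

/// Hypothetical costly computation whose result is only useful in a log message.
fn describe(key: u32) -> String {
    EXPENSIVE_CALLS.fetch_add(1, Ordering::Relaxed);
    format!("key #{key}")
}

/// Before: the description is built on every call, even when nothing is logged.
fn lookup_eager(key: u32) -> u32 {
    let description = describe(key);
    if debug_enabled() {
        eprintln!("looking up {description}");
    }
    key * 2
}

/// After: the description is built only when the message will actually be emitted.
fn lookup_lazy(key: u32) -> u32 {
    if debug_enabled() {
        eprintln!("looking up {}", describe(key));
    }
    key * 2
}
```

With debug logging off, lookup_eager still pays for describe on every call, while lookup_lazy never runs it.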

#148706: In this PR dianne optimized the handling of temporary scopes. This reduced icounts on a number of benchmarks, 3% in the best case. It also reduced peak memory usage on some of the secondary benchmarks containing very large literals, by 5% in the best cases.

#143684: In this PR Nikita Popov upgraded the LLVM version used by the compiler to LLVM 21. In recent years every LLVM update has improved the speed of the Rust compiler. In this case the mean icount reduction across all benchmark results was an excellent 1.70%, and the mean cycle count reduction was 0.90%, but the mean wall-time saw an increase of 0.26%. Wall-time is the true metric, because it’s what users perceive, though it has high variance. icounts and cycles usually correlate well with wall-time, especially on large changes like this that affect many benchmarks, though this case is a counter-example. I’m not quite sure what to make of it; I don’t know whether the wall-time results on the test machine are representative.

#148789: In this PR Mara Bos reimplemented format_args!() and fmt::Arguments to be more space-efficient. This gave lots of small icount wins, and a couple of enormous (30-38%) wins for the large-workspace stress test. Mara wrote about this on Mastodon. She has also written about prior work on formatting on her blog and in this tracking issue. Lots of great reading there for people who love nitty-gritty optimization details, including nice diagrams of how data structures are laid out in memory.

Proc macro wins in Bevy

In June I added a new compiler flag -Zmacro-stats that measures how much code is generated by macros. I wrote previously about how I used it to optimize #[derive(Arbitrary)] from the arbitrary crate used for fuzzing.

I also used it to streamline the code generated by #[derive(Reflect)] in Bevy. This derive is used to implement reflection on many types and it produced a lot of code. For example, the bevy_ui crate was around 16,000 lines and 563,000 bytes of source code. The code generated by #[derive(Reflect)] for types within that crate was around 27,000 lines and 1,544,000 bytes. Macro expansion almost quadrupled the size of the code, mostly because of this one macro!

The code generated by #[derive(Reflect)] had a lot of redundancies. I made PRs to remove unnecessary calls, duplicate type bounds (and a follow-up), const _ blocks, closures, arguments, trait bounds, attributes, impls, and finally I factored out some repetition.

After doing this I measured the bevy_window crate. The size of the code generated by #[derive(Reflect)] was reduced by 39%, which reduced cargo check wall-time for that crate by 16%, and peak memory usage by 5%. And there are likely similar improvements across many other crates within Bevy, as well as programs that use #[derive(Reflect)] themselves.

It’s understandable that the generated code was suboptimal. Proc macros aren’t easy to write; there was previously no easy way to measure the size of the generated code; and the generated code was considered good enough because (a) it worked, and (b) the compiler would effectively optimize away all the redundancies. But in general it is more efficient to optimize away redundancies at the generation point, where context-specific and domain-specific information is available, rather than relying on sophisticated optimization machinery further down the compilation pipeline that has to reconstruct information. And it’s just less code to parse and represent in memory.
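As a toy illustration of removing redundancy at the generation point, compare two hand-written "expansions" of an imaginary derive. Nothing here is Bevy's real code: the FieldNames trait, the Before and After types, and the names_from helper are all hypothetical, chosen only to show how per-type boilerplate can shrink when the macro emits a call to shared library code.

```rust
/// Hypothetical trait that an imaginary derive macro implements.
trait FieldNames {
    fn field_names(&self) -> Vec<String>;
}

#[allow(dead_code)]
struct Before {
    width: u32,
    height: u32,
}

#[allow(dead_code)]
struct After {
    width: u32,
    height: u32,
}

// Verbose expansion: every derived type repeats the same collection logic
// inline, so the amount of generated code grows with each field and each type.
impl FieldNames for Before {
    fn field_names(&self) -> Vec<String> {
        let mut names = Vec::new();
        names.push(String::from("width"));
        names.push(String::from("height"));
        names
    }
}

/// Shared helper, written once in the library rather than once per derived type.
fn names_from(fields: &[&str]) -> Vec<String> {
    fields.iter().map(|f| f.to_string()).collect()
}

// Leaner expansion: the macro emits only the data that differs between types
// and delegates the common logic to the helper.
impl FieldNames for After {
    fn field_names(&self) -> Vec<String> {
        names_from(&["width", "height"])
    }
}
```

Both impls behave identically, but the second gives the compiler less code to parse, type-check, and hold in memory for every type that uses the derive.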

rustdoc-json

At RustWeek 2025 I had a conversation with Predrag Gruevski about rustdoc-json (invoked with the --output-format=json flag) and its effects on the performance of cargo-semver-checks. I spent some time looking into it and found one nice win.

#142335: In this PR I reduced the number of allocations done by rustdoc-json. This gave wall-time reductions of up to 10% and peak memory usage reductions of up to 8%.

I also tried various other things to improve rustdoc-json’s speed, without much success. JSON is simple and easy to parse, and rustdoc-json’s schema for representing Rust code is easy for humans to read. These features are great for newcomers and people who want to experiment. It also means the JSON output is space-inefficient, which limits the performance of heavy-duty tools like cargo-semver-checks that are designed for large codebases. There are some obvious space optimizations that could be applied to the JSON schema, like shortening field names, omitting fields with default values, and interning repeated strings. But these all affect its readability and flexibility.
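For example, interning repeated strings could look something like this sketch. The Interner type and its methods are hypothetical and are not part of rustdoc-json or its schema. The idea is that each distinct string is stored once in a table, and the rest of the output refers to it by a small integer ID.

```rust
use std::collections::HashMap;

/// Hypothetical string interner: each distinct string is stored once and
/// referred to elsewhere by a compact integer ID.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    /// Returns the ID for `s`, allocating a new one only on first sight.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.ids.insert(s.to_owned(), id);
        self.strings.push(s.to_owned());
        id
    }

    /// Looks up the string for an ID, if that ID was ever handed out.
    fn resolve(&self, id: u32) -> Option<&str> {
        self.strings.get(id as usize).map(String::as_str)
    }

    /// Number of distinct strings stored.
    fn len(&self) -> usize {
        self.strings.len()
    }
}
```

The cost is exactly the trade-off described above: output that says 17 instead of a full path is smaller, but a human can no longer read it without the table.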

The right solution here is probably to introduce a performance-oriented second format for the heavy-duty users. #142642 is a draft attempt at this. Hopefully progress can be made here in the future.

Faster compilation of large API crates

Josh Triplett introduced a new experimental flag, -Zhint-mostly-unused, which can give big compile time wins for people using small fractions of very large crates. This is typically the case for certain large API crates, such as windows, rustix, and aws-sdk-ec2. Read about it here.

Faster Rust builds on Mac

Did you know that macOS has a secret setting that can make Rust builds faster? No joke!

General progress

Progress since May must be split into two parts, because in July we changed the machine on which the measurements are done.

The first period (2025-05-20 to 2025-06-30) was on the old machine. The second period (2025-07-01 to 2025-12-03) was on the new machine.

The mean wall-time changes were moderate improvements (-3.19% and -2.65%). The mean peak memory usage changes were a wash (+1.18% and -1.50%). The mean binary size changes were small increases (+0.45% and +2.56%).

It’s good that wall-times went down overall, even if the other metrics were mixed. There is a slow but steady stream of bug fixes and new features to the compiler, which often hurt performance. In the absence of active performance work the natural tendency for a compiler is to get slower, so I view even small improvements as a win.

The new machine reduced wall-times by about 20%. It’s worth upgrading your hardware, if you can!