Build Configuration
You can drastically change the performance of a Rust program without changing its code, just by changing its build configuration. There are many possible build configurations for each Rust program. The one chosen will affect several characteristics of the compiled code, such as compile times, runtime speed, memory use, binary size, debuggability, profilability, and which architectures your compiled program will run on.
Most configuration choices will improve one or more characteristics while worsening one or more others. For example, a common trade-off is to accept worse compile times in exchange for higher runtime speeds. The right choice for your program depends on your needs and the specifics of your program, and performance-related choices (which is most of them) should be validated with benchmarking.
It is worth reading this chapter carefully to understand all the build configuration choices. However, for the impatient or forgetful, cargo-wizard encapsulates this information and can help you choose an appropriate build configuration.
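If you want to try it, a minimal sketch of the usual workflow (assuming the standard cargo install mechanism for distributing the subcommand) is:
# Install the subcommand once, then run it interactively in your project.
cargo install cargo-wizard
cargo wizard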
Note that Cargo only looks at the profile settings in the Cargo.toml file at the root of the workspace. Profile settings defined in dependencies are ignored. Therefore, these options are mostly relevant for binary crates, not library crates.
Release Builds
The single most important build configuration choice is simple but easy to overlook: make sure you are using a release build rather than a dev build when you want high performance. This is usually done by specifying the --release flag to Cargo.
Dev builds are the default. They are good for debugging, but are not optimized. They are produced if you run cargo build or cargo run. (Alternatively, running rustc without additional options also produces an unoptimized build.)
Consider the following final line of output from a cargo build run.
Finished dev [unoptimized + debuginfo] target(s) in 29.80s
This output indicates that a dev build has been produced. The compiled code will be placed in the target/debug/ directory. cargo run will run the dev build.
In comparison, release builds are much more optimized, omit debug assertions and integer overflow checks, and omit debug info. 10-100x speedups over dev builds are common! They are produced if you run cargo build --release or cargo run --release. (Alternatively, rustc has multiple options for optimized builds, such as -O and -C opt-level.) This will typically take longer than a dev build because of the additional optimizations.
Consider the following final line of output from a cargo build --release run.
Finished release [optimized] target(s) in 1m 01s
This output indicates that a release build has been produced. The compiled code will be placed in the target/release/ directory. cargo run --release will run the release build.
See the Cargo profile documentation for more details about the differences between dev builds (which use the dev profile) and release builds (which use the release profile).
The default build configuration choices used in release builds provide a good balance between the characteristics mentioned above, such as compile times, runtime speed, and binary size. But there are many possible adjustments, as the following sections explain.
Maximizing Runtime Speed
The following build configuration options are designed primarily to maximize runtime speed. Some of them may also reduce binary size.
Codegen Units
The Rust compiler splits crates into multiple codegen units to parallelize (and thus speed up) compilation. However, this might cause it to miss some potential optimizations. You may be able to improve runtime speed and reduce binary size, at the cost of increased compile times, by setting the number of units to one. Add these lines to the Cargo.toml file:
[profile.release]
codegen-units = 1
Link-time Optimization
Link-time optimization (LTO) is a whole-program optimization technique that can improve runtime speed by 10-20% or more, and also reduce binary size, at the cost of worse compile times. It comes in several forms.
The first form of LTO is thin local LTO, a lightweight form of LTO. By default the compiler uses this for any build that involves a non-zero level of optimization. This includes release builds. To explicitly request this level of LTO, put these lines in the Cargo.toml file:
[profile.release]
lto = false
The second form of LTO is thin LTO, which is a little more aggressive, and likely to improve runtime speed and reduce binary size while also increasing compile times. Use lto = "thin" in Cargo.toml to enable it.
The third form of LTO is fat LTO, which is even more aggressive, and may improve performance and reduce binary size further while increasing build times again. Use lto = "fat" in Cargo.toml to enable it.
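As a sketch, a release profile that combines fat LTO with the single codegen unit setting from the previous section would look like this in Cargo.toml (benchmark to confirm it actually helps your program):
[profile.release]
# Trade longer compile times for potentially faster, smaller binaries.
lto = "fat"
codegen-units = 1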
Finally, it is possible to fully disable LTO, which will likely worsen runtime speed and increase binary size but reduce compile times. Use lto = "off" in Cargo.toml for this. Note that this is different to the lto = false option, which, as mentioned above, leaves thin local LTO enabled.
Alternative Allocators
It is possible to replace the default (system) heap allocator used by a Rust program with an alternative allocator. The exact effect will depend on the individual program and the alternative allocator chosen, but large improvements in runtime speed and large reductions in memory usage have been seen in practice. The effect will also vary across platforms, because each platform’s system allocator has its own strengths and weaknesses. The use of an alternative allocator is also likely to increase binary size and compile times.
jemalloc
One popular alternative allocator for Linux and Mac is jemalloc, usable via the tikv-jemallocator crate. To use it, add a dependency to your Cargo.toml file:
[dependencies]
tikv-jemallocator = "0.5"
Then add the following to your Rust code, e.g. at the top of src/main.rs:
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
Furthermore, on Linux, jemalloc can be configured to use transparent huge pages (THP). This can further speed up programs, possibly at the cost of higher memory usage.
Do this by setting the MALLOC_CONF environment variable appropriately before building your program, for example:
MALLOC_CONF="thp:always,metadata_thp:always" cargo build --release
The system running the compiled program also has to be configured to support THP. See this blog post for more details.
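For example, on many Linux systems the kernel's THP mode can be inspected and changed through sysfs; the exact procedure varies by distribution, so treat this as a sketch:
# Show the current THP mode ("always", "madvise", or "never").
cat /sys/kernel/mm/transparent_hugepage/enabled
# Enable THP system-wide (requires root); "madvise" is a less aggressive alternative.
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled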
mimalloc
Another alternative allocator that works on many platforms is mimalloc, usable via the mimalloc crate. To use it, add a dependency to your Cargo.toml file:
[dependencies]
mimalloc = "0.1"
Then add the following to your Rust code, e.g. at the top of src/main.rs:
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
CPU Specific Instructions
If you do not care about the compatibility of your binary on older (or other types of) processors, you can tell the compiler to generate the newest (and potentially fastest) instructions specific to a certain CPU architecture, such as AVX SIMD instructions for x86-64 CPUs.
To request these instructions from the command line, use the -C target-cpu=native flag. For example:
RUSTFLAGS="-C target-cpu=native" cargo build --release
Alternatively, to request these instructions from a config.toml file (for one or more projects), add these lines:
[build]
rustflags = ["-C", "target-cpu=native"]
This can improve runtime speed, especially if the compiler finds vectorization opportunities in your code.
If you are unsure whether -C target-cpu=native is working optimally, compare the output of rustc --print cfg and rustc --print cfg -C target-cpu=native to see if the CPU features are being detected correctly in the latter case. If not, you can use -C target-feature to target specific features.
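For example, a hypothetical invocation that explicitly enables the AVX2 and FMA features on x86-64 might look like this (the feature names you need depend on your code and target):
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release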
Profile-guided Optimization
Profile-guided optimization (PGO) is a compilation model where you compile your program, run it on sample data while collecting profiling data, and then use that profiling data to guide a second compilation of the program. This can improve runtime speed by 10% or more. Example 1, Example 2.
It is an advanced technique that takes some effort to set up, but is worthwhile in some cases. See the rustc PGO documentation for details. Also, the cargo-pgo command makes it easier to use PGO (and BOLT, which is similar) to optimize Rust binaries.
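As a rough sketch of the cargo-pgo workflow (the binary path and workload below are illustrative; consult the cargo-pgo documentation for the exact commands and options):
# One-off setup; cargo-pgo relies on LLVM's profiling tools.
cargo install cargo-pgo
rustup component add llvm-tools-preview
# 1. Build an instrumented binary.
cargo pgo build
# 2. Run it on representative workloads to collect profiles.
./target/release/my-program --typical-workload
# 3. Rebuild, optimized using the collected profiles.
cargo pgo optimize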
Unfortunately, PGO is not supported for binaries hosted on crates.io and distributed via cargo install, which limits its usability.
Minimizing Binary Size
The following build configuration options are designed primarily to minimize binary size. Their effects on runtime speed vary.
Optimization Level
You can request an optimization level that aims to minimize binary size by adding these lines to the Cargo.toml file:
[profile.release]
opt-level = "z"
This may also reduce runtime speed.
An alternative is opt-level = "s", which targets minimal binary size a little less aggressively. Compared to opt-level = "z", it allows slightly more inlining and also the vectorization of loops.
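To try this setting, the corresponding lines in the Cargo.toml file are:
[profile.release]
opt-level = "s"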
Abort on panic!
If you do not need to unwind on panic, e.g. because your program doesn't use catch_unwind, you can tell the compiler to simply abort on panic. On panic, your program will still produce a backtrace.
This might reduce binary size and increase runtime speed slightly, and may even reduce compile times slightly. Add these lines to the Cargo.toml file:
[profile.release]
panic = "abort"
Strip Debug Info and Symbols
You can tell the compiler to strip debug info and symbols from the compiled binary. Add these lines to Cargo.toml to strip just debug info:
[profile.release]
strip = "debuginfo"
Alternatively, use strip = "symbols" to strip both debug info and symbols.
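The corresponding Cargo.toml entry is:
[profile.release]
strip = "symbols"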
Prior to Rust 1.77, the default behaviour was to do no stripping. As of Rust 1.77 the default behaviour is to strip debug info in release builds.
Stripping debug info can greatly reduce binary size. On Linux, the binary size of a small Rust program might shrink by 4x when debug info is stripped. Stripping symbols can also reduce binary size, though generally not by as much. Example. The exact effects are platform-dependent.
However, stripping makes your compiled program more difficult to debug and profile. For example, if a stripped program panics, the backtrace produced may contain less useful information than normal. The exact effects for the two levels of stripping depend on the platform.
Other Ideas
For more advanced binary size minimization techniques, consult the comprehensive documentation in the excellent min-sized-rust repository.
Minimizing Compile Times
The following build configuration options are designed primarily to minimize compile times.
Linking
A big part of compile time is actually linking time, particularly when rebuilding a program after a small change. It is possible to select a faster linker than the default one.
One option is lld, which is available on Linux and Windows. To specify lld from the command line, use the -C link-arg=-fuse-ld=lld flag. For example:
RUSTFLAGS="-C link-arg=-fuse-ld=lld" cargo build --release
Alternatively, to specify lld from a config.toml file (for one or more projects), add these lines:
[build]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]
lld is not fully supported for use with Rust, but it should work for most use cases on Linux and Windows. There is a GitHub Issue tracking full support for lld.
Another option is mold, which is currently available on Linux and macOS. Simply substitute mold for lld in the instructions above. mold is often faster than lld. Example. It is also much newer and may not work in all cases.
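Following that substitution, the command-line invocation would be as follows (mold must be installed separately, and your linker driver must support -fuse-ld=mold):
RUSTFLAGS="-C link-arg=-fuse-ld=mold" cargo build --release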
Unlike the other options in this chapter, there are no trade-offs here! Alternative linkers can be dramatically faster, without any downsides.
Experimental Parallel Front-end
If you use nightly Rust, you can enable the experimental parallel front-end. It may reduce compile times at the cost of higher compile-time memory usage. It won’t affect the quality of the generated code.
You can do that by adding -Zthreads=N to RUSTFLAGS, for example:
RUSTFLAGS="-Zthreads=8" cargo build --release
Alternatively, to enable the parallel front-end from a config.toml file (for one or more projects), add these lines:
[build]
rustflags = ["-Z", "threads=8"]
Values other than 8 are possible, but that is the number that tends to give the best results.
In the best cases, the experimental parallel front-end reduces compile times by up to 50%. But the effects vary widely and depend on the characteristics of the code and its build configuration, and for some programs there is no compile time improvement.
Cranelift Codegen Back-end
If you use nightly Rust on x86-64/Linux or ARM/Linux, you can enable the Cranelift codegen back-end. It may reduce compile times at the cost of lower quality generated code, and therefore is recommended for dev builds rather than release builds.
First, install the back-end with this rustup command:
rustup component add rustc-codegen-cranelift-preview --toolchain nightly
To select Cranelift from the command line, use the -Zcodegen-backend=cranelift flag. For example:
RUSTFLAGS="-Zcodegen-backend=cranelift" cargo +nightly build
Alternatively, to specify Cranelift from a config.toml file (for one or more projects), add these lines:
[unstable]
codegen-backend = true
[profile.dev]
codegen-backend = "cranelift"
For more information, see the Cranelift documentation.
Custom profiles
In addition to the dev and release profiles, Cargo supports custom profiles. It might be useful, for example, to create a custom profile halfway between dev and release if you find the runtime speed of dev builds insufficient and the compile times of release builds too slow for everyday development.
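For example, a custom profile (the name fast-dev here is arbitrary) that inherits from dev but enables some optimization might be sketched in Cargo.toml like this:
[profile.fast-dev]
# Custom profiles must name a profile to inherit from.
inherits = "dev"
opt-level = 2
Build with it using cargo build --profile fast-dev; its output goes into the target/fast-dev/ directory.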
Summary
There are many choices to be made when it comes to build configurations. The following points summarize the above information into some recommendations.
- If you want to maximize runtime speed, consider all of the following: codegen-units = 1, lto = "fat", an alternative allocator, and panic = "abort" (a combined profile along these lines is sketched after this list).
- If you want to minimize binary size, consider opt-level = "z", codegen-units = 1, lto = "fat", panic = "abort", and strip = "symbols".
- In either case, consider -C target-cpu=native if broad architecture support is not needed, and cargo-pgo if it works with your distribution mechanism.
- Always use a faster linker if you are on a platform that supports it, because there are no downsides to doing so.
- Use cargo-wizard if you need additional help with these choices.
- Benchmark all changes, one at a time, to ensure they have the expected effects.
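As a starting point for the speed-oriented suggestions, a combined release profile might be sketched like this (treat it as a baseline to benchmark against your own program, not a universal recommendation):
[profile.release]
# Each of these trades compile time or debuggability for runtime performance.
codegen-units = 1
lto = "fat"
panic = "abort"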
Finally, this issue tracks the evolution of the Rust compiler’s own build configuration. The Rust compiler’s build system is stranger and more complex than that of most Rust programs. Nonetheless, this issue may be instructive in showing how build configuration choices can be applied to a large program.