How fast can I build Rust?

I've been collecting some data on the fastest way to build the Rust compiler. This is primarily for Rust developers to optimise their workflow, but it might also be of general interest.

TL;DR: the fastest way to build Rust (on a machine with many cores) is with -j6 and RUSTFLAGS=-Ccodegen-units=10.

I tested using a commit from the 24th of November 2016. I was using the make build system (though I would expect the same results with Rustbuild). The test machine is a dedicated build machine: it has 12 physical cores, lots of RAM, and an SSD. It wasn't used for anything else during the benchmarking, and doesn't run a windowing system. It was running Ubuntu 16.10 (Linux). I only did one run per set of variables. That is not ideal, but where I repeated runs they were fairly consistent - usually within a second or two, and never more than 10 seconds apart. I've rounded all results to the nearest 10 seconds, and I believe that level of precision is about right for the experiment.

I varied the number of jobs (-jn) and the number of codegen units (RUSTFLAGS=-Ccodegen-units=n). The command line looked something like RUSTFLAGS=-Ccodegen-units=10 make -j6. I measured the time to do a normal build of the whole compiler and libraries (make), to build the stage1 compiler (make rustc-stage1, the minimal amount of work required to get a compiler for testing), and to build and bootstrap the compiler and run all tests (make && make check; I didn't run a plain make check because adding -jn to it causes many tests to be skipped, and setting codegen-units > 1 causes some tests to fail).
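
Putting that together, the three kinds of build I timed looked something like the following (shown with the settings that turned out fastest; these are a sketch of the invocations, not copies from my shell history):

    # Full build of the compiler and libraries
    RUSTFLAGS=-Ccodegen-units=10 make -j6

    # Minimal build for testing a change to the compiler (stage1 only)
    RUSTFLAGS=-Ccodegen-units=10 make rustc-stage1 -j6

    # Full build and bootstrap, then run all tests
    # (the env var prefix applies only to the first command, so make check
    # runs with default flags)
    RUSTFLAGS=-Ccodegen-units=10 make -j6 && make check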

The jobs number is the number of tasks make can run in parallel. Each of these jobs is a self-contained instance of the compiler, i.e., this is parallelism outside the compiler. The amount of parallelism is limited by the dependencies between crates in the compiler. Since the crates in the compiler are rather large and there are a lot of dependencies, the benefit of using a large number of jobs is much weaker than in a typical C or C++ project (e.g., LLVM). Note, however, that there is no real drawback to using a larger number of jobs; there just won't be any benefit.

Codegen units introduce parallelism within the compiler. First, some background. Compilation can be roughly split into two phases: first, code is analysed (parsing, type checking, etc.), then object code is generated from the products of analysis. The Rust compiler uses LLVM for the code generation part. In an optimised build, roughly half the time is spent in each of analysis and code generation, and nearly all optimisation is performed in the code generation phase.

The compilation unit in Rust is a crate; that is, the Rust compiler analyses and compiles a single crate at a time. By default, code generation also works at the same level of granularity. However, by specifying the number of codegen units, we tell the compiler that once analysis is complete, it should break the crate into smaller units and run LLVM code generation on each unit in parallel. That means we get parallelism inside the compiler, albeit only for about half of the work. There is a disadvantage, however: using multiple codegen units means the program will not be optimised as well as if a single unit were used. This is analogous to turning off LTO in a C program. For this reason, you should not use multiple codegen units when building production software.
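
To make the flag concrete, here is a hypothetical single-crate example (not part of the experiment; the file name is made up):

    # Ask rustc to split code generation for this crate into four units
    # and run LLVM on them in parallel. Compilation is faster, but the
    # resulting binary may be less well optimised.
    rustc -O -C codegen-units=4 main.rs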

So when building the compiler, if we use many codegen units we might expect the compilation to go faster, but when we run the new compiler, it will be slower. Since we use the new compiler to build at least the libraries and sometimes another compiler, this could be an important factor in the total time.

If you're interested in this kind of thing, we keep track of compiler performance at perf.r-l.o (although only single-threaded builds). Nicholas Nethercote has recently written a couple of blog posts on running and optimising the compiler.

make

This experiment ran a simple make build. It builds two versions of the compiler: the first using the last beta release, and the second using that first compiler.

        cg1     cg2     cg4     cg6     cg8     cg10    cg12
-j1     48m50s  39m40s  31m30s  29m50s  29m20s  -       -
-j2     34m10s  27m40s  21m40s  20m30s  20m10s  19m30s  19m20s
-j4     28m10s  23m00s  17m50s  16m50s  16m40s  16m00s  16m00s
-j6     27m40s  22m40s  17m20s  16m20s  16m10s  15m40s  15m50s
-j8     27m40s  22m30s  17m20s  16m30s  16m30s  15m40s  15m40s
-j10    27m40s  -       -       -       -       -       -
-j12    27m40s  -       -       -       -       -       -
-j14    27m50s  -       -       -       -       -       -
-j16    27m50s  -       -       -       -       -       -

In general, we get better results using more jobs and more codegen units. Looking at the number of jobs, there is no improvement beyond 6. For codegen units, the improvements diminish quickly, but there is some improvement right up to 10 (for all jobs > 2, 12 codegen units gave effectively the same result as 10). It is possible that 9 or 11 codegen units would be slightly better (I only tested even numbers), but probably not by enough to be significant, given the precision of the experiment.

make rustc-stage1

This experiment ran make rustc-stage1. That builds a single compiler and the libraries necessary to use that compiler. It is the minimal amount of work necessary to test modifications to the compiler. It is significantly quicker than make.
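
For context, a typical edit-and-test loop with this target might look like the sketch below. The path to the stage1 rustc is an assumption about the old make build's layout (it depends on your host triple), so treat it as illustrative:

    # Rebuild just the stage1 compiler after editing the source
    RUSTFLAGS=-Ccodegen-units=10 make rustc-stage1 -j6

    # Try the freshly built compiler on a test file
    # (assumed layout for an x86_64 Linux makefile build)
    x86_64-unknown-linux-gnu/stage1/bin/rustc test.rs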

        cg1     cg2     cg4     cg6     cg8     cg10    cg12
-j1     15m10s  12m10s  9m40s   9m10s   9m10s   8m50s   8m50s
-j2     11m00s  8m50s   6m50s   6m20s   6m20s   6m00s   6m00s
-j4     9m00s   7m30s   5m40s   5m20s   5m20s   5m10s   5m00s
-j6     9m00s   7m10s   5m30s   5m10s   5m00s   5m00s   5m00s

I only tested up to six jobs: if more jobs weren't profitable in the previous experiment, there is no way they could be here. It turned out that six jobs was only marginally better than four in this case, I assume because dependency bottlenecks loom larger here than in a full make.

I expected more codegen units to be more effective here (since the resulting compiler is used for less work), but I was wrong. This may just be due to the precision of the test (and the relatively shorter total time), but for all numbers of jobs, 6 codegen units were as good as any larger number. So, for this kind of build, six jobs and six codegen units are optimal; however, using ten codegen units (as for make) does no harm.

make -jn && make check

This experiment is the way to build all the compilers and libraries and run all tests. I measured the two parts separately. As you might expect, the first part corresponded exactly with the results of the make experiment. The second part (make check) took a fairly consistent amount of time; it is independent of the number of jobs, since the test infrastructure does its own parallelisation. I would expect compilation of tests to be slower with a compiler built with a larger number of codegen units, but the difference was marginal: for one or two codegen units, make check took 12m40s; for four to ten, it took 12m50s. That means the optimal build used six jobs and ten codegen units (as for make), giving a total time of 28m30s (cf. 61m40s for one job and one codegen unit).