RFC 2360: Add bench_black_box to core

libs | compiler (hint | profiling)

Summary

This RFC adds core::hint::bench_black_box (see black box), an identity function that hints the compiler to be maximally pessimistic in terms of the assumptions about what bench_black_box could do.

Motivation

Due to the constrained nature of synthetic benchmarks, the compiler is often able to perform optimizations that wouldn't otherwise trigger in practice, like completely removing a benchmark if it has no side-effects.

Currently, stable Rust users need to introduce expensive operations into their programs to prevent these optimizations. Examples thereof are volatile loads and stores, or calling unknown functions via C FFI. These operations incur overheads that often would not be present in the application the synthetic benchmark is trying to model.

Guide-level explanation

hint::bench_black_box

The hint:

pub fn bench_black_box<T>(x: T) -> T;

behaves like the identity function: it just returns x and has no effects. However, Rust implementations are encouraged to assume that bench_black_box can use x in any possible valid way that Rust code is allowed to without introducing undefined behavior in the calling code. That is, implementations are encouraged to be maximally pessimistic in terms of optimizations.

This property makes bench_black_box useful for writing code in which certain optimizations are not desired, but too unreliable when disabling these optimizations is required for correctness.

Example 1 - basics

Example 1 (rust.godbolt.org):

fn foo(x: i32) -> i32 { 
  hint::bench_black_box(2 + x);
  3
}
let a = foo(2);

In this example, the compiler may simplify the expression 2 + x down to 4. However, even though 4 is not read by anything afterwards, it must be computed and materialized, for example, by storing it into memory, a register, etc. because the current Rust implementation assumes that bench_black_box could try to read it.

Example 2 - benchmarking Vec::push

The hint::bench_black_box is useful for producing synthetic benchmarks that more accurately represent the behavior of a real application. In the following example, the function bench executes Vec::push 4 times in a loop:

fn push_cap(v: &mut Vec<i32>) {
    for i in 0..4 {
      v.push(i);
    }
}

pub fn bench_push() -> Duration { 
    let mut v = Vec::with_capacity(4);
    let now = Instant::now();
    push_cap(&mut v);
    now.elapsed()
}

This example allocates a Vec, pushes into it without growing its capacity, and drops it, without ever using it for anything. The current Rust implementation emits the following x86_64 machine code (https://rust.godbolt.org/z/wDckJF):

example::bench_push:
  sub rsp, 24
  call std::time::Instant::now@PLT
  mov qword ptr [rsp + 8], rax
  mov qword ptr [rsp + 16], rdx
  lea rdi, [rsp + 8]
  call std::time::Instant::elapsed@PLT
  add rsp, 24
  ret

LLVM is pretty amazing: it has optimized the Vec allocation and the calls to push_cap away. In doing so, it has made our benchmark useless. It won't measure the time it takes to perform the calls to Vec::push as we intended.

In real applications, the program will use the vector for something, preventing these optimizations. To produce a benchmark that takes that into account, we can hint the compiler that the Vec is used for something (https://rust.godbolt.org/z/CeXmxN):

fn push_cap(v: &mut Vec<i32>) {
    for i in 0..4 {
        bench_black_box(v.as_ptr());
        v.push(bench_black_box(i));
        bench_black_box(v.as_ptr());
    }
}

Inspecting the machine code reveals that, for this particular Rust implementation, bench_black_box successfully prevents LLVM from performing the optimization that removes the Vec::push calls that we wanted to measure.

Reference-level explanation

The

mod core::hint {
    /// Identity function that disables optimizations.
    pub fn bench_black_box<T>(x: T) -> T;
}

is a NOP that returns x, that is, its operational semantics are equivalent to the identity function.

Implementations are encouraged, but not required, to treat bench_black_box as an unknown function that can perform any valid operation on x that Rust is allowed to perform without introducing undefined behavior in the calling code. That is, to optimize bench_black_box under the pessimistic assumption that it might do anything with the data it got, even though it actually does nothing.

Drawbacks

Slightly increases the surface complexity of libcore.

Rationale and alternatives

Further rationale influencing this design is available in https://github.com/nikomatsakis/rust-memory-model/issues/45

clobber

A previous version of this RFC also provided a clobber function:

/// Flushes all pending writes to memory. 
pub fn clobber() -> ();

In https://github.com/nikomatsakis/rust-memory-model/issues/45 it was realized that such a function cannot work properly within Rust's memory model.

value_fence / evaluate_and_drop

An alternative design was proposed during the discussion on rust-lang/rfcs/issues/1484, in which the following two functions are provided instead:

#[inline(always)]
pub fn value_fence<T>(x: T) -> T {
    let y = unsafe { (&x as *T).read_volatile() };
    std::hint::forget(x);
    y
}

#[inline(always)]
pub fn evaluate_and_drop<T>(x: T) {
    unsafe {
        let mut y = std::hint::uninitialized();
        std::ptr::write_volatile(&mut y as *mut T, x);
        drop(y); // not necessary but for clarity
    }
}

This approach is not pursued in this RFC because these two functions:

bench_input / bench_outpu

@eddyb proposed here (and the discussion that followed) to add two other hints instead:

From the RFC discussion there was consensus that we might want to add these benchmarking hints in the future as well because their are easier to specify and provide stronger guarantees than bench_black_box.

Right now, however, it is unclear whether these two hints can be implemented strictly in LLVM. The comment thread shows that the best we can actually do ends up implementing both of these as bench_black_box with the same effects.

Without a strict implementation, it is unclear which value these two intrinsics would add, and more importantly, since their difference in semantics cannot be shown, it is also unclear how we could teach users to use them correctly.

If we ever able to implement these correctly, we might want to consider deprecating bench_black_box at that point, but whether it will be worth deprecating is not clear either.

Prior art

Similar functionality is provided in the Google Benchmark C++ library: are called DoNotOptimize (bench_black_box) and ClobberMemory. The black_box function with slightly different semantics is provided by the test crate: test::black_box.

Unresolved questions