Joseph Richey 7089766df0
Add/Rework benchmarks to track initialization cost (#272)
This PR adds more benchmarks so we can get an accurate idea about two
things:

  - What is the cost of having to zero the buffer before calling
    `getrandom`?
  - What is the performance on aligned, 32-byte buffers?
    - This is by far the most common use, as it's used to seed
      userspace CSPRNGs.
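
As a rough illustration of the two cases above, here is a minimal
std-only sketch. It is not the benchmark code added in this PR: the
`AlignedSeed` type, the `time_fill` helper, and the `fill` closure
(standing in for the call to `getrandom`) are all hypothetical names.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical 32-byte buffer with 32-byte alignment, the shape
/// typically used to hold a CSPRNG seed.
#[repr(align(32))]
struct AlignedSeed([u8; 32]);

/// Times `iters` calls of `fill` (a stand-in for `getrandom`) on a
/// fresh buffer each iteration.
///
/// With `force_zero` set, the zeroed buffer is passed through
/// `black_box` first, which discourages the compiler from eliding
/// the zeroing stores as dead writes.
fn time_fill<F: FnMut(&mut [u8])>(iters: u32, force_zero: bool, mut fill: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        let mut buf = AlignedSeed([0u8; 32]);
        if force_zero {
            black_box(&mut buf);
        }
        fill(&mut buf.0);
        // Keep the result observable so the fill itself is not optimized out.
        black_box(&buf);
    }
    start.elapsed()
}
```

Assuming the 0.2 API of the crate, the closure could be
`|b| getrandom::getrandom(b).unwrap()`; comparing the two `force_zero`
settings then gives the cost of zeroing.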

I ran the benchmarks on my system:
  - CPU: AMD Ryzen 7 5700G
  - OS: Linux 5.15.52-1-lts
  - Rust Version: 1.62.0-nightly (ea92b0838 2022-05-07)

I got the following results:
```
test bench_large      ... bench:   3,759,323 ns/iter (+/- 177,100) = 557 MB/s
test bench_large_init ... bench:   3,821,229 ns/iter (+/- 39,132) = 548 MB/s
test bench_page       ... bench:       7,281 ns/iter (+/- 59) = 562 MB/s
test bench_page_init  ... bench:       7,290 ns/iter (+/- 69) = 561 MB/s
test bench_seed       ... bench:         206 ns/iter (+/- 3) = 155 MB/s
test bench_seed_init  ... bench:         206 ns/iter (+/- 1) = 155 MB/s
```
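
The throughput column can be cross-checked from the timings. The sketch
below assumes a truncating `bytes * 1000 / ns` computation, which
reproduces all six figures above. The 32-byte and 4096-byte sizes, and
especially the 2 MiB size for the "large" case, are assumptions inferred
from the numbers. The helper names are hypothetical.

```rust
/// Integer throughput in MB/s: bytes * 1000 / ns, truncated.
fn mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter.max(1)
}

/// Relative slowdown of `init_ns` over `base_ns`, in percent.
fn overhead_percent(base_ns: u64, init_ns: u64) -> f64 {
    (init_ns as f64 - base_ns as f64) / base_ns as f64 * 100.0
}

/// Assumed buffer sizes (inferred from the results, not stated in them).
const SEED: u64 = 32;
const PAGE: u64 = 4096;
const LARGE: u64 = 2 * 1024 * 1024;
```

For example, `mb_per_s(SEED, 206)` gives 155 and
`mb_per_s(LARGE, 3_759_323)` gives 557, while
`overhead_percent(3_759_323, 3_821_229)` comes out at roughly 1.6.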

These results were very consistent across multiple runs, and roughly
behave as we would expect:
  - The throughput is highest with a buffer large enough to amortize the
    syscall overhead, but small enough to stay in the L1D cache.
  - There is a _very_ small cost to zeroing the buffer beforehand.
  - This cost is imperceptible in the common 32-byte use case, where the
    syscall overhead dominates.
  - The cost is slightly higher (about 1.6% in this run) with
    multi-megabyte buffers, as the data gets evicted from the L1D cache
    between the `memset` and the call to `getrandom`.

I would love to see results for other platforms. Could we get someone to
run this on an M1 Mac?

Signed-off-by: Joe Richey <joerichey@google.com>
2022-07-13 06:04:41 -07:00