]> Witch of Git - web/blog/blob - posts/2018/min-benchmark.md
Add a debug command to the Justfile
[web/blog] / posts / 2018 / min-benchmark.md
1 ---
2 title: "Computing Min"
3 date: 2018-08-27
4 tags:
5 - rust
6 - benchmarking
7 summary: "I benchmark a few different implementations of min, because that's fun."
8 ---
9
10 Inspired by the ["gamedev wishlist for rust"][], I got curious if computing the minimum of a bunch of numbers with `min(min(min(a, b), c), d)` was effective.
11 My thinking was that this would produce unnecessary dependency chains in the processor, stopping out-of-order executions of independent `min`s.
12 Also, this was a good excuse to try out [Criterion][], so I set out to measure the impact.
13 One extra node
14
15 ["gamedev wishlist for rust"]: https://users.rust-lang.org/t/my-gamedever-wishlist-for-rust/2859
16 [Criterion]: https://github.com/japaric/criterion.rs
17
18 ## Implementation
19
20 In my actual benchmark I produced two copies of each of these methods specified here.
21 One for `std::cmp::min`, and one for `f32` (since it's not `Ord`).
22 For simplicity, I'll just use the generic one here, they both look pretty much the same.
23
24 ### Loopy
25
26 First, I was curious if a usual `.iter().min()` would perform well.
27 The theory here is that *ideally*, for a known list length, if the compiler thought it was worthwhile, this would compile to the same code as a straight line of `min`.
28 So, our first case is this:
29
30 ```rust
31 [a, b, c, d, e].iter().min()
32 ```
33
34 ### Linear Reduction Macro
35
36 The second method is a macro that will turn `min!(a, b, c, d, e)` into `min(a, min(b, min(c, min(d, e))))`.
37 This is a direct recursive macro that that just accumulates the `min` calls.
38 If you're familiar with Rust macros, nothing *too* scary is going on here.
39
40 ```rust
41 #[macro_export]
42 macro_rules! min {
43 ($x:expr $(,)*) => { $x };
44 ($x:expr $(, $y:expr)* $(,)*) => { ::std::cmp::min($x, min!($($y),*)) };
45 }
46 ```
47
48 ### Tree Reduction Macro
49
50 This macro is quite hairy.
51 The goal is to turn something like `min_tree!(a, b, c, d, e)` into `min(min(a, min(c, e)), min(b, d))` in order to allow the processor to simultaneously execute the leaf `min` calls.
52 Let me walk us through the parts:
53
54 First, we have the `()` case.
55 The `Ord` typeclass doesn't offer us a top element, so we just give an error if there are no arguments.
56 (The float version returns `f32::INFINITY` in this case.)
57
58 Next, we have the base cases.
59 These look very similar to the cases from the `min!` macro, except that the n-element case calls the `@split` case.
60 The `@split` cases are dedicated to taking a list of expressions, and partitioning it into two different lists of expressions.
61 The idea being that if you can split it into two lists, then you can do `min_tree!` to each of those two lists.
62 The first `@split` case pulls two items off the arguments if they're available, and puts one in each accumulator list.
63 The second case is if there's only one argument left, and the final case is for when there are no arguments left.
64 Once the argument list has been split into two parts, we do `min(min_tree!(a...), min_tree!(b...))`, recursively constructing the tree.
65
66 ```rust
67 #[macro_export]
68 macro_rules! min_tree {
69 () => { compile_error!("Cannot compute the minimum of 0 elements") };
70 ($x:expr) => { $x };
71 ($x:expr, $y:expr) => { ::std::cmp::min($x, $y) };
72 ($($x:expr),* $(,)*) => { min_tree!(@split []; []; $($x),*) };
73 (@split [$($a:expr),*]; [$($b:expr),*]; $x:expr, $y:expr $(, $z:expr)*) => {
74 min_tree!(@split [$x $(, $a)*]; [$y $(, $b)*]; $($z),*)
75 };
76 (@split [$($a:expr),*]; [$($b:expr),*]; $x:expr) => {
77 min_tree!(@split [$x $(, $a)*]; [$($b),*];)
78 };
79 (@split [$($a:expr),*]; [$($b:expr),*];) => {
80 ::std::cmp::min(min_tree!($($a),*), min_tree!($($b),*))
81 };
82 }
83 ```
84
85 ## Results
86
87 First of all, I was right, tree reduction is faster, at least for the 10-element `min` I was benchmarking.
88 This is imagining in the context of graphics applications, so we expect relatively small cases, often of a known size (like finding the minimum among 8 neighbors, for instance).
89 What was slightly more surprising to me was that for floats, the loop was faster than the linear reduction.
90 Looking at [godbolt output][] for a hardcoded case shows that they all get vectorized (and the loop gets unrolled), just with slightly different load scheduling.
91
92 [godbolt output]: https://godbolt.org/z/e32CnV
93
94 Criterion produces really cool graphs.
95 Here's the results from the two cases:
96
97 ![violin-i32](/img/2018-08-27-violin-i32.svg)
98 ![violin-f32](/img/2018-08-27-violin-f32.svg)
99
100 I suspect if you want to compute the minimum of a *very* large list, you'll benefit from doing tree reductions on independent chunks in a loop.