Description
When manipulating an array of numbers, it is pretty common to need its min/max/sum/... While discussing internals with fellow developers, someone pointed out that the C# max method leverages SIMD. Out of curiosity, I checked both C++ and Rust.
My findings are as follows:
- LLVM is able to auto-vectorize this kind of reduction loop
- the C++ STL max_element function benefits from that
- my custom implementation benefits from that too
- the Rust Iterator functions (max, min) do not
The last point comes from the fact that the iterator implementation cannot assume the item type implements the Copy trait, so on a slice it compares references to the elements rather than the element values themselves.
let my_array = (0..ITEM_COUNT).collect::<Vec<_>>();
// This is slow
#[inline(never)]
pub fn stdlib_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    a.iter().max().copied()
}
// This is fast
#[inline(never)]
pub fn custom_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    let first = *a.first()?;
    Some(a.iter().fold(first, |x, y| std::cmp::max(x, *y)))
}
=> Still, as an end user, I would have expected the "Rust way" of doing this (with iterators) to be optimal, and it is not.
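To make the reference-versus-value point concrete, here is a minimal, self-contained sketch. It reuses the two functions above (without `#[inline(never)]`), spells out the `Option<&T>` that `iter().max()` produces before `copied()` is applied, and checks that both versions return the same answer. The sample data in `main` is made up for illustration, and this only demonstrates the types and the results, not the generated assembly.

```rust
/// Iterator-based version: `iter()` yields `&T`, so `max()` compares and
/// returns references; `copied()` only runs on the final result.
pub fn stdlib_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    let by_ref: Option<&T> = a.iter().max();
    by_ref.copied()
}

/// Fold-based version: the accumulator is a plain `T` value, and each
/// element is dereferenced (copied) before being compared.
pub fn custom_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    let first = *a.first()?;
    Some(a.iter().fold(first, |x, y| std::cmp::max(x, *y)))
}

fn main() {
    // Illustrative data only, not taken from the benchmark.
    let data: Vec<i32> = vec![3, -7, 42, 0, 42, 19];
    assert_eq!(stdlib_max(&data), custom_max(&data));
    assert_eq!(custom_max(&data), Some(42));

    // Both versions return `None` on an empty slice.
    let empty: [i32; 0] = [];
    assert_eq!(stdlib_max(&empty), None);
    assert_eq!(custom_max(&empty), None);

    println!("both versions agree");
}
```

So the two functions are interchangeable for `Ord + Copy` element types as far as results go; the difference reported here is only in how well each one optimizes.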
Here is a small repository with a sample and a benchmark that demonstrates the issue:
[https://github.com/jfaixo/rust-max-bench]
Results for finding the max of a [i32; 100_000] array:
❯ rustc -vV
rustc 1.68.0-nightly (388538fc9 2023-01-05)
binary: rustc
commit-hash: 388538fc963e07a94e3fc3ac8948627fd2d28d29
commit-date: 2023-01-05
host: x86_64-unknown-linux-gnu
release: 1.68.0-nightly
LLVM version: 15.0.6
❯ cargo bench
Finished bench [optimized] target(s) in 0.00s
Running unittests src/lib.rs (target/release/deps/rust_max_bench-a5d988f9520f9dde)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running benches/bench.rs (target/release/deps/bench-cf556ddbd1b864fb)
running 3 tests
test custom ... bench: 8,052 ns/iter (+/- 385)
test itertools ... bench: 94,027 ns/iter (+/- 816)
test stdlib ... bench: 94,477 ns/iter (+/- 1,545)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out; finished in 2.40s
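For anyone who wants a rough check without the nightly bench harness, the comparison can be approximated on stable Rust with `std::time::Instant` and `std::hint::black_box`. This is only a sketch, not the benchmark from the linked repository: the `time_it` helper and the `RUNS` constant are hypothetical names introduced here, and wall-clock timings taken this way are noisier than the `cargo bench` numbers above. Build with `--release`, otherwise the numbers are meaningless.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

const ITEM_COUNT: i32 = 100_000;
// Hypothetical repetition count, chosen arbitrarily for this sketch.
const RUNS: u32 = 1_000;

#[inline(never)]
pub fn stdlib_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    a.iter().max().copied()
}

#[inline(never)]
pub fn custom_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    let first = *a.first()?;
    Some(a.iter().fold(first, |x, y| std::cmp::max(x, *y)))
}

/// Hypothetical helper: average wall-clock time of one call to `f`.
/// `black_box` keeps the optimizer from discarding the calls.
fn time_it(f: fn(&[i32]) -> Option<i32>, data: &[i32]) -> Duration {
    let start = Instant::now();
    for _ in 0..RUNS {
        black_box(f(black_box(data)));
    }
    start.elapsed() / RUNS
}

fn main() {
    let data = (0..ITEM_COUNT).collect::<Vec<i32>>();

    // Sanity check before timing: both versions must agree.
    assert_eq!(stdlib_max(&data), custom_max(&data));

    println!("stdlib: {:?} per call", time_it(stdlib_max::<i32>, &data));
    println!("custom: {:?} per call", time_it(custom_max::<i32>, &data));
}
```

The absolute values will differ from machine to machine; what matters is the ratio between the two lines.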