Skip to content

Iterator::max with reference-type items cannot leverage SIMD instructions, resulting in low performance #106539

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jfaixo opened this issue Jan 6, 2023 · 6 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@jfaixo
Copy link

jfaixo commented Jan 6, 2023

When manipulating array of numbers, it is pretty common to have to find the min/max/sum/... of it. While discussing about internals with fellow developers, someone pointed out that the C# max method leverages SIMD. By curiosity I checked for both C++ and Rust.

My findings are as follow:

  • LLVM is able to auto vectorize this kind of stuff
  • the C++ STL max_element function leverages that
  • my custom implementation is able to leverages that
  • the Rust Iter functions (max, min) cannot

This last bullet is due to the fact that the implementation does not expect the type to implement the Copy trait, and operates over references, and not actual type of the array.

let my_array = (0..ITEM_COUNT).collect::<Vec<_>>();

// This is slow
#[inline(never)]
pub fn stdlib_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    a.iter().max().copied()
}

// This is fast
#[inline(never)]
pub fn custom_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    let first = *a.first()?;
    Some(a.iter().fold(first, |x, y| std::cmp::max(x, *y)))
}

=> Still, as an end user, I would have expected that the "rust way" to do the thing (with iterator) would be optimal, and it is not.

I link a small repository with a sample and bench pointing the issue:

[https://github.com/jfaixo/rust-max-bench]

For finding the max of a [i32; 100_000] array :

❯ rustc -vV
rustc 1.68.0-nightly (388538fc9 2023-01-05)
binary: rustc
commit-hash: 388538fc963e07a94e3fc3ac8948627fd2d28d29
commit-date: 2023-01-05
host: x86_64-unknown-linux-gnu
release: 1.68.0-nightly
LLVM version: 15.0.6

❯ cargo bench
    Finished bench [optimized] target(s) in 0.00s
     Running unittests src/lib.rs (target/release/deps/rust_max_bench-a5d988f9520f9dde)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/bench.rs (target/release/deps/bench-cf556ddbd1b864fb)

running 3 tests
test custom    ... bench:       8,052 ns/iter (+/- 385)
test itertools ... bench:      94,027 ns/iter (+/- 816)
test stdlib    ... bench:      94,477 ns/iter (+/- 1,545)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out; finished in 2.40s
@aDotInTheVoid
Copy link
Member

As suspected, underlying difference is that stdlib does max on &i32, but you'r doing max on i32, which is much more amenable to LLVM: eg by partialy inlining max (godbolt)

Placing the .copied() before the .max() allows you to recapture perf without open coding max (godbolt)

Its possible a clippy lint could help migrate this pitfall, but ultimatly the best solution would be for LLVM to recognize this, and lift the copy to before the max.

@aDotInTheVoid aDotInTheVoid added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jan 6, 2023
@scottmcm
Copy link
Member

scottmcm commented Jan 6, 2023

This would be extremely difficult for LLVM, because the Iterator::max that return Option<&i32> needs to return the address -- LLVM can't know when compiling max that it's not going to use the address, as the caller might be using that address, say to sub_ptr.

I think this would be much better as a performance-category lint that you should "move the .copied() earlier" for cheap-to-Copy things (8x-usize or less, say). Or just in general to encourage .iter().copied() instead of .iter()on simple stuff like&[i32]`.

@scottmcm scottmcm changed the title Iterator::max cannot leverage SIMD instructions, resulting in low performance Iterator::max with reference-type items cannot leverage SIMD instructions, resulting in low performance Jan 6, 2023
@scottmcm
Copy link
Member

scottmcm commented Jan 6, 2023

Updated the title because Iterator::max absolutely can use SIMD:

// This is fast
pub fn stdlib_max<T: Ord + Copy>(a: &[T]) -> Option<T> {
    a.iter().copied().max()
}

pub fn demo(z: &[i32]) -> Option<i32> {
    stdlib_max(z)
}

The SIMD is most obvious in LLVM IR:

  %12 = tail call <4 x i32> @llvm.smax.v4i32(<4 x i32> %vec.phi, <4 x i32> %wide.load)
  %13 = tail call <4 x i32> @llvm.smax.v4i32(<4 x i32> %vec.phi1, <4 x i32> %wide.load3)

https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=7d61eb9e7c3541986ca8529200cd18e7

@workingjubilee
Copy link
Member

I agree LLVM cannot be expected to handle this for us. Is it possible that we could use MIR-level knowledge to lift the copy ourselves?

@workingjubilee workingjubilee added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jan 7, 2023
@the8472
Copy link
Member

the8472 commented Jan 7, 2023

It might be possible to specialize the max impl for &T where T: Ord + Copy to do the copying and only try to figure out which pointer to use when it finds a larger value.

If several elements are equally maximum, the last element is
returned. If the iterator is empty, [None] is returned.

Well, uh, the opposite sure would be preferable here. But even then it should be possible.

@jfaixo
Copy link
Author

jfaixo commented Jan 8, 2023

I did not know that copied can be called on Iter, that's on me !

It might be possible to specialize the max impl for &T where T: Ord + Copy to do the copying and only try to figure out which pointer to use when it finds a larger value.

That would be great for scalar types that implement Ord :)

The SIMD is most obvious in LLVM IR:

  %12 = tail call <4 x i32> @llvm.smax.v4i32(<4 x i32> %vec.phi, <4 x i32> %wide.load)
  %13 = tail call <4 x i32> @llvm.smax.v4i32(<4 x i32> %vec.phi1, <4 x i32> %wide.load3)

Also more obvious with avx2 vpmaxsd https://godbolt.org/z/q3G5nP4jG

I'm not fluent at all in assembly, but the iterator copy seems to be fully optimised out, is that right ? So this syntax addresses my concern :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants