slice.iter_mut is not zero cost (vs IndexMut) when initializing a fresh vector #31890
@pnkfelix What hardware is this running on? I don't see quite as much of a difference between the two. This is what I get on a mostly idle [email protected] machine, rustc 1.8.0-nightly (0ef8d4260 2016-02-24):
running 72 tests
test test_add_tens_tens_index ... ignored
test test_add_tens_tens_itermut ... ignored
test _b_0::bench_applicative_add_index ... bench: 994 ns/iter (+/- 3)
test _b_0::bench_applicative_add_itermut ... bench: 1,007 ns/iter (+/- 2)
test _b_0::bench_imperative_add_index ... bench: 919 ns/iter (+/- 1)
test _b_0::bench_imperative_add_itermut ... bench: 467 ns/iter (+/- 2)
test _b_0::bench_imperative_reallocating_add_index ... bench: 1,528 ns/iter (+/- 4)
test _b_0::bench_imperative_reallocating_add_itermut ... bench: 701 ns/iter (+/- 5)
test _b_0::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_1::bench_applicative_add_index ... bench: 1,991 ns/iter (+/- 14)
test _b_1::bench_applicative_add_itermut ... bench: 2,014 ns/iter (+/- 45)
test _b_1::bench_imperative_add_index ... bench: 1,864 ns/iter (+/- 12)
test _b_1::bench_imperative_add_itermut ... bench: 923 ns/iter (+/- 8)
test _b_1::bench_imperative_reallocating_add_index ... bench: 3,096 ns/iter (+/- 13)
test _b_1::bench_imperative_reallocating_add_itermut ... bench: 1,415 ns/iter (+/- 20)
test _b_1::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_2::bench_applicative_add_index ... bench: 3,959 ns/iter (+/- 21)
test _b_2::bench_applicative_add_itermut ... bench: 4,007 ns/iter (+/- 19)
test _b_2::bench_imperative_add_index ... bench: 3,654 ns/iter (+/- 8)
test _b_2::bench_imperative_add_itermut ... bench: 1,840 ns/iter (+/- 8)
test _b_2::bench_imperative_reallocating_add_index ... bench: 6,107 ns/iter (+/- 22)
test _b_2::bench_imperative_reallocating_add_itermut ... bench: 3,177 ns/iter (+/- 25)
test _b_2::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_3::bench_applicative_add_index ... bench: 7,888 ns/iter (+/- 31)
test _b_3::bench_applicative_add_itermut ... bench: 7,991 ns/iter (+/- 59)
test _b_3::bench_imperative_add_index ... bench: 7,298 ns/iter (+/- 17)
test _b_3::bench_imperative_add_itermut ... bench: 3,670 ns/iter (+/- 19)
test _b_3::bench_imperative_reallocating_add_index ... bench: 12,167 ns/iter (+/- 35)
test _b_3::bench_imperative_reallocating_add_itermut ... bench: 6,428 ns/iter (+/- 36)
test _b_3::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_4::bench_applicative_add_index ... bench: 16,148 ns/iter (+/- 126)
test _b_4::bench_applicative_add_itermut ... bench: 16,366 ns/iter (+/- 221)
test _b_4::bench_imperative_add_index ... bench: 14,645 ns/iter (+/- 148)
test _b_4::bench_imperative_add_itermut ... bench: 7,339 ns/iter (+/- 65)
test _b_4::bench_imperative_reallocating_add_index ... bench: 43,689 ns/iter (+/- 309)
test _b_4::bench_imperative_reallocating_add_itermut ... bench: 32,396 ns/iter (+/- 273)
test _b_4::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_5::bench_applicative_add_index ... bench: 51,163 ns/iter (+/- 170)
test _b_5::bench_applicative_add_itermut ... bench: 51,692 ns/iter (+/- 400)
test _b_5::bench_imperative_add_index ... bench: 29,256 ns/iter (+/- 186)
test _b_5::bench_imperative_add_itermut ... bench: 14,835 ns/iter (+/- 108)
test _b_5::bench_imperative_reallocating_add_index ... bench: 87,279 ns/iter (+/- 380)
test _b_5::bench_imperative_reallocating_add_itermut ... bench: 65,990 ns/iter (+/- 398)
test _b_5::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_6::bench_applicative_add_index ... bench: 97,496 ns/iter (+/- 258)
test _b_6::bench_applicative_add_itermut ... bench: 98,849 ns/iter (+/- 556)
test _b_6::bench_imperative_add_index ... bench: 58,496 ns/iter (+/- 298)
test _b_6::bench_imperative_add_itermut ... bench: 29,850 ns/iter (+/- 98)
test _b_6::bench_imperative_reallocating_add_index ... bench: 165,795 ns/iter (+/- 864)
test _b_6::bench_imperative_reallocating_add_itermut ... bench: 124,239 ns/iter (+/- 23,023)
test _b_6::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_7::bench_applicative_add_index ... bench: 193,556 ns/iter (+/- 1,081)
test _b_7::bench_applicative_add_itermut ... bench: 196,236 ns/iter (+/- 1,349)
test _b_7::bench_imperative_add_index ... bench: 117,039 ns/iter (+/- 334)
test _b_7::bench_imperative_add_itermut ... bench: 59,679 ns/iter (+/- 330)
test _b_7::bench_imperative_reallocating_add_index ... bench: 330,134 ns/iter (+/- 4,477)
test _b_7::bench_imperative_reallocating_add_itermut ... bench: 246,488 ns/iter (+/- 1,615)
test _b_7::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_8::bench_applicative_add_index ... bench: 390,786 ns/iter (+/- 5,833)
test _b_8::bench_applicative_add_itermut ... bench: 395,830 ns/iter (+/- 4,165)
test _b_8::bench_imperative_add_index ... bench: 236,522 ns/iter (+/- 1,089)
test _b_8::bench_imperative_add_itermut ... bench: 121,987 ns/iter (+/- 462)
test _b_8::bench_imperative_reallocating_add_index ... bench: 666,722 ns/iter (+/- 10,072)
test _b_8::bench_imperative_reallocating_add_itermut ... bench: 504,304 ns/iter (+/- 13,070)
test _b_8::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_9::bench_applicative_add_index ... bench: 822,417 ns/iter (+/- 11,205)
test _b_9::bench_applicative_add_itermut ... bench: 827,093 ns/iter (+/- 10,132)
test _b_9::bench_imperative_add_index ... bench: 478,568 ns/iter (+/- 3,008)
test _b_9::bench_imperative_add_itermut ... bench: 250,486 ns/iter (+/- 1,647)
test _b_9::bench_imperative_reallocating_add_index ... bench: 1,386,815 ns/iter (+/- 30,359)
test _b_9::bench_imperative_reallocating_add_itermut ... bench: 1,093,889 ns/iter (+/- 10,134)
test _b_9::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
Your code is not the same as in the advice; it's a different case 😄. I spent some time looking into lock-step iteration and haven't found anything better to recommend than the slice-to-equal-length-and-iterate method. It requires quite exact adherence to the template, and it will optimize very well. Forum post: https://users.rust-lang.org/t/how-to-zip-two-slices-efficiently/2048/14
We never claim this. Iterators are not a silver bullet, and we should not pretend they are. What we regularly celebrate is how well the slice iterator compiles. Because it does, in the very simplest cases — loops like these (representative sketches):
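    // Representative sketches (assumed, not necessarily the exact original
    // snippets): simple forward traversals of a single slice, which the
    // slice iterator compiles to tight, bounds-check-free loops.
    pub fn sum(xs: &[i32]) -> i32 {
        let mut s = 0;
        for &x in xs {
            s += x;
        }
        s
    }

    pub fn scale_in_place(xs: &mut [i32], k: i32) {
        for x in xs.iter_mut() {
            *x *= k;
        }
    }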
and so on. Lock-step iteration (zip) is a completely different beast, and we know we are far from optimal there. See also this thread.
This is how to write it in the "how to zip slices" approach, and it gives a 5-10x speedup over the best of the other cases (because it autovectorizes):

    use std::cmp::min;

    pub fn imperative_add_vec_vec_zipslices(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
        // Slice all three to a common length so the compiler can see that
        // every index in the loop below is in bounds.
        let len = min(a.len(), b.len());
        let len = min(len, c.len());
        let a = &a[..len];
        let b = &b[..len];
        let c = &mut c[..len];
        for i in 0..len {
            c[i] = a[i] + b[i];
        }
    }
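The reason the template works: reslicing a, b, and c to the same len up front lets LLVM prove that every index used in the loop is in bounds, so the bounds checks are elided and the loop is free to vectorize.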
@dotdash the machine is a Core i7-4980HQ @ 2.8GHz. I didn't think it was too heavily loaded when I did the runs, but then again I did have two Firefox instances running, so who knows... I'll try it again after shutting those down and rebooting, to see if the variance drops. The results you showed in your run are very encouraging; if the applicative cases match this well for most target configurations, then we can probably just close this ticket. (But I want to at least try a run on my desktop machine to see if I can replicate your results there.)
@bluss hmm, thanks for that post. I guess I won't be able to show off the "prettiest" code in this case and still claim that it's the best you could hope to get.
I'm not sure what the result should be for any of the loops, because they don't ensure that all indices visited in the loop are in bounds before starting. I believe that's a boundary which will be very hard to overcome. What should eventually work here is zip itself.

@pnkfelix Covering this approach is going to be an important part of my ndarray talk in March. ndarray puts this know-how to use for efficient ndarray-and-ndarray operations (no hand-rolled matrix multiply, though).
@pnkfelix Another downer from the harsh reality of Rust + numerics. Consider the sum of a sequence:
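(A minimal stand-in, assuming the missing snippet was an ordinary floating-point sum:)

    // Hypothetical stand-in for the elided example: a plain sum over f32.
    // Floating-point addition is not associative, so LLVM will not reorder
    // the additions, and this reduction does not autovectorize the way the
    // integer loops elsewhere in this thread can.
    pub fn sum_f32(xs: &[f32]) -> f32 {
        let mut s = 0.0;
        for &x in xs {
            s += x;
        }
        s
    }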
This is also related to matrix multiplication, so I thought I'd mention it.
@bluss It seems like you are focusing on the fact that all of my loops, even the ones that use iter_mut, still carry bounds checks on the indexed reads. Is it wrong of me to think that if the destination is traversed via iter_mut, the loop should still get at least somewhat faster, since the writes need no bounds checks?
Update: fixed a typo and reordered some words to try to make my text clearer.
@pnkfelix Yes. I just have a very square head; both of those loops have a bounds check and a branch to panic in the loop body, so I didn't think anything else would be important (and it basically isn't). A branch to panic inside a loop will disrupt many potential optimizations, even seemingly unrelated ones. itermut is slightly better on an x86-64 Sandy Bridge laptop.
And yes, with zip it looks like this:

    pub fn imperative_add_vec_vec_zip(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
        for ((a, b), c) in a.iter().zip(b).zip(c) {
            *c = *a + *b;
        }
    }
@dotdash just an FYI: I redid the benchmarks on this Mac after logging in and not firing up any application but Terminal; the results were odd: the overall elapsed times seem like they went down, at least for the small inputs (so there probably was interference from background tasks), but the variance, if anything, seems to have gone up. Ugh.
The most important thing is that the results for the pairs of applicative cases still show iter_mut to be significantly more expensive than indexing; I'll throw out the small inputs, since the variance on them really is ridiculously large. Here's what's left, in terms of the comparison of applicative approaches.
(I'm going to try revising the benchmark, or at least adding new variants, to use @bluss's suggested technique of first extracting slices of known size for the two inputs, just to make the source expression as simple as possible, and see what that leads to.)
(I'm also going to try running the benchmark on a desktop Linux machine I have handy...)
Okay, apparently @bluss's suggestion does make a huge difference to how well LLVM is able to optimize writing to the destination. The performance delta between index and itermut is much more like what I was expecting to see for the "resl" cases. So now I'm not sure if there's really anything actionable here for now...
However: there is some hiccup in optimizing
@pnkfelix I don't think I'm entirely following the discussion here, but can we close this? You mentioned in your last comment that there might not be anything actionable for now, but I don't know quite what we'd do here in the future either...
I'm going to go ahead and close this, since it looks like this isn't actionable currently.
While working on some demonstration code involving a matrix multiply, I discovered that the claim that Rust's iterator abstractions boil away to something competitive with hand-written assembly does not hold for a for-loop over slice.iter_mut. That is, I would like to be able to give blanket advice along these lines: when writing to every element of a slice, prefer iterating via slice.iter_mut() over indexing, since the iterator form avoids per-element bounds checks.
But here is a concrete example where that is not good advice:
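(The two snippets don't appear here; judging from the benchmark names below, they presumably had roughly this shape — hypothetical reconstructions, not the original code:)

    // Presumed shape of the two "applicative" variants: each allocates a
    // fresh destination vector and fills it, one via indexing, one via
    // iter_mut. Reconstructed from the benchmark names; not the original.
    pub fn applicative_add_index(a: &[i32], b: &[i32]) -> Vec<i32> {
        let mut c = vec![0; a.len()];
        for i in 0..c.len() {
            c[i] = a[i] + b[i];
        }
        c
    }

    pub fn applicative_add_itermut(a: &[i32], b: &[i32]) -> Vec<i32> {
        let mut c = vec![0; a.len()];
        for (i, cv) in c.iter_mut().enumerate() {
            *cv = a[i] + b[i];
        }
        c
    }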
I made a benchmark to investigate the performance of the above two code snippets; the b_0/b_1/b_2/.../b_9 cases below respectively add vectors of length 1,000; 2,000; 4,000; ...; 512,000.
The index-based version is consistently beating the iter_mut version, with the latter exhibiting a slowdown in the range of 1.2 to 1.3x. Why is iter_mut slower?

I eventually realized that the destination Vec being allocated within the benchmark iteration was certainly adding overhead to the benchmark that I wasn't necessarily interested in measuring. This led me to make variations on the above benchmark with an imperative signature (fn(&Vec<i32>, &Vec<i32>, &mut Vec<i32>)): one that allocates the vector outside the benchmark's iter(|| ...) call, and another that reallocates within the closure passed to iter but still retains that imperative signature for the function being benchmarked. This allows us to isolate how different operations compare; i.e., we can directly observe how a for loop with indexing compares against iter_mut when LLVM does not know how large the vector being mutably traversed actually is.
Here then is the full benchmark code:
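(The benchmark source is not reproduced here; as a minimal sketch, the imperative pair named in the output presumably looked something like this, with the destination preallocated by the caller:)

    // Presumed shape of the imperative pair (reconstructed from the
    // benchmark names; not the original source): c is preallocated by
    // the caller, so its length is not known at the call site.
    pub fn imperative_add_index(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
        for i in 0..c.len() {
            c[i] = a[i] + b[i];
        }
    }

    pub fn imperative_add_itermut(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
        for (i, cv) in c.iter_mut().enumerate() {
            *cv = a[i] + b[i];
        }
    }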
And here are the results of running the above benchmark on my laptop. (The "linebreak" entries are, as the name indicates, just there to break up the different data sets on the right-hand side.)
The main conclusions I draw from these data sets are: iter_mut will perform better than indexing for initializing a destination vector (this is based on the lines labelled "imperative", comparing the "index" versus "itermut" pairs).