Description
While working on some demonstration code involving a matrix multiply, I discovered that the claim that Rust's iterator abstractions boil away to something competitive with hand-written assembly does not hold for a for-loop over slice.iter_mut().
That is, I would like to be able to give the following blanket advice:
You should prefer for elem in vec.iter_mut() { *elem = ...; } over for i in 0..n { vec[i] = ...; }, as the first form will always be at least as fast as the second, and is sometimes faster.
But here is a concrete example where that is not good advice:
pub fn applicative_add_vec_vec_index(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut c = vec![0; n];
    for i in 0..n { // ⇐ COMPARING THIS...
        c[i] = a[i] + b[i];
    }
    return c;
}

pub fn applicative_add_vec_vec_itermut(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut i = 0;
    let mut c = vec![0; n];
    for c_i in c.iter_mut() { // ⇐ ... TO THIS.
        *c_i = a[i] + b[i];
        i += 1;
    }
    return c;
}
I made a benchmark to investigate the performance of the above two code snippets; the _b_0/_b_1/_b_2/.../_b_9 cases below respectively add vectors of length 1,000; 2,000; 4,000; ...; 512,000.
test _b_0::bench_applicative_add_index ... bench: 708 ns/iter (+/- 33)
test _b_0::bench_applicative_add_itermut ... bench: 1,065 ns/iter (+/- 66)
test _b_1::bench_applicative_add_index ... bench: 1,418 ns/iter (+/- 545)
test _b_1::bench_applicative_add_itermut ... bench: 2,126 ns/iter (+/- 793)
test _b_2::bench_applicative_add_index ... bench: 3,038 ns/iter (+/- 715)
test _b_2::bench_applicative_add_itermut ... bench: 4,223 ns/iter (+/- 705)
test _b_3::bench_applicative_add_index ... bench: 5,883 ns/iter (+/- 2,503)
test _b_3::bench_applicative_add_itermut ... bench: 8,480 ns/iter (+/- 4,205)
test _b_4::bench_applicative_add_index ... bench: 12,255 ns/iter (+/- 949)
test _b_4::bench_applicative_add_itermut ... bench: 17,348 ns/iter (+/- 8,314)
test _b_5::bench_applicative_add_index ... bench: 39,095 ns/iter (+/- 10,043)
test _b_5::bench_applicative_add_itermut ... bench: 48,762 ns/iter (+/- 15,931)
test _b_6::bench_applicative_add_index ... bench: 76,462 ns/iter (+/- 26,995)
test _b_6::bench_applicative_add_itermut ... bench: 94,892 ns/iter (+/- 34,273)
test _b_7::bench_applicative_add_index ... bench: 150,292 ns/iter (+/- 61,944)
test _b_7::bench_applicative_add_itermut ... bench: 186,523 ns/iter (+/- 81,613)
test _b_8::bench_applicative_add_index ... bench: 312,718 ns/iter (+/- 64,088)
test _b_8::bench_applicative_add_itermut ... bench: 399,138 ns/iter (+/- 68,355)
test _b_9::bench_applicative_add_index ... bench: 654,565 ns/iter (+/- 112,684)
test _b_9::bench_applicative_add_itermut ... bench: 852,169 ns/iter (+/- 131,565)
The index-based version consistently beats the iter_mut version, with the latter exhibiting a slowdown in the range of 1.2x to 1.3x.
- That is the primary thing I am interested in addressing by filing this ticket: can we get the code generation for the functions above to the point where there is no significant performance hit for using iter_mut?
- (Ideally it would be faster than using indexing, but I would be satisfied if the running times were roughly the same for both.)
- I did a bisection over the nightlies with multirust, and determined that the above two code snippets used to have the same performance profile. Between nightly-2015-06-17 0250ff9 and nightly-2015-06-18 20d23d8, the index-based version got much faster, while the iter_mut version got slower, yielding an observed performance difference similar to the one documented above.
I eventually realized that the destination Vec being allocated within the benchmark iteration was adding overhead to the benchmark that I wasn't necessarily interested in measuring.
This led me to make variations on the above benchmark with an imperative signature (fn(&Vec<i32>, &Vec<i32>, &mut Vec<i32>)): one that allocates the destination vector outside the benchmark's iter(|| ...) call, and another that reallocates it within the closure passed to iter, but still retains the imperative signature for the function being benchmarked.
This lets us isolate how the different operations compare; i.e. we can directly observe how a for loop with indexing fares against iter_mut when LLVM cannot see how large the mutably traversed vector actually is.
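As an aside, one way to hand that length information to LLVM explicitly, without changing the overall shape of the indexed loop, is to re-slice all three vectors to a common length before the loop. The following is only a sketch (the function name is mine, and it is not one of the benchmarked variants below); my understanding is that the up-front slicing is what would let the per-iteration bounds checks be elided, but I have not measured it here:
pub fn imperative_add_vec_vec_index_sliced(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
    let n = a.len();
    // Take length-`n` views of all three vectors up front. Each slicing
    // operation checks the length once (panicking if a vector is too short);
    // after that, every `i < n` is in bounds for all three views.
    let (a, b, c) = (&a[..n], &b[..n], &mut c[..n]);
    for i in 0..n {
        c[i] = a[i] + b[i];
    }
}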
Here then is the full benchmark code:
#![feature(test)]
extern crate test;

pub fn imperative_add_vec_vec_index(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
    let n = a.len();
    for i in 0..n { // ⇐ COMPARING THIS...
        c[i] = a[i] + b[i];
    }
}

pub fn imperative_add_vec_vec_itermut(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
    let mut i = 0;
    for c_i in c.iter_mut() { // ⇐ ... TO THIS.
        *c_i = a[i] + b[i];
        i += 1;
    }
}

pub fn applicative_add_vec_vec_index(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut c = vec![0; n];
    for i in 0..n { // ⇐ COMPARING THIS...
        c[i] = a[i] + b[i];
    }
    return c;
}

pub fn applicative_add_vec_vec_itermut(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut i = 0;
    let mut c = vec![0; n];
    for c_i in c.iter_mut() { // ⇐ ... TO THIS.
        *c_i = a[i] + b[i];
        i += 1;
    }
    return c;
}
#[test]
fn test_add_tens_tens_index() {
    let mut c = vec![0, 0, 0];
    imperative_add_vec_vec_index(&vec![10,20,30],
                                 &vec![10,20,30],
                                 &mut c);
    assert_eq!(c, vec![20, 40, 60]);
    c = applicative_add_vec_vec_index(&vec![10,20,30],
                                      &vec![10,20,30]);
    assert_eq!(c, vec![20, 40, 60]);
}
#[test]
fn test_add_tens_tens_itermut() {
    let mut c = vec![0, 0, 0];
    imperative_add_vec_vec_itermut(&vec![10,20,30],
                                   &vec![10,20,30],
                                   &mut c);
    assert_eq!(c, vec![20, 40, 60]);
    c = applicative_add_vec_vec_itermut(&vec![10,20,30],
                                        &vec![10,20,30]);
    assert_eq!(c, vec![20, 40, 60]);
}
macro_rules! add_benches {
    ($prefix: ident, $size: expr) => {
        pub mod $prefix {
            #[bench]
            fn bench_imperative_add_index(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let c = &mut vec![0; $size];
                b.iter(|| super::imperative_add_vec_vec_index(u, v, c));
                assert_eq!(c, &mut vec![20; $size]);
            }

            #[bench]
            fn bench_imperative_add_itermut(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let c = &mut vec![0; $size];
                b.iter(|| super::imperative_add_vec_vec_itermut(u, v, c));
                assert_eq!(c, &mut vec![20; $size]);
            }

            #[bench]
            fn bench_imperative_reallocating_add_index(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                b.iter(|| {
                    let c = &mut vec![0; $size];
                    super::imperative_add_vec_vec_index(u, v, c);
                    assert_eq!(c, &mut vec![20; $size]);
                });
            }

            #[bench]
            fn bench_imperative_reallocating_add_itermut(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                b.iter(|| {
                    let c = &mut vec![0; $size];
                    super::imperative_add_vec_vec_itermut(u, v, c);
                    assert_eq!(c, &mut vec![20; $size]);
                });
            }

            #[bench]
            fn bench_applicative_add_index(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let mut c = vec![0; $size];
                b.iter(|| c = super::applicative_add_vec_vec_index(u, v));
                assert_eq!(c, vec![20; $size]);
            }

            #[bench]
            fn bench_applicative_add_itermut(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let mut c = vec![0; $size];
                b.iter(|| c = super::applicative_add_vec_vec_itermut(u, v));
                assert_eq!(c, vec![20; $size]);
            }

            #[bench]
            fn linebreak_applicative_imperative_reallocating(_: &mut ::test::Bencher) { }
        }
    }
}
add_benches!(_b_0, 1_000);
add_benches!(_b_1, 2_000);
add_benches!(_b_2, 4_000);
add_benches!(_b_3, 8_000);
add_benches!(_b_4, 16_000);
add_benches!(_b_5, 32_000);
add_benches!(_b_6, 64_000);
add_benches!(_b_7, 128_000);
add_benches!(_b_8, 256_000);
add_benches!(_b_9, 512_000);
And here are the results of running the above benchmark on my laptop. (The "linebreak" entries are, as the name indicates, just there to break up the different data sets in the right-hand side of the output.)
% multirust show-default && cargo test test && cargo bench
multirust: default toolchain: nightly
multirust: default location: /Users/fklock/.multirust/toolchains/nightly
rustc 1.8.0-nightly (0ef8d4260 2016-02-24)
cargo 0.9.0-nightly (e721289 2016-02-25)
Running target/debug/iter_mut_bench-1f3abd6a849082b3
running 2 tests
test test_add_tens_tens_index ... ok
test test_add_tens_tens_itermut ... ok
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured
Doc-tests iter_mut_bench
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured
Running target/release/iter_mut_bench-1f3abd6a849082b3
running 72 tests
test test_add_tens_tens_index ... ignored
test test_add_tens_tens_itermut ... ignored
test _b_0::bench_applicative_add_index ... bench: 760 ns/iter (+/- 102)
test _b_0::bench_applicative_add_itermut ... bench: 1,079 ns/iter (+/- 172)
test _b_0::bench_imperative_add_index ... bench: 766 ns/iter (+/- 142)
test _b_0::bench_imperative_add_itermut ... bench: 649 ns/iter (+/- 141)
test _b_0::bench_imperative_reallocating_add_index ... bench: 1,377 ns/iter (+/- 342)
test _b_0::bench_imperative_reallocating_add_itermut ... bench: 1,031 ns/iter (+/- 254)
test _b_0::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_1::bench_applicative_add_index ... bench: 1,735 ns/iter (+/- 1,079)
test _b_1::bench_applicative_add_itermut ... bench: 3,082 ns/iter (+/- 5,055)
test _b_1::bench_imperative_add_index ... bench: 1,530 ns/iter (+/- 433)
test _b_1::bench_imperative_add_itermut ... bench: 1,277 ns/iter (+/- 368)
test _b_1::bench_imperative_reallocating_add_index ... bench: 2,870 ns/iter (+/- 1,037)
test _b_1::bench_imperative_reallocating_add_itermut ... bench: 2,126 ns/iter (+/- 2,189)
test _b_1::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_2::bench_applicative_add_index ... bench: 3,375 ns/iter (+/- 1,176)
test _b_2::bench_applicative_add_itermut ... bench: 5,092 ns/iter (+/- 7,258)
test _b_2::bench_imperative_add_index ... bench: 3,163 ns/iter (+/- 2,309)
test _b_2::bench_imperative_add_itermut ... bench: 2,740 ns/iter (+/- 3,348)
test _b_2::bench_imperative_reallocating_add_index ... bench: 5,820 ns/iter (+/- 4,686)
test _b_2::bench_imperative_reallocating_add_itermut ... bench: 4,496 ns/iter (+/- 1,314)
test _b_2::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_3::bench_applicative_add_index ... bench: 7,107 ns/iter (+/- 4,992)
test _b_3::bench_applicative_add_itermut ... bench: 10,039 ns/iter (+/- 5,720)
test _b_3::bench_imperative_add_index ... bench: 6,395 ns/iter (+/- 4,394)
test _b_3::bench_imperative_add_itermut ... bench: 5,488 ns/iter (+/- 7,558)
test _b_3::bench_imperative_reallocating_add_index ... bench: 11,609 ns/iter (+/- 10,878)
test _b_3::bench_imperative_reallocating_add_itermut ... bench: 9,217 ns/iter (+/- 5,291)
test _b_3::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_4::bench_applicative_add_index ... bench: 13,699 ns/iter (+/- 16,487)
test _b_4::bench_applicative_add_itermut ... bench: 19,223 ns/iter (+/- 2,807)
test _b_4::bench_imperative_add_index ... bench: 11,990 ns/iter (+/- 1,706)
test _b_4::bench_imperative_add_itermut ... bench: 10,324 ns/iter (+/- 2,901)
test _b_4::bench_imperative_reallocating_add_index ... bench: 37,090 ns/iter (+/- 13,541)
test _b_4::bench_imperative_reallocating_add_itermut ... bench: 32,028 ns/iter (+/- 9,136)
test _b_4::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_5::bench_applicative_add_index ... bench: 45,018 ns/iter (+/- 20,031)
test _b_5::bench_applicative_add_itermut ... bench: 55,431 ns/iter (+/- 57,384)
test _b_5::bench_imperative_add_index ... bench: 27,231 ns/iter (+/- 15,549)
test _b_5::bench_imperative_add_itermut ... bench: 21,901 ns/iter (+/- 31,860)
test _b_5::bench_imperative_reallocating_add_index ... bench: 75,070 ns/iter (+/- 14,594)
test _b_5::bench_imperative_reallocating_add_itermut ... bench: 66,472 ns/iter (+/- 19,059)
test _b_5::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_6::bench_applicative_add_index ... bench: 83,719 ns/iter (+/- 13,320)
test _b_6::bench_applicative_add_itermut ... bench: 110,591 ns/iter (+/- 23,880)
test _b_6::bench_imperative_add_index ... bench: 52,350 ns/iter (+/- 11,850)
test _b_6::bench_imperative_add_itermut ... bench: 45,233 ns/iter (+/- 13,832)
test _b_6::bench_imperative_reallocating_add_index ... bench: 139,599 ns/iter (+/- 36,488)
test _b_6::bench_imperative_reallocating_add_itermut ... bench: 115,499 ns/iter (+/- 41,773)
test _b_6::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_7::bench_applicative_add_index ... bench: 152,048 ns/iter (+/- 23,623)
test _b_7::bench_applicative_add_itermut ... bench: 195,796 ns/iter (+/- 24,616)
test _b_7::bench_imperative_add_index ... bench: 93,273 ns/iter (+/- 17,124)
test _b_7::bench_imperative_add_itermut ... bench: 82,290 ns/iter (+/- 14,308)
test _b_7::bench_imperative_reallocating_add_index ... bench: 260,315 ns/iter (+/- 42,152)
test _b_7::bench_imperative_reallocating_add_itermut ... bench: 222,595 ns/iter (+/- 28,584)
test _b_7::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_8::bench_applicative_add_index ... bench: 320,524 ns/iter (+/- 46,424)
test _b_8::bench_applicative_add_itermut ... bench: 419,922 ns/iter (+/- 45,263)
test _b_8::bench_imperative_add_index ... bench: 191,193 ns/iter (+/- 32,244)
test _b_8::bench_imperative_add_itermut ... bench: 161,733 ns/iter (+/- 29,028)
test _b_8::bench_imperative_reallocating_add_index ... bench: 574,651 ns/iter (+/- 96,008)
test _b_8::bench_imperative_reallocating_add_itermut ... bench: 513,109 ns/iter (+/- 63,440)
test _b_8::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test _b_9::bench_applicative_add_index ... bench: 673,698 ns/iter (+/- 114,850)
test _b_9::bench_applicative_add_itermut ... bench: 842,234 ns/iter (+/- 130,654)
test _b_9::bench_imperative_add_index ... bench: 381,448 ns/iter (+/- 76,419)
test _b_9::bench_imperative_add_itermut ... bench: 339,002 ns/iter (+/- 63,421)
test _b_9::bench_imperative_reallocating_add_index ... bench: 1,705,358 ns/iter (+/- 231,283)
test _b_9::bench_imperative_reallocating_add_itermut ... bench: 1,533,730 ns/iter (+/- 253,143)
test _b_9::linebreak_applicative_imperative_reallocating ... bench: 0 ns/iter (+/- 0)
test result: ok. 0 passed; 0 failed; 2 ignored; 70 measured
The main conclusions I draw from these data sets are:
- If you avoid reallocation (which means you don't use applicative style), then iter_mut will perform better than indexing for initializing a destination vector. (This is based on the lines labelled "imperative" and comparing the "index" versus "itermut" pairs.)
- If you reallocate on each iteration and use iter_mut, then for some bizarre reason on small vectors ("b_0", "b_1", "b_2", "b_3") doing the reallocation outside of the initialization itself ("imperative_reallocating_add_itermut") will perform better than applicative style ("applicative_add_itermut"). At some threshold ("b_4" and above) the applicative itermut is faster than the reallocating itermut.
- There is a huge performance difference between "imperative_reallocating_add_index" and "applicative_add_index". My current hypothesis is that LLVM is doing some amazing work to optimize all of the index-based accesses when the vector is allocated within the same function (and thus LLVM can see how large the backing store for the vector is).
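As a closing aside, the shape of iterator code I would actually expect to optimize best here avoids the manual index counter entirely and zips the three slices together, so that the loop body contains no indexing at all. Again, this is only a sketch for comparison (the name and the slice-based signature are mine, and it is not one of the variants measured above):
pub fn imperative_add_vec_vec_zip(a: &[i32], b: &[i32], c: &mut [i32]) {
    // The zipped iterator stops at the shortest of the three slices, so the
    // loop body performs no indexing and hence needs no bounds checks.
    for (c_i, (a_i, b_i)) in c.iter_mut().zip(a.iter().zip(b.iter())) {
        *c_i = *a_i + *b_i;
    }
}
Since the zip stops at the shortest input, it also sidesteps the question of what should happen when the three lengths disagree.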