slice.iter_mut is not zero cost (vs IndexMut) when initializing a fresh vector #31890

Closed
pnkfelix opened this issue Feb 25, 2016 · 17 comments
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@pnkfelix
Member

While working on some demonstration code involving a matrix multiply, I discovered that the claim that Rust's iterator abstractions boil away to something competitive with hand-written assembly does not hold for a for-loop over slice.iter_mut.

That is, I would like to be able to give the following blanket advice:

You should prefer

for elem in vec.iter_mut() { *elem = ...; }

over

for i in 0..n { vec[i] = ...; }

as the first form will always be at least as fast as the second, and is sometimes faster

But here is a concrete example where that is not good advice:

pub fn applicative_add_vec_vec_index(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut c = vec![0; n];
    for i in 0..n {         // ⇐ COMPARING THIS...
        c[i] = a[i] + b[i];
    }
    return c;
}

pub fn applicative_add_vec_vec_itermut(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut i = 0;
    let mut c = vec![0; n];
    for c_i in c.iter_mut() { // ⇐ ... TO THIS.
        *c_i = a[i] + b[i];
        i += 1;
    }
    return c;
}

I made a benchmark to investigate the performance of the above two code snippets; the b_0/b_1/b_2/.../b_9 cases below respectively add vectors of length 1,000; 2,000; 4,000; ...; 512,000.

test _b_0::bench_applicative_add_index                   ... bench:         708 ns/iter (+/- 33)
test _b_0::bench_applicative_add_itermut                 ... bench:       1,065 ns/iter (+/- 66)

test _b_1::bench_applicative_add_index                   ... bench:       1,418 ns/iter (+/- 545)
test _b_1::bench_applicative_add_itermut                 ... bench:       2,126 ns/iter (+/- 793)

test _b_2::bench_applicative_add_index                   ... bench:       3,038 ns/iter (+/- 715)
test _b_2::bench_applicative_add_itermut                 ... bench:       4,223 ns/iter (+/- 705)

test _b_3::bench_applicative_add_index                   ... bench:       5,883 ns/iter (+/- 2,503)
test _b_3::bench_applicative_add_itermut                 ... bench:       8,480 ns/iter (+/- 4,205)

test _b_4::bench_applicative_add_index                   ... bench:      12,255 ns/iter (+/- 949)
test _b_4::bench_applicative_add_itermut                 ... bench:      17,348 ns/iter (+/- 8,314)

test _b_5::bench_applicative_add_index                   ... bench:      39,095 ns/iter (+/- 10,043)
test _b_5::bench_applicative_add_itermut                 ... bench:      48,762 ns/iter (+/- 15,931)

test _b_6::bench_applicative_add_index                   ... bench:      76,462 ns/iter (+/- 26,995)
test _b_6::bench_applicative_add_itermut                 ... bench:      94,892 ns/iter (+/- 34,273)

test _b_7::bench_applicative_add_index                   ... bench:     150,292 ns/iter (+/- 61,944)
test _b_7::bench_applicative_add_itermut                 ... bench:     186,523 ns/iter (+/- 81,613)

test _b_8::bench_applicative_add_index                   ... bench:     312,718 ns/iter (+/- 64,088)
test _b_8::bench_applicative_add_itermut                 ... bench:     399,138 ns/iter (+/- 68,355)

test _b_9::bench_applicative_add_index                   ... bench:     654,565 ns/iter (+/- 112,684)
test _b_9::bench_applicative_add_itermut                 ... bench:     852,169 ns/iter (+/- 131,565)

The index-based version consistently beats the iter_mut version, with the latter exhibiting a slowdown in the range of 1.2x to 1.3x.

  • That is the primary thing I am interested in addressing by filing this ticket: can we improve code generation for the functions above so that there is no significant performance hit for using iter_mut?
  • (Ideally it would be faster than using indexing, but I would be satisfied if the running times were roughly the same for both.)
  • I did a bisection over the nightlies with multirust and determined that the above two code snippets used to have the same performance profile. Between nightly-2015-06-17 0250ff9 and nightly-2015-06-18 20d23d8, the index-based version got much faster while the iter_mut version got slower, yielding an observed performance difference similar to that documented above.

I eventually realized that allocating the destination vector within the benchmark iteration was adding overhead that I wasn't necessarily interested in measuring.

This led me to make variations on the above benchmark with an imperative signature (fn(&Vec<i32>, &Vec<i32>, &mut Vec<i32>)): one that allocates the vector outside the benchmark's iter(|| ...) call, and another that reallocates within the closure passed to iter but still retains the imperative signature for the function being benchmarked.

This lets us isolate how the different operations compare; i.e., we can directly observe how a for loop with indexing compares to iter_mut when LLVM cannot see how large the mutably traversed vector actually is.

Here then is the full benchmark code:

#![feature(test)]
extern crate test;

pub fn imperative_add_vec_vec_index(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
    let n = a.len();
    for i in 0..n {         // ⇐ COMPARING THIS...
        c[i] = a[i] + b[i];
    }
}

pub fn imperative_add_vec_vec_itermut(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
    let mut i = 0;
    for c_i in c.iter_mut() { // ⇐ ... TO THIS.
        *c_i = a[i] + b[i];
        i += 1;
    }
}

pub fn applicative_add_vec_vec_index(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut c = vec![0; n];
    for i in 0..n {         // ⇐ COMPARING THIS...
        c[i] = a[i] + b[i];
    }
    return c;
}

pub fn applicative_add_vec_vec_itermut(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    let mut i = 0;
    let mut c = vec![0; n];
    for c_i in c.iter_mut() { // ⇐ ... TO THIS.
        *c_i = a[i] + b[i];
        i += 1;
    }
    return c;
}

#[test]
fn test_add_tens_tens_index() {
    let mut c = vec![0, 0, 0];
    imperative_add_vec_vec_index(&vec![10,20,30],
                                 &vec![10,20,30],
                                 &mut c);
    assert_eq!(c, vec![20, 40, 60]);
    c = applicative_add_vec_vec_index(&vec![10,20,30],
                                      &vec![10,20,30]);
    assert_eq!(c, vec![20, 40, 60]);
}

#[test]
fn test_add_tens_tens_itermut() {
    let mut c = vec![0, 0, 0];
    imperative_add_vec_vec_itermut(&vec![10,20,30],
                                   &vec![10,20,30],
                                   &mut c);
    assert_eq!(c, vec![20, 40, 60]);
    c = applicative_add_vec_vec_itermut(&vec![10,20,30],
                                        &vec![10,20,30]);
    assert_eq!(c, vec![20, 40, 60]);
}

macro_rules! add_benches {
    ($prefix: ident, $size: expr) => {
        pub mod $prefix {
            #[bench]
            fn bench_imperative_add_index(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let c = &mut vec![0; $size];
                b.iter(|| super::imperative_add_vec_vec_index(u, v, c));
                assert_eq!(c, &mut vec![20; $size]);
            }

            #[bench]
            fn bench_imperative_add_itermut(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let c = &mut vec![0; $size];
                b.iter(|| super::imperative_add_vec_vec_itermut(u, v, c));
                assert_eq!(c, &mut vec![20; $size]);
            }

            #[bench]
            fn bench_imperative_reallocating_add_index(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                b.iter(|| {
                    let c = &mut vec![0; $size];
                    super::imperative_add_vec_vec_index(u, v, c);
                    assert_eq!(c, &mut vec![20; $size]);
                });
            }

            #[bench]
            fn bench_imperative_reallocating_add_itermut(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                b.iter(|| {
                    let c = &mut vec![0; $size];
                    super::imperative_add_vec_vec_itermut(u, v, c);
                    assert_eq!(c, &mut vec![20; $size]);
                });
            }

            #[bench]
            fn bench_applicative_add_index(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let mut c = vec![0; $size];
                b.iter(|| c = super::applicative_add_vec_vec_index(u, v));
                assert_eq!(c, vec![20; $size]);
            }

            #[bench]
            fn bench_applicative_add_itermut(b: &mut ::test::Bencher) {
                let u = &vec![10; $size];
                let v = &vec![10; $size];
                let mut c = vec![0; $size];
                b.iter(|| c = super::applicative_add_vec_vec_itermut(u, v));
                assert_eq!(c, vec![20; $size]);
            }

            #[bench]
            fn linebreak_applicative_imperative_reallocating(_: &mut ::test::Bencher) { }
        }
    }
}

add_benches!(_b_0,    1_000);
add_benches!(_b_1,    2_000);
add_benches!(_b_2,    4_000);
add_benches!(_b_3,    8_000);
add_benches!(_b_4,   16_000);
add_benches!(_b_5,   32_000);
add_benches!(_b_6,   64_000);
add_benches!(_b_7,  128_000);
add_benches!(_b_8,  256_000);
add_benches!(_b_9,  512_000);

And here are the results of running the above benchmark on my laptop. (The "linebreak" entries are, as the name indicates, just there to visually separate the different data sets in the output.)

% multirust show-default && cargo test test && cargo bench
multirust: default toolchain: nightly
multirust: default location: /Users/fklock/.multirust/toolchains/nightly

rustc 1.8.0-nightly (0ef8d4260 2016-02-24)
cargo 0.9.0-nightly (e721289 2016-02-25)

     Running target/debug/iter_mut_bench-1f3abd6a849082b3

running 2 tests
test test_add_tens_tens_index ... ok
test test_add_tens_tens_itermut ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured

   Doc-tests iter_mut_bench

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

     Running target/release/iter_mut_bench-1f3abd6a849082b3

running 72 tests
test test_add_tens_tens_index ... ignored
test test_add_tens_tens_itermut ... ignored
test _b_0::bench_applicative_add_index                   ... bench:         760 ns/iter (+/- 102)
test _b_0::bench_applicative_add_itermut                 ... bench:       1,079 ns/iter (+/- 172)
test _b_0::bench_imperative_add_index                    ... bench:         766 ns/iter (+/- 142)
test _b_0::bench_imperative_add_itermut                  ... bench:         649 ns/iter (+/- 141)
test _b_0::bench_imperative_reallocating_add_index       ... bench:       1,377 ns/iter (+/- 342)
test _b_0::bench_imperative_reallocating_add_itermut     ... bench:       1,031 ns/iter (+/- 254)
test _b_0::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_1::bench_applicative_add_index                   ... bench:       1,735 ns/iter (+/- 1,079)
test _b_1::bench_applicative_add_itermut                 ... bench:       3,082 ns/iter (+/- 5,055)
test _b_1::bench_imperative_add_index                    ... bench:       1,530 ns/iter (+/- 433)
test _b_1::bench_imperative_add_itermut                  ... bench:       1,277 ns/iter (+/- 368)
test _b_1::bench_imperative_reallocating_add_index       ... bench:       2,870 ns/iter (+/- 1,037)
test _b_1::bench_imperative_reallocating_add_itermut     ... bench:       2,126 ns/iter (+/- 2,189)
test _b_1::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_2::bench_applicative_add_index                   ... bench:       3,375 ns/iter (+/- 1,176)
test _b_2::bench_applicative_add_itermut                 ... bench:       5,092 ns/iter (+/- 7,258)
test _b_2::bench_imperative_add_index                    ... bench:       3,163 ns/iter (+/- 2,309)
test _b_2::bench_imperative_add_itermut                  ... bench:       2,740 ns/iter (+/- 3,348)
test _b_2::bench_imperative_reallocating_add_index       ... bench:       5,820 ns/iter (+/- 4,686)
test _b_2::bench_imperative_reallocating_add_itermut     ... bench:       4,496 ns/iter (+/- 1,314)
test _b_2::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_3::bench_applicative_add_index                   ... bench:       7,107 ns/iter (+/- 4,992)
test _b_3::bench_applicative_add_itermut                 ... bench:      10,039 ns/iter (+/- 5,720)
test _b_3::bench_imperative_add_index                    ... bench:       6,395 ns/iter (+/- 4,394)
test _b_3::bench_imperative_add_itermut                  ... bench:       5,488 ns/iter (+/- 7,558)
test _b_3::bench_imperative_reallocating_add_index       ... bench:      11,609 ns/iter (+/- 10,878)
test _b_3::bench_imperative_reallocating_add_itermut     ... bench:       9,217 ns/iter (+/- 5,291)
test _b_3::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_4::bench_applicative_add_index                   ... bench:      13,699 ns/iter (+/- 16,487)
test _b_4::bench_applicative_add_itermut                 ... bench:      19,223 ns/iter (+/- 2,807)
test _b_4::bench_imperative_add_index                    ... bench:      11,990 ns/iter (+/- 1,706)
test _b_4::bench_imperative_add_itermut                  ... bench:      10,324 ns/iter (+/- 2,901)
test _b_4::bench_imperative_reallocating_add_index       ... bench:      37,090 ns/iter (+/- 13,541)
test _b_4::bench_imperative_reallocating_add_itermut     ... bench:      32,028 ns/iter (+/- 9,136)
test _b_4::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_5::bench_applicative_add_index                   ... bench:      45,018 ns/iter (+/- 20,031)
test _b_5::bench_applicative_add_itermut                 ... bench:      55,431 ns/iter (+/- 57,384)
test _b_5::bench_imperative_add_index                    ... bench:      27,231 ns/iter (+/- 15,549)
test _b_5::bench_imperative_add_itermut                  ... bench:      21,901 ns/iter (+/- 31,860)
test _b_5::bench_imperative_reallocating_add_index       ... bench:      75,070 ns/iter (+/- 14,594)
test _b_5::bench_imperative_reallocating_add_itermut     ... bench:      66,472 ns/iter (+/- 19,059)
test _b_5::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_6::bench_applicative_add_index                   ... bench:      83,719 ns/iter (+/- 13,320)
test _b_6::bench_applicative_add_itermut                 ... bench:     110,591 ns/iter (+/- 23,880)
test _b_6::bench_imperative_add_index                    ... bench:      52,350 ns/iter (+/- 11,850)
test _b_6::bench_imperative_add_itermut                  ... bench:      45,233 ns/iter (+/- 13,832)
test _b_6::bench_imperative_reallocating_add_index       ... bench:     139,599 ns/iter (+/- 36,488)
test _b_6::bench_imperative_reallocating_add_itermut     ... bench:     115,499 ns/iter (+/- 41,773)
test _b_6::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_7::bench_applicative_add_index                   ... bench:     152,048 ns/iter (+/- 23,623)
test _b_7::bench_applicative_add_itermut                 ... bench:     195,796 ns/iter (+/- 24,616)
test _b_7::bench_imperative_add_index                    ... bench:      93,273 ns/iter (+/- 17,124)
test _b_7::bench_imperative_add_itermut                  ... bench:      82,290 ns/iter (+/- 14,308)
test _b_7::bench_imperative_reallocating_add_index       ... bench:     260,315 ns/iter (+/- 42,152)
test _b_7::bench_imperative_reallocating_add_itermut     ... bench:     222,595 ns/iter (+/- 28,584)
test _b_7::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_8::bench_applicative_add_index                   ... bench:     320,524 ns/iter (+/- 46,424)
test _b_8::bench_applicative_add_itermut                 ... bench:     419,922 ns/iter (+/- 45,263)
test _b_8::bench_imperative_add_index                    ... bench:     191,193 ns/iter (+/- 32,244)
test _b_8::bench_imperative_add_itermut                  ... bench:     161,733 ns/iter (+/- 29,028)
test _b_8::bench_imperative_reallocating_add_index       ... bench:     574,651 ns/iter (+/- 96,008)
test _b_8::bench_imperative_reallocating_add_itermut     ... bench:     513,109 ns/iter (+/- 63,440)
test _b_8::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_9::bench_applicative_add_index                   ... bench:     673,698 ns/iter (+/- 114,850)
test _b_9::bench_applicative_add_itermut                 ... bench:     842,234 ns/iter (+/- 130,654)
test _b_9::bench_imperative_add_index                    ... bench:     381,448 ns/iter (+/- 76,419)
test _b_9::bench_imperative_add_itermut                  ... bench:     339,002 ns/iter (+/- 63,421)
test _b_9::bench_imperative_reallocating_add_index       ... bench:   1,705,358 ns/iter (+/- 231,283)
test _b_9::bench_imperative_reallocating_add_itermut     ... bench:   1,533,730 ns/iter (+/- 253,143)
test _b_9::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 2 ignored; 70 measured

The main conclusions I draw from these data sets are:

  • If you avoid reallocation (which means you don't use the applicative style), then iter_mut performs better than indexing for initializing a destination vector. (This is based on the lines labelled "imperative", comparing the "index" versus "itermut" pairs.)
  • If you reallocate on each iteration and use iter_mut, then for some bizarre reason, on small vectors ("b_0" through "b_3"), doing the reallocation outside of the initialization itself ("imperative_reallocating_add_itermut") performs better than the applicative style ("applicative_add_itermut"). At some threshold ("b_4" and above) the applicative itermut becomes faster than the reallocating itermut.
  • There is a huge performance difference between "imperative_reallocating_add_index" and "applicative_add_index". My current hypothesis is that LLVM is doing some amazing work to optimize all of the index-based accesses when the vector is allocated within the same function (and thus LLVM can see how large the backing store for the vector is).
@aturon
Member

aturon commented Feb 25, 2016

cc @bluss @dotdash

@steveklabnik steveklabnik added A-libs I-slow Issue: Problems and improvements with respect to performance of generated code. labels Feb 25, 2016
@dotdash
Contributor

dotdash commented Feb 25, 2016

@pnkfelix What hardware is this running on? I don't see quite as much of a difference between applicative_add_index and applicative_add_itermut. Also, your results seem to have quite a large degree of variance; was there something CPU-heavy running in the background, or something else that messed with CPU frequencies or caused task switches or whatever?

This is what I get on a mostly idle [email protected]

rustc 1.8.0-nightly (0ef8d4260 2016-02-24)


running 72 tests
test test_add_tens_tens_index ... ignored
test test_add_tens_tens_itermut ... ignored
test _b_0::bench_applicative_add_index                   ... bench:         994 ns/iter (+/- 3)
test _b_0::bench_applicative_add_itermut                 ... bench:       1,007 ns/iter (+/- 2)
test _b_0::bench_imperative_add_index                    ... bench:         919 ns/iter (+/- 1)
test _b_0::bench_imperative_add_itermut                  ... bench:         467 ns/iter (+/- 2)
test _b_0::bench_imperative_reallocating_add_index       ... bench:       1,528 ns/iter (+/- 4)
test _b_0::bench_imperative_reallocating_add_itermut     ... bench:         701 ns/iter (+/- 5)
test _b_0::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_1::bench_applicative_add_index                   ... bench:       1,991 ns/iter (+/- 14)
test _b_1::bench_applicative_add_itermut                 ... bench:       2,014 ns/iter (+/- 45)
test _b_1::bench_imperative_add_index                    ... bench:       1,864 ns/iter (+/- 12)
test _b_1::bench_imperative_add_itermut                  ... bench:         923 ns/iter (+/- 8)
test _b_1::bench_imperative_reallocating_add_index       ... bench:       3,096 ns/iter (+/- 13)
test _b_1::bench_imperative_reallocating_add_itermut     ... bench:       1,415 ns/iter (+/- 20)
test _b_1::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_2::bench_applicative_add_index                   ... bench:       3,959 ns/iter (+/- 21)
test _b_2::bench_applicative_add_itermut                 ... bench:       4,007 ns/iter (+/- 19)
test _b_2::bench_imperative_add_index                    ... bench:       3,654 ns/iter (+/- 8)
test _b_2::bench_imperative_add_itermut                  ... bench:       1,840 ns/iter (+/- 8)
test _b_2::bench_imperative_reallocating_add_index       ... bench:       6,107 ns/iter (+/- 22)
test _b_2::bench_imperative_reallocating_add_itermut     ... bench:       3,177 ns/iter (+/- 25)
test _b_2::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_3::bench_applicative_add_index                   ... bench:       7,888 ns/iter (+/- 31)
test _b_3::bench_applicative_add_itermut                 ... bench:       7,991 ns/iter (+/- 59)
test _b_3::bench_imperative_add_index                    ... bench:       7,298 ns/iter (+/- 17)
test _b_3::bench_imperative_add_itermut                  ... bench:       3,670 ns/iter (+/- 19)
test _b_3::bench_imperative_reallocating_add_index       ... bench:      12,167 ns/iter (+/- 35)
test _b_3::bench_imperative_reallocating_add_itermut     ... bench:       6,428 ns/iter (+/- 36)
test _b_3::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_4::bench_applicative_add_index                   ... bench:      16,148 ns/iter (+/- 126)
test _b_4::bench_applicative_add_itermut                 ... bench:      16,366 ns/iter (+/- 221)
test _b_4::bench_imperative_add_index                    ... bench:      14,645 ns/iter (+/- 148)
test _b_4::bench_imperative_add_itermut                  ... bench:       7,339 ns/iter (+/- 65)
test _b_4::bench_imperative_reallocating_add_index       ... bench:      43,689 ns/iter (+/- 309)
test _b_4::bench_imperative_reallocating_add_itermut     ... bench:      32,396 ns/iter (+/- 273)
test _b_4::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_5::bench_applicative_add_index                   ... bench:      51,163 ns/iter (+/- 170)
test _b_5::bench_applicative_add_itermut                 ... bench:      51,692 ns/iter (+/- 400)
test _b_5::bench_imperative_add_index                    ... bench:      29,256 ns/iter (+/- 186)
test _b_5::bench_imperative_add_itermut                  ... bench:      14,835 ns/iter (+/- 108)
test _b_5::bench_imperative_reallocating_add_index       ... bench:      87,279 ns/iter (+/- 380)
test _b_5::bench_imperative_reallocating_add_itermut     ... bench:      65,990 ns/iter (+/- 398)
test _b_5::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_6::bench_applicative_add_index                   ... bench:      97,496 ns/iter (+/- 258)
test _b_6::bench_applicative_add_itermut                 ... bench:      98,849 ns/iter (+/- 556)
test _b_6::bench_imperative_add_index                    ... bench:      58,496 ns/iter (+/- 298)
test _b_6::bench_imperative_add_itermut                  ... bench:      29,850 ns/iter (+/- 98)
test _b_6::bench_imperative_reallocating_add_index       ... bench:     165,795 ns/iter (+/- 864)
test _b_6::bench_imperative_reallocating_add_itermut     ... bench:     124,239 ns/iter (+/- 23,023)
test _b_6::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_7::bench_applicative_add_index                   ... bench:     193,556 ns/iter (+/- 1,081)
test _b_7::bench_applicative_add_itermut                 ... bench:     196,236 ns/iter (+/- 1,349)
test _b_7::bench_imperative_add_index                    ... bench:     117,039 ns/iter (+/- 334)
test _b_7::bench_imperative_add_itermut                  ... bench:      59,679 ns/iter (+/- 330)
test _b_7::bench_imperative_reallocating_add_index       ... bench:     330,134 ns/iter (+/- 4,477)
test _b_7::bench_imperative_reallocating_add_itermut     ... bench:     246,488 ns/iter (+/- 1,615)
test _b_7::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_8::bench_applicative_add_index                   ... bench:     390,786 ns/iter (+/- 5,833)
test _b_8::bench_applicative_add_itermut                 ... bench:     395,830 ns/iter (+/- 4,165)
test _b_8::bench_imperative_add_index                    ... bench:     236,522 ns/iter (+/- 1,089)
test _b_8::bench_imperative_add_itermut                  ... bench:     121,987 ns/iter (+/- 462)
test _b_8::bench_imperative_reallocating_add_index       ... bench:     666,722 ns/iter (+/- 10,072)
test _b_8::bench_imperative_reallocating_add_itermut     ... bench:     504,304 ns/iter (+/- 13,070)
test _b_8::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)
test _b_9::bench_applicative_add_index                   ... bench:     822,417 ns/iter (+/- 11,205)
test _b_9::bench_applicative_add_itermut                 ... bench:     827,093 ns/iter (+/- 10,132)
test _b_9::bench_imperative_add_index                    ... bench:     478,568 ns/iter (+/- 3,008)
test _b_9::bench_imperative_add_itermut                  ... bench:     250,486 ns/iter (+/- 1,647)
test _b_9::bench_imperative_reallocating_add_index       ... bench:   1,386,815 ns/iter (+/- 30,359)
test _b_9::bench_imperative_reallocating_add_itermut     ... bench:   1,093,889 ns/iter (+/- 10,134)
test _b_9::linebreak_applicative_imperative_reallocating ... bench:           0 ns/iter (+/- 0)

@bluss
Member

bluss commented Feb 25, 2016

Your code is not the same as in the advice; it's a different case 😄. I spent some time looking into lock-step iteration and haven't found anything better to recommend than the "slice to equal length and iterate" method. It requires quite exact adherence to the template, but then it optimizes very well. Forum post: https://users.rust-lang.org/t/how-to-zip-two-slices-efficiently/2048/14

While working on some demonstration code involving a matrix multiply, I discovered that the claim that Rust's iterator abstractions boil away to something competitive with hand-written assembly does not hold for a for-loop over slice.iter_mut.

We never claim this. Iterators are not a silver bullet and we should not pretend they are.

What we regularly celebrate is how well the slice iterator compiles. Because it does, in the very simplest cases, like exactly these:

/// A
for elt in slice.iter_mut() { *elt = value; }

/// B
let mut sum = 0; /* must be an integer, not float (side track) */
for elt in slice.iter() { sum += *elt; }

and so on. Lock step iteration (zip) is a completely different beast, and we know we are far from optimal there. See also this thread.

@bluss
Member

bluss commented Feb 25, 2016

This is how to write it using the "how to zip slices" approach, and it gives a 5-10x speedup over the best of the other cases (because it autovectorizes).

use std::cmp::min;

pub fn imperative_add_vec_vec_zipslices(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
    let len = min(a.len(), b.len());
    let len = min(len, c.len());
    // Reslice all three to the same known length so LLVM can hoist the
    // bounds checks out of the loop and autovectorize the body.
    let a = &a[..len];
    let b = &b[..len];
    let c = &mut c[..len];
    for i in 0..len {
        c[i] = a[i] + b[i];
    }
}

@pnkfelix
Member Author

@dotdash the machine is a Core i7-4980HQ @ 2.8GHz.

I didn't think it was too heavily loaded when I did the runs, but then again I did have two Firefox instances running, so who knows... I'll try it again after shutting those down and rebooting, to see if the variance drops.

The results you showed in your run are very encouraging; if the applicative cases match this well for most target configurations, then we can probably just close this ticket. (But I want to at least try a run on my desktop machine to see if I can replicate your results there.)

@pnkfelix
Member Author

@bluss hmm, thanks for that post. I guess I won't be able to show off the "prettiest" code in this case and still claim that it's the best you could hope to get.

@bluss
Member

bluss commented Feb 25, 2016

I'm not sure what the result should be for any of the loops, because they don't ensure that all indices visited in the loop are in bounds before starting.

I believe that's a boundary which will be very hard to overcome. What should eventually work here is zip(), or another way to ensure we don't hit bounds checking, like the "how to zip a slice" approach.

@pnkfelix Covering this approach is going to be an important part of my ndarray talk in March. ndarray puts this know-how to use for efficient ndarray-and-ndarray operations. (No hand-rolled matrix multiply, though.)

@bluss
Member

bluss commented Feb 25, 2016

@pnkfelix Another downer from the harsh reality of Rust + numerics:

Consider the sum of a sequence: .fold(0, |a, b| a + b)

  • Integers: Autovectorizes, produces a big hunk of code, is fast. 😄
  • Floating point: No autovectorization, because reordering float additions is not allowed by the default rules for float arithmetic. There's no "fast-math" support in Rust. 😞

This is also related to matrix multiplication, so I thought I'd mention it.
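A minimal sketch of the two cases (my illustration, not code from this thread; the function names are hypothetical):

pub fn sum_i32(xs: &[i32]) -> i32 {
    // Integer addition is associative, so LLVM is free to reorder
    // the reduction and autovectorize it.
    xs.iter().fold(0, |a, &b| a + b)
}

pub fn sum_f32(xs: &[f32]) -> f32 {
    // Float addition is not associative; without fast-math the compiler
    // must preserve the sequential order, so this stays scalar.
    xs.iter().fold(0.0, |a, &b| a + b)
}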

@pnkfelix
Member Author

@bluss It seems like you are focusing on the fact that all of my loops, even the ones that use iter_mut, have some indexing operations (namely for the source expressions) where I have not ensured a priori that the indices are in bounds. Does that sound correct? Are you thinking this could somehow be causing problems for the initialization of my destination vector via iter_mut?

Is it wrong of me to think that if iter_mut is slower than using indexing solely for the destination array, that represents overhead that should not be present, as long as I am using the same source expression for each element?

  • (I said "if" above because it could be that my benchmarking methodology was flawed, as noted by @dotdash.)
  • Note: my true goal was not actually to produce the tightest possible code for this case. It was just to show a simple bit of code (think "first slide in a tutorial") that uses indexing for both the inputs and the output, and then show how to use iter_mut for initializing the output; to keep the number of deltas between the two examples small, I kept using indexing for the inputs. I don't know why I decided to note this; I guess it's just to say that while it is good to know about these alternative forms, I'm still curious about the fundamental issue of using iter_mut to initialize a freshly allocated vector.

Update: fixed typo and reordered some words to try to make my text clearer.

@bluss
Member

bluss commented Feb 25, 2016

@pnkfelix Yes. I just have a very square head; both of those loops have a bounds check and a branch to panic in the loop body, so I didn't think anything else would be important (and it basically isn't). A branch to panic inside a loop will disrupt many potential optimizations, even seemingly unrelated ones.

itermut is slightly better on an x86-64 Sandy Bridge laptop:

test _b_1::bench_imperative_add_index                    ... bench:       2,321 ns/iter (+/- 52)
test _b_1::bench_imperative_add_itermut                  ... bench:       2,256 ns/iter (+/- 49)
test _b_1::bench_imperative_add_zipslices                ... bench:         295 ns/iter (+/- 8)

@bluss
Member

bluss commented Feb 25, 2016

And yes, .zip() outperforms the indexed approaches, even if it's far from optimal.

pub fn imperative_add_vec_vec_zip(a: &Vec<i32>, b: &Vec<i32>, c: &mut Vec<i32>) {
    for ((a, b), c) in a.iter().zip(b).zip(c) {
        *c = *a + *b;
    }
}
test _b_1::bench_imperative_add_zip                      ... bench:       1,607 ns/iter (+/- 22)

@pnkfelix
Member Author

@dotdash just an FYI: I redid the benchmarks on this Mac after logging in and not firing up any application but Terminal. The results were odd: the overall elapsed times themselves seem to have gone down, at least for the small inputs (so there probably was interference from background tasks), but the variance, if anything, seems to have gone up. Ugh.

  • (I also tried rebooting into Single User Mode (SUM), since in the past I have found that can be a decent way to reduce variance when benchmarking on Mac OS X machines, but my Rust installation is pretty tightly tied to my user account, while SUM leaves you in the root account, and re-logging in as myself is surprisingly difficult.)

The most important thing is that the results for the pairs of applicative cases still show iter_mut to be significantly more expensive than indexing. I'll throw out the small inputs, since the variance on them really is ridiculously large; here's what's left, in terms of the comparison of applicative approaches.

test _b_7::bench_applicative_add_index                   ... bench:     175,232 ns/iter (+/- 54,178)
test _b_7::bench_applicative_add_itermut                 ... bench:     195,411 ns/iter (+/- 132,282)

test _b_8::bench_applicative_add_index                   ... bench:     321,362 ns/iter (+/- 130,399)
test _b_8::bench_applicative_add_itermut                 ... bench:     376,764 ns/iter (+/- 362,100)

test _b_9::bench_applicative_add_index                   ... bench:     669,536 ns/iter (+/- 37,451)
test _b_9::bench_applicative_add_itermut                 ... bench:     811,612 ns/iter (+/- 43,974)

(I'm going to try revising the benchmark, or at least adding new variants, to use the technique @bluss suggested of first extracting slices of known size for the two inputs, just to make the source-expression code as simple as possible, and see what that leads to.)

@pnkfelix
Member Author

(I'm also going to try running the benchmark on a desktop Linux machine I have handy...)

@pnkfelix
Member Author

Okay, apparently @bluss's suggestion does make a huge difference to how well LLVM is able to optimize writing to the iter_mut-generated destination. I redid the tests (gist here); the old ones are called "original" ("orig" for short), and the new ones are called "resliced" ("resl" for short).
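A sketch of the resliced itermut variant, following @bluss's reslicing recipe (the exact code is in the linked gist; the function name here is hypothetical):

pub fn applicative_add_vec_vec_itermut_resliced(a: &Vec<i32>, b: &Vec<i32>) -> Vec<i32> {
    let n = a.len();
    // Reslice the inputs to a known common length up front so LLVM can
    // prove that every a[i] and b[i] below is in bounds.
    let a = &a[..n];
    let b = &b[..n];
    let mut i = 0;
    let mut c = vec![0; n];
    for c_i in c.iter_mut() {
        *c_i = a[i] + b[i];
        i += 1;
    }
    return c;
}

Here are the results for both variants: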

test _b_0_orig::bench_applicative_add_index                   ... bench:         698 ns/iter (+/- 85)
test _b_0_orig::bench_applicative_add_itermut                 ... bench:         989 ns/iter (+/- 23)

test _b_0_resl::bench_applicative_add_index                   ... bench:         556 ns/iter (+/- 10)
test _b_0_resl::bench_applicative_add_itermut                 ... bench:         650 ns/iter (+/- 131)

test _b_1_orig::bench_applicative_add_index                   ... bench:       1,390 ns/iter (+/- 165)
test _b_1_orig::bench_applicative_add_itermut                 ... bench:       1,980 ns/iter (+/- 81)

test _b_1_resl::bench_applicative_add_index                   ... bench:       1,195 ns/iter (+/- 109)
test _b_1_resl::bench_applicative_add_itermut                 ... bench:       1,294 ns/iter (+/- 9)

test _b_2_orig::bench_applicative_add_index                   ... bench:       2,830 ns/iter (+/- 318)
test _b_2_orig::bench_applicative_add_itermut                 ... bench:       4,422 ns/iter (+/- 510)

test _b_2_resl::bench_applicative_add_index                   ... bench:       2,594 ns/iter (+/- 58)
test _b_2_resl::bench_applicative_add_itermut                 ... bench:       2,617 ns/iter (+/- 22)

test _b_3_orig::bench_applicative_add_index                   ... bench:       5,654 ns/iter (+/- 206)
test _b_3_orig::bench_applicative_add_itermut                 ... bench:       8,506 ns/iter (+/- 1,067)

test _b_3_resl::bench_applicative_add_index                   ... bench:       4,741 ns/iter (+/- 111)
test _b_3_resl::bench_applicative_add_itermut                 ... bench:       5,467 ns/iter (+/- 105)

test _b_4_orig::bench_applicative_add_index                   ... bench:      11,828 ns/iter (+/- 1,301)
test _b_4_orig::bench_applicative_add_itermut                 ... bench:      16,470 ns/iter (+/- 382)

test _b_4_resl::bench_applicative_add_index                   ... bench:       9,980 ns/iter (+/- 361)
test _b_4_resl::bench_applicative_add_itermut                 ... bench:      11,014 ns/iter (+/- 629)

test _b_5_orig::bench_applicative_add_index                   ... bench:      38,566 ns/iter (+/- 3,976)
test _b_5_orig::bench_applicative_add_itermut                 ... bench:      46,196 ns/iter (+/- 270)

test _b_5_resl::bench_applicative_add_index                   ... bench:      32,448 ns/iter (+/- 326)
test _b_5_resl::bench_applicative_add_itermut                 ... bench:      35,200 ns/iter (+/- 8,848)

test _b_6_orig::bench_applicative_add_index                   ... bench:      75,874 ns/iter (+/- 15,710)
test _b_6_orig::bench_applicative_add_itermut                 ... bench:      90,326 ns/iter (+/- 519)

test _b_6_resl::bench_applicative_add_index                   ... bench:      63,022 ns/iter (+/- 2,912)
test _b_6_resl::bench_applicative_add_itermut                 ... bench:      68,033 ns/iter (+/- 2,112)

test _b_7_orig::bench_applicative_add_index                   ... bench:     140,325 ns/iter (+/- 15,254)
test _b_7_orig::bench_applicative_add_itermut                 ... bench:     177,363 ns/iter (+/- 4,136)

test _b_7_resl::bench_applicative_add_index                   ... bench:     122,729 ns/iter (+/- 4,840)
test _b_7_resl::bench_applicative_add_itermut                 ... bench:     133,543 ns/iter (+/- 935)

test _b_8_orig::bench_applicative_add_index                   ... bench:     286,251 ns/iter (+/- 39,570)
test _b_8_orig::bench_applicative_add_itermut                 ... bench:     367,963 ns/iter (+/- 50,702)

test _b_8_resl::bench_applicative_add_index                   ... bench:     272,901 ns/iter (+/- 6,165)
test _b_8_resl::bench_applicative_add_itermut                 ... bench:     282,444 ns/iter (+/- 29,195)

test _b_9_orig::bench_applicative_add_index                   ... bench:     636,538 ns/iter (+/- 46,587)
test _b_9_orig::bench_applicative_add_itermut                 ... bench:     807,549 ns/iter (+/- 31,832)

test _b_9_resl::bench_applicative_add_index                   ... bench:     572,607 ns/iter (+/- 6,672)
test _b_9_resl::bench_applicative_add_itermut                 ... bench:     620,572 ns/iter (+/- 8,681)

The performance delta between index and itermut is much more like what I was expecting to see for the "resl" cases.

So now I'm not sure if there's really anything actionable here for now...

@bluss
Member

bluss commented Oct 3, 2016

.zip() specialization has landed, so for the slice iterators this issue should be behind us. Yes, it can autovectorize in both the binary and ternary zip cases.

However, there is still the occasional hiccup in optimizing .zip(), which is issue #36920.

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@Mark-Simulacrum
Member

@pnkfelix I don't think I'm entirely following the discussion here, but can we close this? You mentioned in your last comment that there might not be anything actionable for now, but I don't know quite what we'd do here in the future either...

@Mark-Simulacrum Mark-Simulacrum added C-bug Category: This is a bug. C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 24, 2017
@Mark-Simulacrum
Member

I'm going to go ahead and close this since it looks like this isn't actionable currently.
