Vec::retain() is significantly slower than into_iter().filter().collect() #91497


Closed
kageru opened this issue Dec 3, 2021 · 8 comments · Fixed by #91527
Labels
A-codegen Area: Code generation A-collections Area: `std::collections` A-iterators Area: Iterators C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments


kageru commented Dec 3, 2021

I noticed today that using Vec::retain is much slower than filtering and collecting into a new vector.
I realize the former is probably more memory-efficient, but I still found it surprising that it would be that much slower (or really slower at all).

When testing with this code:

#![feature(test)]
extern crate test;

fn main() {
    let xs: Vec<i32> = (0..1000).collect();
    assert_eq!(even_with_retain(xs.clone()), even_with_filter(xs.clone()));
}

pub fn even_with_retain(mut xs: Vec<i32>) -> Vec<i32> {
    xs.retain(|x| x & 1 == 0);
    xs
}

pub fn even_with_filter(xs: Vec<i32>) -> Vec<i32> {
    xs.into_iter().filter(|x| x & 1 == 0).collect()
}

#[bench]
fn bench_retain(b: &mut test::Bencher) {
    let xs: Vec<i32> = (0..1000).collect();
    b.iter(|| assert_eq!(even_with_retain(test::black_box(xs.clone())).len(), 500));
}

#[bench]
fn bench_filter_collect(b: &mut test::Bencher) {
    let xs: Vec<i32> = (0..1000).collect();
    b.iter(|| assert_eq!(even_with_filter(test::black_box(xs.clone())).len(), 500));
}

on 1.59.0-nightly (48a5999 2021-12-01), I get these benchmark results:

test bench_filter_collect ... bench:         383 ns/iter (+/- 4)
test bench_retain         ... bench:       1,891 ns/iter (+/- 17)

on a Ryzen 5900X running Linux. Testing on a different machine (Xeon E3-1271 v3), I get similar numbers:

test bench_filter_collect ... bench:         498 ns/iter (+/- 29)
test bench_retain         ... bench:       1,800 ns/iter (+/- 44)

Vec::retain seemed like the obvious choice to me, so the fact that it is slower is either a bug or should be documented somewhere.
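(A quick aside on the memory-efficiency point: the following is my own illustrative check, not part of the benchmark above. It shows that retain compacts in place and keeps the original buffer, while filter().collect() writes the survivors into a freshly allocated Vec.)

```rust
fn main() {
    let xs: Vec<i32> = (0..1000).collect();

    // retain compacts in place: the original buffer (capacity >= 1000)
    // is kept even though only 500 elements survive.
    let mut a = xs.clone();
    a.retain(|x| x & 1 == 0);
    assert_eq!(a.len(), 500);
    assert!(a.capacity() >= 1000);

    // filter().collect() writes the survivors into a newly allocated Vec.
    let b: Vec<i32> = xs.into_iter().filter(|x| x & 1 == 0).collect();
    assert_eq!(b, a);
}
```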

Godbolt


BluBb-mADe commented Dec 3, 2021

i7 8700k windows 10
default configuration:

test bench_filter_collect ... bench:       1,038 ns/iter (+/- 48)
test bench_retain         ... bench:       1,316 ns/iter (+/- 19)

with lto = true

test bench_filter_collect ... bench:         360 ns/iter (+/- 14)
test bench_retain         ... bench:       1,335 ns/iter (+/- 46)

rustc 1.59.0-nightly (acbe444)


kageru commented Dec 3, 2021

Setting lto = true makes the disparity even bigger for me:

test bench_filter_collect ... bench:         199 ns/iter (+/- 6)
test bench_retain         ... bench:       1,878 ns/iter (+/- 26)

@inquisitivecrystal added the A-codegen, A-collections, A-iterators, C-bug, I-slow, T-compiler, and T-libs labels on Dec 3, 2021

the8472 commented Dec 3, 2021

It looks like retain is optimized (#81126) for small or large probabilities of keeping elements, i.e. long runs of elements that will be either kept or deleted. into_iter().filter().collect() on the other hand just reads elements one by one, filters them and writes out those that passed, i.e. it does not care about runs.

Since the test-case here drops every second element it's the worst possible case for retain, possibly even worse than random selection (since random clusters may have runs).

Edit: Hrm, the benchmarks look like they target high/low retain probabilities, but the code doesn't actually take advantage of runs; it still moves one element at a time once previous ones have been dropped.
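For readers following along, here is a safe-Rust sketch of the backshift idea retain is built on (my own simplification, not the std implementation, which uses raw pointers and handles panics in the predicate): a write cursor trails a read cursor, and each kept element is moved back over the holes left by deleted ones, one element at a time.

```rust
// Simplified model of Vec::retain's backshift compaction (illustrative only).
fn retain_backshift<T>(v: &mut Vec<T>, mut f: impl FnMut(&T) -> bool) {
    let mut write = 0;
    for read in 0..v.len() {
        if f(&v[read]) {
            v.swap(write, read); // move the kept element back over any hole
            write += 1;
        }
    }
    v.truncate(write); // drop the rejected elements now sitting in the tail
}

fn main() {
    let mut v: Vec<i32> = (0..10).collect();
    retain_backshift(&mut v, |x| x % 2 == 0);
    assert_eq!(v, vec![0, 2, 4, 6, 8]);
}
```

Because the loop moves one element per step once a hole exists, a predicate that alternates keep/drop on every element forces a move for every kept element, which matches the worst case described above.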


hkratz commented Dec 3, 2021

Something weird is going on, because if you put a copy of the current retain() implementation into the source file, it is much faster:

test bench_filter_collect ... bench:         307 ns/iter (+/- 12)
test bench_my_retain      ... bench:         431 ns/iter (+/- 3)
test bench_retain         ... bench:       1,999 ns/iter (+/- 7)

Godbolt

Edit: Doh, but at least now we know that #88060 caused some major slowdown.


the8472 commented Dec 3, 2021

You're not copying the most recent version of retain. process_one has gained a const bool parameter in #88060 and thus gets instantiated in two flavors.

#[inline(always)]
fn process_one<F, T, A: Allocator, const DELETED: bool>(
    f: &mut F,
    g: &mut BackshiftOnDrop<'_, T, A>,
) -> bool
where
    F: FnMut(&mut T) -> bool,
{
    // SAFETY: Unchecked element must be valid.
    let cur = unsafe { &mut *g.v.as_mut_ptr().add(g.processed_len) };
    if !f(cur) {
        // Advance early to avoid double drop if `drop_in_place` panicked.
        g.processed_len += 1;
        g.deleted_cnt += 1;
        // SAFETY: We never touch this element again after dropped.
        unsafe { ptr::drop_in_place(cur) };
        // We already advanced the counter.
        return false;
    }
    if DELETED {
        // SAFETY: `deleted_cnt` > 0, so the hole slot must not overlap with current element.
        // We use copy for move, and never touch this element again.
        unsafe {
            let hole_slot = g.v.as_mut_ptr().add(g.processed_len - g.deleted_cnt);
            ptr::copy_nonoverlapping(cur, hole_slot, 1);
        }
    }
    g.processed_len += 1;
    return true;
}

// Stage 1: Nothing was deleted.
while g.processed_len != original_len {
    if !process_one::<F, T, A, false>(&mut f, &mut g) {
        break;
    }
}

// Stage 2: Some elements were deleted.
while g.processed_len != original_len {
    process_one::<F, T, A, true>(&mut f, &mut g);
}
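(A hedged aside for context, not from the thread: a const bool generic parameter is resolved during monomorphization, so the DELETED branch costs nothing at runtime. Stage 1 runs the copy-free instantiation until the first deletion, then stage 2 takes over with the backshifting one. A minimal, hypothetical illustration of the pattern:)

```rust
#[inline(always)]
fn step<const DELETED: bool>(counter: &mut u32) -> &'static str {
    *counter += 1;
    // This branch is on a const generic, so each instantiation compiles
    // to straight-line code with the condition already resolved.
    if DELETED { "backshifting" } else { "no holes yet" }
}

fn main() {
    let mut n = 0;
    assert_eq!(step::<false>(&mut n), "no holes yet");
    assert_eq!(step::<true>(&mut n), "backshifting");
    assert_eq!(n, 2);
}
```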


BluBb-mADe commented Dec 3, 2021

I tried a few scenarios that should heavily favor retain, but the effect remains.
|x| *x == 1

test bench_filter_collect ... bench:         284 ns/iter (+/- 18)
test bench_retain         ... bench:       1,338 ns/iter (+/- 55)

|x| *x != 1

test bench_filter_collect ... bench:         413 ns/iter (+/- 130)
test bench_retain         ... bench:       1,369 ns/iter (+/- 28)

|x| *x == 500

test bench_filter_collect ... bench:         289 ns/iter (+/- 15)
test bench_retain         ... bench:       1,332 ns/iter (+/- 54)

|x| *x >= 500

test bench_filter_collect ... bench:         439 ns/iter (+/- 80)
test bench_retain         ... bench:       1,379 ns/iter (+/- 185)

i7 8700k w10 rustc 1.59.0-nightly (acbe444)


the8472 commented Dec 4, 2021

I'll take a stab at this. @rustbot claim


the8472 commented Dec 4, 2021

Submitted #91527, if anyone wants to confirm my benchmark results.
