Perf: improve sort via partition_validity to use fast path for bit map scan (up to 30% faster) #7962

Conversation
Performance result:

```
critcmp --filter "nulls to indices" fast_path_for_bit_map_scan main
group                                                 fast_path_for_bit_map_scan     main
-----                                                 --------------------------     ----
sort f32 nulls to indices 2^12                        1.00   17.3±0.12µs ? ?/sec     1.21   20.8±0.17µs ? ?/sec
sort i32 nulls to indices 2^10                        1.00    3.6±0.05µs ? ?/sec     1.28    4.6±0.03µs ? ?/sec
sort i32 nulls to indices 2^12                        1.00   14.9±0.10µs ? ?/sec     1.24   18.5±0.14µs ? ?/sec
sort string[0-100] nulls to indices 2^12              1.00   54.9±0.42µs ? ?/sec     1.08   59.4±1.11µs ? ?/sec
sort string[0-10] nulls to indices 2^12               1.00   71.8±4.21µs ? ?/sec     1.03   74.0±2.10µs ? ?/sec
sort string[0-400] nulls to indices 2^12              1.00   54.9±0.54µs ? ?/sec     1.06   58.4±0.50µs ? ?/sec
sort string[1000] nulls to indices 2^12               1.00   53.8±0.42µs ? ?/sec     1.07   57.7±0.43µs ? ?/sec
sort string[100] nulls to indices 2^12                1.00   53.1±0.55µs ? ?/sec     1.07   56.7±0.56µs ? ?/sec
sort string[10] dict nulls to indices 2^12            1.00   66.9±0.50µs ? ?/sec     1.06   70.7±0.62µs ? ?/sec
sort string[10] nulls to indices 2^12                 1.00   50.7±0.46µs ? ?/sec     1.07   54.3±0.48µs ? ?/sec
sort string_view[0-400] nulls to indices 2^12         1.00   27.1±0.23µs ? ?/sec     1.14   30.9±0.30µs ? ?/sec
sort string_view[10] nulls to indices 2^12            1.00   24.0±0.22µs ? ?/sec     1.16   27.8±0.23µs ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12  1.00   23.4±0.18µs ? ?/sec     1.15   27.0±0.20µs ? ?/sec
```
Force-pushed from dbf747c to c7a0ae9 (compare).
Thank you @zhuqi-lucas -- those are very impressive benchmark results.
I think for code like this we should add some more testing to cover all the corner cases -- I left some suggestions. Let me know what you think.
```rust
/// Scans the null bitmap and partitions valid/null indices efficiently.
/// Uses bit-level operations to extract bit positions.
/// This function is only called when nulls exist.
#[inline(always)]
```
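For context, the core idea behind this fast path can be sketched in plain Rust: process the validity bitmap one u64 word at a time and use `trailing_zeros` to jump directly between set bits, instead of branching on every bit. This is an illustrative stand-in, not the PR's actual code; `partition_valid_null` is a hypothetical name, and the sketch assumes a byte-aligned bitmap whose bit length is a multiple of 64.

```rust
// Hypothetical sketch of word-level bitmap partitioning. Assumes a
// byte-aligned bitmap whose bit length is a multiple of 64.
fn partition_valid_null(bitmap: &[u8]) -> (Vec<u32>, Vec<u32>) {
    let mut valid = Vec::new();
    let mut nulls = Vec::new();
    for (word_idx, chunk) in bitmap.chunks_exact(8).enumerate() {
        let base = (word_idx * 64) as u32;
        // Convert 8 bytes into a u64 word in little-endian order.
        let w = u64::from_le_bytes(chunk.try_into().unwrap());
        // Set bits are valid rows: jump between them via trailing_zeros.
        let mut v = w;
        while v != 0 {
            valid.push(base + v.trailing_zeros());
            v &= v - 1; // clear the lowest set bit
        }
        // Zero bits are nulls: invert the word and repeat.
        let mut z = !w;
        while z != 0 {
            nulls.push(base + z.trailing_zeros());
            z &= z - 1;
        }
    }
    (valid, nulls)
}

fn main() {
    // 64 bits of alternating 0/1: odd positions valid, even positions null.
    let (valid, nulls) = partition_valid_null(&[0b1010_1010; 8]);
    assert_eq!(valid.len(), 32);
    assert_eq!(nulls.len(), 32);
    assert_eq!(&valid[..4], &[1, 3, 5, 7]);
    assert_eq!(&nulls[..4], &[0, 2, 4, 6]);
    println!("partitioned {} valid / {} null", valid.len(), nulls.len());
}
```

The inner loops only iterate once per set bit, which is where the speedup over a branch-per-bit scan comes from when nulls are sparse or dense.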
this does look quite cool @zhuqi-lucas - if we want to go with this approach, I think we should add some more testing for this -- perhaps via a fuzz test that:
- Calls partition_validity_scan with random Boolean arrays
- Computes the expected results using a different algorithm (perhaps the original partition)
- Compares the results

We should also make sure that we test that this works when slicing the arrays (`array.slice(3, array.len() - 3)`, for example).
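A sketch of the fuzz loop described above, under the assumption that we compare a word-at-a-time partition against a naive per-bit loop. Both helpers here are simplified stand-ins written for this example (not the PR's actual functions), and the PRNG is a minimal xorshift so the sketch needs no external crates:

```rust
// Naive reference: test every bit with a branch.
fn naive_partition(bitmap: &[u8], len: usize) -> (Vec<u32>, Vec<u32>) {
    let (mut valid, mut nulls) = (Vec::new(), Vec::new());
    for i in 0..len {
        if bitmap[i / 8] & (1 << (i % 8)) != 0 {
            valid.push(i as u32);
        } else {
            nulls.push(i as u32);
        }
    }
    (valid, nulls)
}

// Fast path under test: scan 64 bits at a time via trailing_zeros.
fn word_partition(bitmap: &[u8], len: usize) -> (Vec<u32>, Vec<u32>) {
    let (mut valid, mut nulls) = (Vec::new(), Vec::new());
    let mut i = 0usize;
    while i < len {
        let bits = (len - i).min(64);
        // Assemble the next (up to) 64 bits into a word; `i` stays
        // byte-aligned because we always advance by 64.
        let start = i / 8;
        let end = ((i + bits + 7) / 8).min(bitmap.len());
        let mut w = 0u64;
        for (k, &byte) in bitmap[start..end].iter().enumerate() {
            w |= (byte as u64) << (8 * k);
        }
        let mut z = !w;
        if bits < 64 {
            let mask = (1u64 << bits) - 1;
            w &= mask;
            z &= mask;
        }
        let mut v = w;
        while v != 0 {
            valid.push(i as u32 + v.trailing_zeros());
            v &= v - 1; // clear the lowest set bit
        }
        while z != 0 {
            nulls.push(i as u32 + z.trailing_zeros());
            z &= z - 1;
        }
        i += 64;
    }
    (valid, nulls)
}

// Minimal xorshift PRNG so the fuzz loop needs no external crates.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    for _ in 0..200 {
        let len = (xorshift(&mut seed) % 300) as usize + 1;
        let bitmap: Vec<u8> = (0..(len + 7) / 8)
            .map(|_| xorshift(&mut seed) as u8)
            .collect();
        assert_eq!(word_partition(&bitmap, len), naive_partition(&bitmap, len));
    }
    println!("200 random bitmaps: fast path matches naive partition");
}
```

The real test would call the PR's function and the original partition instead of these stand-ins, and would additionally fuzz over slice offsets as discussed below.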
Thank you @alamb for the review; I added tests in the latest PR, and I am also using our existing logic to do this, so it will be safer.
arrow-ord/src/sort.rs (Outdated)

```rust
// Convert 8 bytes into a u64 word in little-endian order
let w = u64::from_le_bytes(buffer[start..start + 8].try_into().unwrap());

// Iterate over each set bit in `z` (null mask) using a bit-parallel
```
Isn't this the same as set_indices? Can we reuse it?
Thank you @Dandandan for the review and the good suggestion. I am using set_indices_32 in the latest PR because we need a Vec of u32 here; it is similar to set_indices, and I found that using set_indices_32 gives a 7% performance improvement compared to using set_indices and then converting to u32.
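For illustration, the difference being measured is a collector that writes u32 indices directly versus collecting usize and converting in a second pass; the extra map over the whole index vector is what the 7% covers. These helpers are simplified stand-ins, not arrow-rs's actual set_indices implementation:

```rust
// Illustrative stand-ins: collect set-bit positions directly as u32,
// versus collecting usize and converting afterwards.
fn set_indices_u32(bitmap: &[u8], len: usize) -> Vec<u32> {
    (0..len)
        .filter(|&i| bitmap[i / 8] & (1 << (i % 8)) != 0)
        .map(|i| i as u32)
        .collect()
}

fn set_indices_usize(bitmap: &[u8], len: usize) -> Vec<usize> {
    (0..len)
        .filter(|&i| bitmap[i / 8] & (1 << (i % 8)) != 0)
        .collect()
}

fn main() {
    let bitmap = [0b1100_0101u8, 0b0000_0010];
    // Direct u32 collection...
    let direct = set_indices_u32(&bitmap, 16);
    // ...versus the usize-then-convert path described in the comment,
    // which walks the resulting vector a second time.
    let converted: Vec<u32> = set_indices_usize(&bitmap, 16)
        .into_iter()
        .map(|i| i as u32)
        .collect();
    assert_eq!(direct, converted);
    assert_eq!(direct, vec![0, 2, 6, 7, 9]);
    println!("indices: {:?}", direct);
}
```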
arrow-ord/src/sort.rs (Outdated)

```rust
let n_words = buffer.len() / 8; // number of 64-bit chunks in bitmap

// Process the bitmap in 64-bit (8 byte) chunks for efficiency
for word_idx in 0..n_words {
```
I think this needs to use UnalignedBitChunk to deal with arbitrarily sliced arrays.
Fascinating results! I did not expect this to be faster since we still have to visit each bit once. Must be related to branch prediction, or just writing to one of the slices at a time.
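To illustrate why UnalignedBitChunk (or equivalent handling) is needed: after `array.slice(offset, len)` the validity bits start at an arbitrary bit offset, so a word-at-a-time scan has to stitch each 64-bit window together from two adjacent loads. A standalone sketch of that stitching (not arrow-rs's actual implementation; `read_u64_at_bit` is a name made up for this example):

```rust
// Read 64 bits starting at an arbitrary bit offset by combining the
// shifted current load with spill-over bits from the following byte.
fn read_u64_at_bit(bitmap: &[u8], bit_offset: usize) -> u64 {
    let byte = bit_offset / 8;
    let shift = (bit_offset % 8) as u32;
    // Load up to 8 bytes, zero-padding past the end of the buffer.
    let mut buf = [0u8; 8];
    let n = bitmap.len().saturating_sub(byte).min(8);
    buf[..n].copy_from_slice(&bitmap[byte..byte + n]);
    let mut w = u64::from_le_bytes(buf) >> shift;
    if shift != 0 && byte + 8 < bitmap.len() {
        // The top `shift` bits of the window come from the next byte.
        w |= (bitmap[byte + 8] as u64) << (64 - shift);
    }
    w
}

fn main() {
    // Viewing the same bitmap from bit 4: only the upper 4 bits of the
    // first byte remain in view, shifted down to the bottom of the word.
    let bitmap = [0xFFu8, 0, 0, 0, 0, 0, 0, 0, 0];
    assert_eq!(read_u64_at_bit(&bitmap, 0), 0xFF);
    assert_eq!(read_u64_at_bit(&bitmap, 4), 0x0F);
    println!("offset reads ok");
}
```

A loop that only iterates `buffer.len() / 8` byte-aligned words would silently ignore this offset, which is why sliced-array tests matter here.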
Thank you @jhorstmann for the review; I switched to using set_indices in the latest PR. It should be safer and clearer.
Thank you @alamb @Dandandan @jhorstmann for the reviews. I addressed the comments in the latest PR and also added rich tests. Thanks!
Latest result for the new implementation: it has a small regression, but is still a promising result, and the code is more readable and clear:

```
critcmp --filter "nulls to indices" fast_path_for_bit_map_scan main
group                                                 fast_path_for_bit_map_scan     main
-----                                                 --------------------------     ----
sort f32 nulls to indices 2^12                        1.00   17.7±0.16µs ? ?/sec     1.18   20.9±0.18µs ? ?/sec
sort i32 nulls to indices 2^10                        1.00    3.8±0.03µs ? ?/sec     1.20    4.5±0.04µs ? ?/sec
sort i32 nulls to indices 2^12                        1.00   15.3±0.18µs ? ?/sec     1.20   18.4±0.14µs ? ?/sec
sort string[0-100] nulls to indices 2^12              1.00   55.7±0.85µs ? ?/sec     1.06   59.0±0.77µs ? ?/sec
sort string[0-10] nulls to indices 2^12               1.00   70.3±2.88µs ? ?/sec     1.04   73.2±1.74µs ? ?/sec
sort string[0-400] nulls to indices 2^12              1.00   55.5±0.48µs ? ?/sec     1.05   58.1±0.68µs ? ?/sec
sort string[1000] nulls to indices 2^12               1.00   54.7±0.59µs ? ?/sec     1.05   57.5±0.62µs ? ?/sec
sort string[100] nulls to indices 2^12                1.00   54.0±0.64µs ? ?/sec     1.05   56.6±0.49µs ? ?/sec
sort string[10] dict nulls to indices 2^12            1.00   68.2±0.69µs ? ?/sec     1.05   71.5±0.76µs ? ?/sec
sort string[10] nulls to indices 2^12                 1.00   51.5±0.43µs ? ?/sec     1.05   54.4±0.59µs ? ?/sec
sort string_view[0-400] nulls to indices 2^12         1.00   27.9±0.27µs ? ?/sec     1.11   30.9±0.21µs ? ?/sec
sort string_view[10] nulls to indices 2^12            1.00   24.8±0.25µs ? ?/sec     1.11   27.6±0.26µs ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12  1.00   24.2±0.22µs ? ?/sec     1.12   27.1±0.20µs ? ?/sec
```
```
@@ -323,4 +380,110 @@ mod tests {
        let mask = &[223, 23];
        BitIterator::new(mask, 17, 0);
    }

    #[test]
    fn test_bit_index_u32_iterator_basic() {
```
These tests make sure the u32 index results are the same as the original usize results.
```
    }

    #[test]
    fn fuzz_partition_validity() {
```
These tests compare results to make sure our partition-null logic behaves the same as before.
🤖: Benchmark completed (details)
Looks like some pretty good improvement to me. Thank you @zhuqi-lucas. I need to spend some time studying this code in detail for correctness, and I hope to do so tomorrow.
Thank you @alamb!
```rust
    }
}

impl<'a> Iterator for BitIndexU32Iterator<'a> {
```
Is this implementation somehow more performant than using the existing BitIndexIterator and casting its items to u32? The only difference I see is in the masking of the lowest bit, `^= 1 << bit_pos` vs `&= self.curr - 1`, but I think llvm would know that those are equivalent. If it makes a difference, then we should adjust BitIndexIterator the same way.
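The equivalence claimed here is easy to check directly: when `bit_pos` is the index of the lowest set bit, XOR-ing with `1 << bit_pos` and AND-ing with `curr - 1` both clear exactly that bit. A quick standalone check:

```rust
fn main() {
    // For any nonzero word, clearing the lowest set bit via the XOR form
    // (`^= 1 << bit_pos`) or the AND form (`&= curr - 1`) is identical,
    // because bit_pos = trailing_zeros() names exactly that bit.
    for w in [1u64, 0b1011_0100, 0xAAAA, 0x8000_0000_0000_0000, u64::MAX] {
        let bit_pos = w.trailing_zeros();
        let xor_form = w ^ (1u64 << bit_pos);
        let and_form = w & (w - 1);
        assert_eq!(xor_form, and_form);
    }
    println!("both idioms clear the lowest set bit identically");
}
```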
Thank you @jhorstmann, good question. It actually is more performant than using the existing BitIndexIterator because we cast directly to u32. BitIndexIterator casts to usize, so when we use it we have to cast from usize to u32 afterwards, and in my testing that caused the slowness.
As for `^= 1 << bit_pos` vs `&= self.curr - 1`, the performance is almost the same; it does not show a difference, so I can use either of them.
I think I may change this to a macro, so it will look clearer.
I will do a test to compare the performance too
Update: made #7979 and I queued up benchmark runs
Conclusion from #7979 is that the u32 specific iterator is worth a 3-5% improvement: #7979 (comment)
Given that I think this PR makes sense to me
arrow-ord/src/sort.rs (Outdated)

```rust
// also test a sliced view
if len >= 4 {
    let slice = array.slice(2, len - 4);
```
I recommend picking a random slice, not always slicing by 2. Specifically, I think it is important to ensure we slice some arrays by more than 64, so that the scan has to skip an entire 64-bit word in addition to having an offset.
Thank you @alamb for the good suggestion! I will address it.
Addressed it in the latest PR.
🤖: Benchmark completed (details)
🤖: Benchmark completed (details)
It looks like there is one regression in the benchmark:

```
sort i32 to indices 2^10    1.13   12.9±0.02µs ? ?/sec    1.00   11.4±0.02µs ? ?/sec
sort i32 to indices 2^12    1.15   63.3±0.15µs ? ?/sec    1.00   55.0±0.15µs ? ?/sec
```

But I can't reproduce it locally, and the code does not change anything for the non-null cases.
I'll rerun -- it can potentially be something related to the test machine |
🤖: Benchmark completed (details)
Weird -- we get the same results. Maybe there is more code now, so cache effects come into play on the test machine 🤔 I am inclined to go with this PR to get the improvements in general.
Thank you @alamb. I switched to a Linux machine to try to reproduce it as well, but I cannot reproduce it on either Mac or Linux:

```
sort i32 to indices 2^10
                        time:   [12.857 µs 12.871 µs 12.886 µs]
                        change: [−0.8566% −0.6047% −0.3347%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

sort i32 to indices 2^12
                        time:   [57.662 µs 57.817 µs 58.056 µs]
                        change: [−3.9800% −3.5589% −3.1125%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
```
Maybe we need to crank up the test somehow -- trying to measure changes in
Good point @alamb. I tried increasing the length of the i32 arrays; there is still no regression for this PR:

```
sort i32 to indices 2^16
                        time:   [565.57 µs 566.31 µs 567.12 µs]
                        change: [−0.0297% +0.2001% +0.4033%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

sort i32 to indices 2^18
                        time:   [2.7443 ms 2.7497 ms 2.7554 ms]
                        change: [+0.1844% +0.4567% +0.7176%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outlier among 100 measurements (1.00%)
  1 (1.00%) high mild
```
Thank you @zhuqi-lucas.
@jhorstmann / @Dandandan, I wonder what your opinions on merging this PR are.
I plan to merge this PR tomorrow unless anyone else would like additional time to review.
Thanks again @zhuqi-lucas!
Which issue does this PR close?
This PR is a follow-up for #7937.
I want to experiment with the performance of using word-level (u64) bit scanning.
Details: #7937 (review)

Rationale for this change
Using word-level (u64) bit scanning.
We use set_indices to implement this, but we need a u32 index, so I also added set_indices_u32; it shows a 7% performance improvement compared to using set_indices and then casting to u32.

What changes are included in this PR?
Using word-level (u64) bit scanning.
We use set_indices to implement this, but we need a u32 index, so I also added set_indices_u32; it shows a 7% performance improvement compared to using set_indices and then casting to u32.

Are these changes tested?
Yes, unit tests and fuzz testing were added, and the existing sort fuzz test coverage also applies.

Are there any user-facing changes?
No