Perf: improve sort via partition_validity to use fast path for bit map scan (up to 30% faster) #7962

Merged
merged 9 commits into from
Jul 29, 2025

Conversation

zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Jul 18, 2025

Which issue does this PR close?

This PR is a follow-up to:

#7937

I want to measure the performance of word-level (u64) bit scanning:

Details:

#7937 (review)

Rationale for this change

Use word-level (u64) bit scanning.

This is implemented with set_indices. Because we need u32 indices, I also added set_indices_u32, which shows a 7% improvement compared with calling set_indices and then casting to u32.

What changes are included in this PR?

Use word-level (u64) bit scanning.

This is implemented with set_indices. Because we need u32 indices, I also added set_indices_u32, which shows a 7% improvement compared with calling set_indices and then casting to u32.
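
As a rough illustration of the word-level scanning described above, the following sketch walks a packed bitmap 64 bits at a time and emits u32 indices directly. The function name `set_bit_indices_u32` is hypothetical; this is not the code from this PR, nor the arrow crate's set_indices_u32. It assumes a least-significant-bit-first bitmap with no bit offset.

```rust
/// Hypothetical helper (not the arrow API): collect the positions of all set
/// bits among the first `len_bits` bits of an LSB-first packed bitmap,
/// scanning one u64 word at a time.
fn set_bit_indices_u32(bitmap: &[u8], len_bits: usize) -> Vec<u32> {
    assert!(len_bits <= bitmap.len() * 8);
    assert!(len_bits <= u32::MAX as usize);
    let mut out = Vec::new();
    for (word_idx, chunk) in bitmap.chunks(8).enumerate() {
        let base = word_idx * 64;
        if base >= len_bits {
            break;
        }
        // Pad a final partial chunk with zero bytes before decoding.
        let mut bytes = [0u8; 8];
        bytes[..chunk.len()].copy_from_slice(chunk);
        let mut word = u64::from_le_bytes(bytes);
        // Ignore any bits at or beyond `len_bits` in the last word.
        let remaining = len_bits - base;
        if remaining < 64 {
            word &= (1u64 << remaining) - 1;
        }
        // Visit only the set bits: find the lowest one, record it, clear it.
        while word != 0 {
            out.push(base as u32 + word.trailing_zeros());
            word &= word - 1;
        }
    }
    out
}
```

The inner loop runs once per set bit rather than once per bit, and each index is produced as a u32 without an intermediate usize value.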

Are these changes tested?

Yes. I added unit tests and fuzz tests, and the existing sort fuzz tests also cover this change.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 18, 2025
@zhuqi-lucas
Contributor Author

Performance result:

critcmp  --filter "nulls to indices" fast_path_for_bit_map_scan  main
group                                                   fast_path_for_bit_map_scan             main
-----                                                   --------------------------             ----
sort f32 nulls to indices 2^12                          1.00     17.3±0.12µs        ? ?/sec    1.21     20.8±0.17µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.00      3.6±0.05µs        ? ?/sec    1.28      4.6±0.03µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     14.9±0.10µs        ? ?/sec    1.24     18.5±0.14µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00     54.9±0.42µs        ? ?/sec    1.08     59.4±1.11µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00     71.8±4.21µs        ? ?/sec    1.03     74.0±2.10µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00     54.9±0.54µs        ? ?/sec    1.06     58.4±0.50µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00     53.8±0.42µs        ? ?/sec    1.07     57.7±0.43µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00     53.1±0.55µs        ? ?/sec    1.07     56.7±0.56µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00     66.9±0.50µs        ? ?/sec    1.06     70.7±0.62µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00     50.7±0.46µs        ? ?/sec    1.07     54.3±0.48µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     27.1±0.23µs        ? ?/sec    1.14     30.9±0.30µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     24.0±0.22µs        ? ?/sec    1.16     27.8±0.23µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     23.4±0.18µs        ? ?/sec    1.15     27.0±0.20µs        ? ?/sec

Contributor

@alamb alamb left a comment

Thank you @zhuqi-lucas -- those are very impressive benchmark results

I think for code like this we should add some more testing to cover all the corner cases -- I left some suggestions. Let me know what you think

/// Scans the null bitmap and partitions valid/null indices efficiently.
/// Uses bit-level operations to extract bit positions.
/// This function is only called when nulls exist.
#[inline(always)]
Contributor

This does look quite cool @zhuqi-lucas. If we want to go with this approach, I think we should add some more testing for it, perhaps via a fuzz test that:

  1. Calls partition_validity_scan with random Boolean arrays
  2. Computes the expected results using a different algorithm (perhaps the original partition)
  3. Compares the results

We should also make sure that we test that this works when slicing the arrays (array.slice(3, array.len() - 3), for example).
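
The three steps above, plus the slicing check, can be sketched as follows. This is a minimal standalone illustration, not the test added in this PR: `partition_naive`, `partition_words`, `fuzz_partition`, and the small `XorShift64` generator are all hypothetical names, and plain `&[bool]` slices stand in for Arrow arrays so that the sketch needs no external crates.

```rust
/// Tiny xorshift64 PRNG so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Reference implementation: visit every position one at a time.
fn partition_naive(validity: &[bool]) -> (Vec<u32>, Vec<u32>) {
    let (mut valid, mut nulls) = (Vec::new(), Vec::new());
    for (i, &v) in validity.iter().enumerate() {
        if v { valid.push(i as u32) } else { nulls.push(i as u32) }
    }
    (valid, nulls)
}

/// Candidate: pack 64 positions into a word, then scan the set bits of the
/// word (valid) and of its complement (null).
fn partition_words(validity: &[bool]) -> (Vec<u32>, Vec<u32>) {
    let (mut valid, mut nulls) = (Vec::new(), Vec::new());
    for (w, chunk) in validity.chunks(64).enumerate() {
        let mut word = 0u64;
        for (b, &v) in chunk.iter().enumerate() {
            word |= (v as u64) << b;
        }
        // Mask of positions that actually exist in this (possibly short) chunk.
        let live = if chunk.len() == 64 { u64::MAX } else { (1u64 << chunk.len()) - 1 };
        let base = (w * 64) as u32;
        let mut set = word;
        while set != 0 {
            valid.push(base + set.trailing_zeros());
            set &= set - 1;
        }
        let mut unset = !word & live;
        while unset != 0 {
            nulls.push(base + unset.trailing_zeros());
            unset &= unset - 1;
        }
    }
    (valid, nulls)
}

/// Runs `iters` random cases, each checked on the full input and on a
/// randomly sliced view. Returns true if both implementations always agree.
fn fuzz_partition(seed: u64, iters: usize) -> bool {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be nonzero
    for _ in 0..iters {
        let len = (rng.next_u64() % 300) as usize;
        let data: Vec<bool> = (0..len).map(|_| (rng.next_u64() & 1) == 1).collect();
        let start = (rng.next_u64() % (len as u64 + 1)) as usize;
        let slice = &data[start..];
        if partition_naive(&data) != partition_words(&data)
            || partition_naive(slice) != partition_words(slice)
        {
            return false;
        }
    }
    true
}
```

Because lengths go up to 299 and the slice start is random, many cases begin more than 64 positions into the input and end on a partial word.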

Contributor Author

@zhuqi-lucas zhuqi-lucas Jul 19, 2025

Thank you @alamb for the review. I added tests in the latest PR, and I am also using our existing logic to do this, so it should be safer.

// Convert 8 bytes into a u64 word in little-endian order
let w = u64::from_le_bytes(buffer[start..start + 8].try_into().unwrap());

// Iterate over each set bit in `z` (null mask) using a bit-parallel
Contributor

Isn't this the same as set_indices? Can we reuse it?

Contributor Author

@zhuqi-lucas zhuqi-lucas Jul 19, 2025

Thank you @Dandandan for the review and the good suggestion. I am using set_indices_u32 in the latest PR because we need a Vec of u32 here. It is similar to set_indices, and I found that using set_indices_u32 gives a 7% performance improvement compared with using set_indices and then converting to u32.

Contributor

@Dandandan Dandandan left a comment

.

let n_words = buffer.len() / 8; // number of 64-bit chunks in bitmap

// Process the bitmap in 64-bit (8 byte) chunks for efficiency
for word_idx in 0..n_words {
Contributor

I think this needs to use UnalignedBitChunk to deal with arbitrarily sliced arrays.

Fascinating results! I did not expect this to be faster since we still have to visit each bit once. Must be related to branch prediction, or just writing to one of the slices at a time.
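
To make the slicing concern concrete, here is a hand-rolled sketch of scanning a bit range that starts at an arbitrary bit offset. It does not use UnalignedBitChunk and is not the code in this PR; the function name `set_indices_sliced` and its signature are assumptions made for illustration. Indices are reported relative to the start of the slice.

```rust
/// Hypothetical helper: indices (relative to `offset`) of the set bits in the
/// bit range [offset, offset + len) of an LSB-first packed bitmap.
fn set_indices_sliced(bitmap: &[u8], offset: usize, len: usize) -> Vec<u32> {
    assert!(offset + len <= bitmap.len() * 8);
    assert!(len <= u32::MAX as usize);
    let end = offset + len;
    // Start at the byte containing the first bit, skipping whole bytes
    // (and therefore whole words) that lie before the slice.
    let first_byte = offset / 8;
    let mut out = Vec::new();
    for (j, chunk) in bitmap[first_byte..].chunks(8).enumerate() {
        // Absolute bit position of bit 0 of this word.
        let word_start = first_byte * 8 + j * 64;
        if word_start >= end {
            break;
        }
        let mut bytes = [0u8; 8];
        bytes[..chunk.len()].copy_from_slice(chunk);
        let mut word = u64::from_le_bytes(bytes);
        // Drop bits before `offset` (only possible in the first word; shift < 8).
        if word_start < offset {
            word &= u64::MAX << (offset - word_start);
        }
        // Drop bits at or after `end` (only possible in the last word).
        if end - word_start < 64 {
            word &= (1u64 << (end - word_start)) - 1;
        }
        while word != 0 {
            let abs = word_start + word.trailing_zeros() as usize;
            out.push((abs - offset) as u32);
            word &= word - 1;
        }
    }
    out
}
```

A loop that assumes the slice starts at bit 0 of the buffer, as in `for word_idx in 0..n_words`, would need this kind of leading and trailing masking to stay correct for sliced arrays.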

Contributor Author

Thank you @jhorstmann for the review. I switched to set_indices in the latest PR, which should be safer and clearer.

@zhuqi-lucas
Contributor Author

Thank you @alamb @Dandandan @jhorstmann for the review.

I addressed the comments in the latest PR and also added more thorough tests. Thanks!

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Jul 19, 2025

Latest results for the new implementation. There is a small regression compared with the first version, but the results are still promising and the code is more readable and clear:

critcmp  --filter "nulls to indices" fast_path_for_bit_map_scan  main
group                                                   fast_path_for_bit_map_scan             main
-----                                                   --------------------------             ----
sort f32 nulls to indices 2^12                          1.00     17.7±0.16µs        ? ?/sec    1.18     20.9±0.18µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.00      3.8±0.03µs        ? ?/sec    1.20      4.5±0.04µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     15.3±0.18µs        ? ?/sec    1.20     18.4±0.14µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00     55.7±0.85µs        ? ?/sec    1.06     59.0±0.77µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00     70.3±2.88µs        ? ?/sec    1.04     73.2±1.74µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00     55.5±0.48µs        ? ?/sec    1.05     58.1±0.68µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00     54.7±0.59µs        ? ?/sec    1.05     57.5±0.62µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00     54.0±0.64µs        ? ?/sec    1.05     56.6±0.49µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00     68.2±0.69µs        ? ?/sec    1.05     71.5±0.76µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00     51.5±0.43µs        ? ?/sec    1.05     54.4±0.59µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     27.9±0.27µs        ? ?/sec    1.11     30.9±0.21µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     24.8±0.25µs        ? ?/sec    1.11     27.6±0.26µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     24.2±0.22µs        ? ?/sec    1.12     27.1±0.20µs        ? ?/sec

@@ -323,4 +380,110 @@ mod tests {
let mask = &[223, 23];
BitIterator::new(mask, 17, 0);
}

#[test]
fn test_bit_index_u32_iterator_basic() {
Contributor Author

These tests make sure the u32 index iterator produces the same results as the original usize one.

}

#[test]
fn fuzz_partition_validity() {
Contributor Author

These tests compare the results to make sure our null partition logic behaves the same as before.

@alamb
Contributor

alamb commented Jul 22, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_path_for_bit_map_scan (556e673) to 82821e5 diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_path_for_bit_map_scan
Results will be posted here when complete

@alamb
Contributor

alamb commented Jul 22, 2025

🤖: Benchmark completed

Details

group                                                   fast_path_for_bit_map_scan             main
-----                                                   --------------------------             ----
lexsort (bool, bool) 2^12                               1.00    116.3±0.41µs        ? ?/sec    1.00    116.5±0.37µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    156.4±0.30µs        ? ?/sec    1.01    157.7±0.28µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.01     45.0±0.07µs        ? ?/sec    1.00     44.6±0.11µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.01    211.6±0.38µs        ? ?/sec    1.00    209.4±0.49µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     39.5±0.12µs        ? ?/sec    1.01     39.8±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.6±0.13µs        ? ?/sec    1.00     41.2±0.14µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.8±0.13µs        ? ?/sec    1.00     78.6±0.29µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.01    212.3±0.37µs        ? ?/sec    1.00    209.7±0.50µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     52.8±0.18µs        ? ?/sec    1.00     52.2±0.28µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.01    247.2±0.69µs        ? ?/sec    1.00    245.3±0.48µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.01     85.1±0.21µs        ? ?/sec    1.00     84.4±0.24µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.01     86.1±0.14µs        ? ?/sec    1.00     85.3±0.24µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.01     95.6±0.20µs        ? ?/sec    1.00     94.7±0.17µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.01    247.0±0.38µs        ? ?/sec    1.00    245.5±0.59µs        ? ?/sec
rank f32 2^12                                           1.06     72.7±0.46µs        ? ?/sec    1.00     68.5±1.30µs        ? ?/sec
rank f32 nulls 2^12                                     1.06     37.5±0.07µs        ? ?/sec    1.00     35.2±0.07µs        ? ?/sec
rank string[10] 2^12                                    1.03    263.4±0.88µs        ? ?/sec    1.00    255.7±0.85µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    122.2±0.67µs        ? ?/sec    1.01    123.6±0.38µs        ? ?/sec
sort f32 2^12                                           1.00     60.3±0.36µs        ? ?/sec    1.00     60.2±0.45µs        ? ?/sec
sort f32 nulls 2^12                                     1.01     29.2±0.21µs        ? ?/sec    1.00     28.9±0.09µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     39.7±0.11µs        ? ?/sec    1.38     54.7±0.20µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.3±0.27µs        ? ?/sec    1.05     76.1±0.22µs        ? ?/sec
sort i32 2^10                                           1.00      7.3±0.01µs        ? ?/sec    1.00      7.3±0.01µs        ? ?/sec
sort i32 2^12                                           1.00     35.8±0.10µs        ? ?/sec    1.00     35.9±0.26µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.8±0.01µs        ? ?/sec    1.00      4.8±0.02µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.2±0.05µs        ? ?/sec    1.00     20.3±0.06µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.04      8.1±0.01µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     34.8±0.10µs        ? ?/sec    1.26     43.8±0.11µs        ? ?/sec
sort i32 to indices 2^10                                1.13     13.0±0.02µs        ? ?/sec    1.00     11.4±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.4±0.22µs        ? ?/sec    1.00     55.0±0.18µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.3±0.01µs        ? ?/sec    1.11      7.0±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.4±0.02µs        ? ?/sec    1.06      8.9±0.04µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00    151.8±0.40µs        ? ?/sec    1.11    168.1±0.46µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    332.4±0.59µs        ? ?/sec    1.00    331.4±0.62µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00    119.2±0.45µs        ? ?/sec    1.18    140.9±0.45µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    262.8±0.64µs        ? ?/sec    1.00    263.1±0.68µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    131.7±0.40µs        ? ?/sec    1.13    148.8±0.37µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    283.2±1.65µs        ? ?/sec    1.01    284.6±1.12µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00    124.5±0.40µs        ? ?/sec    1.11    138.7±0.59µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    252.6±1.48µs        ? ?/sec    1.00    251.9±1.93µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00    118.8±0.55µs        ? ?/sec    1.13    134.0±0.32µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    248.1±1.20µs        ? ?/sec    1.01    250.2±1.82µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    153.9±0.64µs        ? ?/sec    1.13    173.9±0.39µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    318.3±1.31µs        ? ?/sec    1.01    322.0±0.73µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00    120.5±1.15µs        ? ?/sec    1.15    138.2±0.28µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    243.5±0.48µs        ? ?/sec    1.01    245.7±0.59µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     64.1±0.29µs        ? ?/sec    1.25     80.3±0.32µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    133.6±0.53µs        ? ?/sec    1.01    134.7±0.41µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     46.9±0.40µs        ? ?/sec    1.32     61.8±0.43µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.3±0.39µs        ? ?/sec    1.00    104.0±0.29µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     44.6±0.31µs        ? ?/sec    1.31     58.3±0.29µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     95.1±0.58µs        ? ?/sec    1.00     95.4±0.58µs        ? ?/sec

@alamb
Contributor

alamb commented Jul 22, 2025

Looks like some pretty good improvement to me. Thank you @zhuqi-lucas

I need to spend some time studying this code in detail for correctness, and I hope to do so tomorrow

@zhuqi-lucas
Contributor Author

Looks like some pretty good improvement to me. Thank you @zhuqi-lucas

I need to spend some time studying this code in detail for correctness, and I hope to do so tomorrow

Thank you @alamb !

}
}

impl<'a> Iterator for BitIndexU32Iterator<'a> {
Contributor

Is this implementation somehow more performant than using the existing BitIndexIterator and casting its items to u32? The only difference I see is in the masking of the lowest bit, ^= 1 << bit_pos vs &= self.curr - 1, but I think llvm would know that those are equivalent. If it makes a difference, then we should adjust BitIndexIterator the same way.

Contributor Author

Thank you @jhorstmann for the good question. It is actually more performant than using the existing BitIndexIterator because it produces u32 directly. BitIndexIterator yields usize, so using it requires a cast from usize to u32, and in my testing that cast caused the slowdown.

As for ^= 1 << bit_pos vs &= self.curr - 1, the performance is almost the same and shows no measurable difference, so I can use either of them.

I think I may change this to a macro so that it looks clearer.
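
For reference, the two bit-clearing forms discussed above can be checked against each other with a small sketch. The function names are hypothetical, and the input is assumed to be nonzero, as it is inside a loop that only runs while set bits remain.

```rust
/// Clear the lowest set bit by XOR-ing with a single-bit mask at its position.
fn clear_lowest_xor(word: u64) -> u64 {
    assert!(word != 0, "no set bit to clear");
    word ^ (1u64 << word.trailing_zeros())
}

/// Clear the lowest set bit with the classic `w & (w - 1)` form.
fn clear_lowest_and(word: u64) -> u64 {
    assert!(word != 0, "no set bit to clear");
    word & (word - 1)
}
```

Both functions return the same value for every nonzero input, which is consistent with the observation that the choice between them does not affect performance.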

Contributor

@alamb alamb Jul 23, 2025

I will do a test to compare the performance too

Update: made #7979 and I queued up benchmark runs

Contributor Author

Thank you @alamb !

I will do a test to compare the performance too

Update: made #7979 and I queued up benchmark runs

Contributor

Conclusion from #7979 is that the u32 specific iterator is worth a 3-5% improvement: #7979 (comment)

Given that, this PR makes sense to me


// also test a sliced view
if len >= 4 {
let slice = array.slice(2, len - 4);
Contributor

I recommend picking a random slice offset rather than always slicing by 2

Specifically, I think it is important to slice some arrays by more than 64 so that the scan has to skip an entire 64-bit word in addition to handling an offset

Contributor Author

Thank you @alamb for the good suggestion! I will address it.

Contributor Author

Addressed it in the latest PR.

@alamb
Contributor

alamb commented Jul 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_path_for_bit_map_scan (3d6bcea) to 82821e5 diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_path_for_bit_map_scan
Results will be posted here when complete

@alamb
Contributor

alamb commented Jul 23, 2025

🤖: Benchmark completed

Details

group                                                   fast_path_for_bit_map_scan             main
-----                                                   --------------------------             ----
lexsort (bool, bool) 2^12                               1.00    116.0±0.54µs        ? ?/sec    1.00    116.2±0.41µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    156.4±0.25µs        ? ?/sec    1.01    157.8±0.38µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.1±0.09µs        ? ?/sec    1.01     45.4±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.01    211.6±0.55µs        ? ?/sec    1.00    209.7±0.37µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.8±0.06µs        ? ?/sec    1.00     38.9±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     41.3±0.08µs        ? ?/sec    1.01     41.5±0.13µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.6±0.12µs        ? ?/sec    1.00     78.5±0.18µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.01    214.1±0.29µs        ? ?/sec    1.00    211.2±0.33µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     52.6±0.55µs        ? ?/sec    1.00     52.3±0.10µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.01    247.0±0.40µs        ? ?/sec    1.00    245.5±0.51µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     84.7±0.16µs        ? ?/sec    1.00     84.6±0.32µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.00     85.6±0.19µs        ? ?/sec    1.00     85.6±0.35µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.00     95.0±0.23µs        ? ?/sec    1.00     95.1±0.23µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.01    247.9±4.36µs        ? ?/sec    1.00    244.9±0.43µs        ? ?/sec
rank f32 2^12                                           1.07     72.8±0.29µs        ? ?/sec    1.00     67.9±0.27µs        ? ?/sec
rank f32 nulls 2^12                                     1.06     37.4±0.05µs        ? ?/sec    1.00     35.1±0.06µs        ? ?/sec
rank string[10] 2^12                                    1.00    251.7±0.40µs        ? ?/sec    1.01    255.1±0.47µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    121.5±0.23µs        ? ?/sec    1.01    123.0±0.47µs        ? ?/sec
sort f32 2^12                                           1.01     60.5±0.61µs        ? ?/sec    1.00     59.9±0.43µs        ? ?/sec
sort f32 nulls 2^12                                     1.01     29.1±0.11µs        ? ?/sec    1.00     28.9±0.11µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     39.8±0.16µs        ? ?/sec    1.36     54.3±0.11µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.2±0.27µs        ? ?/sec    1.05     76.0±0.33µs        ? ?/sec
sort i32 2^10                                           1.00      7.3±0.01µs        ? ?/sec    1.00      7.3±0.01µs        ? ?/sec
sort i32 2^12                                           1.00     35.8±0.13µs        ? ?/sec    1.00     35.9±0.11µs        ? ?/sec
sort i32 nulls 2^10                                     1.01      4.8±0.01µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.3±0.05µs        ? ?/sec    1.00     20.3±0.05µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.04      8.1±0.01µs        ? ?/sec    1.00      7.8±0.01µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     34.8±0.09µs        ? ?/sec    1.26     43.9±0.18µs        ? ?/sec
sort i32 to indices 2^10                                1.13     12.9±0.02µs        ? ?/sec    1.00     11.4±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.3±0.15µs        ? ?/sec    1.00     55.0±0.15µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.5±0.01µs        ? ?/sec    1.08      7.0±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.4±0.01µs        ? ?/sec    1.06      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00    153.1±1.38µs        ? ?/sec    1.09    167.3±0.33µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    333.2±3.43µs        ? ?/sec    1.00    331.6±0.99µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00    118.5±0.22µs        ? ?/sec    1.19    140.6±0.39µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.01    262.6±0.63µs        ? ?/sec    1.00    261.0±0.36µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    132.0±1.89µs        ? ?/sec    1.12    148.3±0.32µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    283.4±3.33µs        ? ?/sec    1.01    285.8±0.75µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00    124.8±1.85µs        ? ?/sec    1.12    139.1±0.40µs        ? ?/sec
sort string[1000] to indices 2^12                       1.01    254.8±5.59µs        ? ?/sec    1.00    252.2±1.27µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00    118.7±2.56µs        ? ?/sec    1.13    133.9±1.90µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    247.5±3.20µs        ? ?/sec    1.00    248.3±1.20µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    154.3±0.40µs        ? ?/sec    1.13    173.6±0.30µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    316.7±1.04µs        ? ?/sec    1.01    321.1±0.50µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00    120.0±1.33µs        ? ?/sec    1.14    136.6±0.42µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    245.0±2.18µs        ? ?/sec    1.05    256.2±0.69µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     63.8±0.15µs        ? ?/sec    1.26     80.1±0.55µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    133.6±0.32µs        ? ?/sec    1.01    134.4±0.21µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     46.9±0.34µs        ? ?/sec    1.32     61.9±0.59µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.2±0.32µs        ? ?/sec    1.00    104.1±0.36µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     44.6±0.28µs        ? ?/sec    1.31     58.5±0.31µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     95.2±0.34µs        ? ?/sec    1.00     95.5±0.49µs        ? ?/sec

@alamb
Contributor

alamb commented Jul 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_path_for_bit_map_scan (3d6bcea) to 82821e5 diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_path_for_bit_map_scan
Results will be posted here when complete

@alamb
Contributor

alamb commented Jul 23, 2025

🤖: Benchmark completed

Details

group                                                   fast_path_for_bit_map_scan             main
-----                                                   --------------------------             ----
lexsort (bool, bool) 2^12                               1.01    116.3±0.62µs        ? ?/sec    1.00    115.7±0.35µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    157.2±0.31µs        ? ?/sec    1.00    157.6±0.39µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.01     45.1±0.21µs        ? ?/sec    1.00     44.5±0.07µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    212.6±0.75µs        ? ?/sec    1.00    212.2±1.83µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.8±0.04µs        ? ?/sec    1.02     39.5±0.11µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     41.3±0.06µs        ? ?/sec    1.00     41.2±0.05µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.01     79.2±0.39µs        ? ?/sec    1.00     78.5±0.09µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.01    212.8±0.40µs        ? ?/sec    1.00    210.8±0.61µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     52.7±0.15µs        ? ?/sec    1.00     52.1±0.28µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.01    247.1±0.77µs        ? ?/sec    1.00    245.1±0.78µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.01     85.3±0.29µs        ? ?/sec    1.00     84.2±0.28µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.01     86.2±0.17µs        ? ?/sec    1.00     85.0±0.22µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.01     95.6±0.22µs        ? ?/sec    1.00     94.6±0.21µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.01    247.2±0.44µs        ? ?/sec    1.00    244.7±0.44µs        ? ?/sec
rank f32 2^12                                           1.07     72.5±0.30µs        ? ?/sec    1.00     68.0±0.28µs        ? ?/sec
rank f32 nulls 2^12                                     1.06     37.4±0.13µs        ? ?/sec    1.00     35.4±0.08µs        ? ?/sec
rank string[10] 2^12                                    1.00    254.9±0.56µs        ? ?/sec    1.00    255.6±0.51µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    123.1±0.16µs        ? ?/sec    1.00    123.3±0.25µs        ? ?/sec
sort f32 2^12                                           1.01     60.6±0.54µs        ? ?/sec    1.00     59.8±0.44µs        ? ?/sec
sort f32 nulls 2^12                                     1.01     29.2±0.11µs        ? ?/sec    1.00     28.9±0.09µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     39.8±0.09µs        ? ?/sec    1.37     54.4±0.20µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.0±0.25µs        ? ?/sec    1.05     75.9±0.25µs        ? ?/sec
sort i32 2^10                                           1.00      7.3±0.01µs        ? ?/sec    1.00      7.3±0.02µs        ? ?/sec
sort i32 2^12                                           1.00     35.6±0.16µs        ? ?/sec    1.01     35.8±0.11µs        ? ?/sec
sort i32 nulls 2^10                                     1.01      4.8±0.01µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     20.3±0.06µs        ? ?/sec    1.00     20.3±0.04µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.03      8.1±0.08µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     34.8±0.08µs        ? ?/sec    1.26     43.9±0.16µs        ? ?/sec
sort i32 to indices 2^10                                1.13     12.9±0.02µs        ? ?/sec    1.00     11.4±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.3±0.40µs        ? ?/sec    1.00     55.1±0.69µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.4±0.01µs        ? ?/sec    1.12      7.2±0.02µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.4±0.02µs        ? ?/sec    1.06      8.9±0.01µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00    151.4±0.36µs        ? ?/sec    1.11    167.9±0.28µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    332.4±0.74µs        ? ?/sec    1.00    332.1±1.10µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00    120.3±2.63µs        ? ?/sec    1.17    141.2±0.33µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    260.1±0.49µs        ? ?/sec    1.01    263.0±0.59µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    131.6±0.44µs        ? ?/sec    1.13    148.6±0.52µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    283.5±0.83µs        ? ?/sec    1.00    282.6±0.80µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00    123.1±0.49µs        ? ?/sec    1.14    140.0±0.45µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    250.5±0.86µs        ? ?/sec    1.01    252.6±1.11µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00    121.6±0.39µs        ? ?/sec    1.11    135.5±0.29µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    248.3±0.94µs        ? ?/sec    1.00    248.9±0.61µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    154.7±0.32µs        ? ?/sec    1.13    174.3±0.26µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    317.6±0.53µs        ? ?/sec    1.01    321.4±1.01µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00    119.8±0.36µs        ? ?/sec    1.14    136.8±0.37µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    247.5±0.69µs        ? ?/sec    1.01    249.2±0.60µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     63.9±0.17µs        ? ?/sec    1.25     80.1±0.15µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    133.8±0.27µs        ? ?/sec    1.00    134.4±0.27µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     46.8±0.42µs        ? ?/sec    1.31     61.3±0.34µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.2±0.27µs        ? ?/sec    1.00    104.3±0.26µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     44.6±0.32µs        ? ?/sec    1.31     58.4±0.41µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     94.6±0.39µs        ? ?/sec    1.01     95.4±0.40µs        ? ?/sec

@zhuqi-lucas
Contributor Author

It looks like there is one regression in the benchmark:

sort i32 to indices 2^10                                1.13     12.9±0.02µs        ? ?/sec    1.00     11.4±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.3±0.15µs        ? ?/sec    1.00     55.0±0.15µs        ? ?/sec

But I can't reproduce it locally, and the code does not change anything for the non-null cases.
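
A minimal sketch of why the non-null case would be unaffected, assuming the partition step returns early when the null count is zero. The `Validity` struct and this `partition_validity` signature are hypothetical stand-ins for illustration, not the arrow-rs code, and the per-bit loop is only a placeholder for whichever bitmap scan is actually used:

```rust
/// Hypothetical stand-in for a nullable array: a length, a null count, and a
/// validity bitmap packed LSB-first into u64 words (bit set = valid).
struct Validity<'a> {
    len: usize,
    null_count: usize,
    words: &'a [u64],
}

/// Split row indices into (valid, null).
fn partition_validity(v: &Validity) -> (Vec<u32>, Vec<u32>) {
    // No nulls: every index is valid and the bitmap is never read, so a
    // change to the bitmap scan cannot alter the work done in this branch.
    if v.null_count == 0 {
        return ((0..v.len as u32).collect(), Vec::new());
    }

    let mut valid = Vec::with_capacity(v.len - v.null_count);
    let mut nulls = Vec::with_capacity(v.null_count);
    for i in 0..v.len {
        // Placeholder scan: test one bit at a time.
        let is_valid = (v.words[i / 64] >> (i % 64)) & 1 == 1;
        if is_valid {
            valid.push(i as u32);
        } else {
            nulls.push(i as u32);
        }
    }
    (valid, nulls)
}
```

Under that assumption, a "to indices" benchmark without nulls only exercises the early return, so a difference there would have to come from something other than the scan itself (code layout, machine noise, and so on).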

@alamb
Contributor

alamb commented Jul 23, 2025

It looks like there is one regression in the benchmark:

sort i32 to indices 2^10                                1.13     12.9±0.02µs        ? ?/sec    1.00     11.4±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.3±0.15µs        ? ?/sec    1.00     55.0±0.15µs        ? ?/sec

But I can't reproduce it locally, and the code does not change anything for the non-null cases.

I'll rerun -- it can potentially be something related to the test machine

@alamb
Contributor

alamb commented Jul 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_path_for_bit_map_scan (3d6bcea) to 82821e5 diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_path_for_bit_map_scan
Results will be posted here when complete

@alamb
Contributor

alamb commented Jul 23, 2025

🤖: Benchmark completed


group                                                   fast_path_for_bit_map_scan             main
-----                                                   --------------------------             ----
lexsort (bool, bool) 2^12                               1.00    116.4±0.59µs        ? ?/sec    1.00    116.3±0.58µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    156.5±0.59µs        ? ?/sec    1.01    157.6±0.78µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.01     45.1±0.17µs        ? ?/sec    1.00     44.7±0.25µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.01    212.1±0.40µs        ? ?/sec    1.00    209.9±1.22µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.7±0.13µs        ? ?/sec    1.00     38.9±0.21µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.4±0.09µs        ? ?/sec    1.00     41.2±0.26µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.01     79.5±0.14µs        ? ?/sec    1.00     78.9±0.13µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    211.6±0.46µs        ? ?/sec    1.00    211.6±1.08µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     52.8±0.16µs        ? ?/sec    1.00     52.2±0.29µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.01    247.2±0.54µs        ? ?/sec    1.00    245.1±1.22µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     85.3±0.17µs        ? ?/sec    1.00     85.0±0.71µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.01     86.9±2.51µs        ? ?/sec    1.00     85.7±0.59µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.01     95.8±0.32µs        ? ?/sec    1.00     95.3±0.70µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.01    247.7±0.75µs        ? ?/sec    1.00    245.2±1.85µs        ? ?/sec
rank f32 2^12                                           1.06     72.5±0.31µs        ? ?/sec    1.00     68.4±0.72µs        ? ?/sec
rank f32 nulls 2^12                                     1.06     37.5±0.07µs        ? ?/sec    1.00     35.3±0.48µs        ? ?/sec
rank string[10] 2^12                                    1.00    253.1±0.32µs        ? ?/sec    1.01    254.9±0.53µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    122.6±1.16µs        ? ?/sec    1.00    123.0±0.38µs        ? ?/sec
sort f32 2^12                                           1.01     60.6±0.65µs        ? ?/sec    1.00     60.2±0.52µs        ? ?/sec
sort f32 nulls 2^12                                     1.01     29.2±0.11µs        ? ?/sec    1.00     28.9±0.11µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     39.7±0.10µs        ? ?/sec    1.37     54.5±0.17µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.3±0.29µs        ? ?/sec    1.05     76.1±0.21µs        ? ?/sec
sort i32 2^10                                           1.00      7.3±0.02µs        ? ?/sec    1.00      7.3±0.01µs        ? ?/sec
sort i32 2^12                                           1.00     35.6±0.13µs        ? ?/sec    1.01     35.9±0.10µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.8±0.02µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.01     20.3±0.06µs        ? ?/sec    1.00     20.2±0.08µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.03      8.1±0.02µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     34.8±0.09µs        ? ?/sec    1.26     44.0±0.29µs        ? ?/sec
sort i32 to indices 2^10                                1.13     12.9±0.02µs        ? ?/sec    1.00     11.4±0.03µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.1±0.25µs        ? ?/sec    1.00     54.9±0.13µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.3±0.01µs        ? ?/sec    1.11      7.0±0.03µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.5±0.04µs        ? ?/sec    1.05      8.9±0.07µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00    152.3±0.46µs        ? ?/sec    1.11    168.5±0.55µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    331.9±0.78µs        ? ?/sec    1.00    333.0±1.27µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00    119.3±0.44µs        ? ?/sec    1.18    140.4±0.33µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    260.0±0.55µs        ? ?/sec    1.01    262.2±0.56µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    131.3±0.40µs        ? ?/sec    1.13    148.6±0.36µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.00    282.6±1.01µs        ? ?/sec    1.01    286.3±2.61µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00    123.4±0.41µs        ? ?/sec    1.12    138.6±0.58µs        ? ?/sec
sort string[1000] to indices 2^12                       1.00    251.0±1.96µs        ? ?/sec    1.00    251.9±2.17µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00    121.3±0.50µs        ? ?/sec    1.10    134.0±0.24µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    248.8±0.64µs        ? ?/sec    1.00    249.9±0.99µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    154.6±0.39µs        ? ?/sec    1.12    173.8±0.47µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    318.3±0.86µs        ? ?/sec    1.01    321.4±0.67µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00    120.3±0.33µs        ? ?/sec    1.14    136.7±0.26µs        ? ?/sec
sort string[10] to indices 2^12                         1.01    248.1±0.44µs        ? ?/sec    1.00    244.9±0.78µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     63.8±0.16µs        ? ?/sec    1.26     80.2±0.21µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    133.6±0.18µs        ? ?/sec    1.01    134.4±0.23µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     46.7±0.33µs        ? ?/sec    1.32     61.5±0.36µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.2±0.26µs        ? ?/sec    1.00    104.1±0.24µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     44.6±0.34µs        ? ?/sec    1.31     58.6±0.39µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     94.6±0.61µs        ? ?/sec    1.01     95.5±0.53µs        ? ?/sec

@alamb
Contributor

alamb commented Jul 23, 2025

sort i32 to indices 2^10                                1.13     12.9±0.02µs        ? ?/sec    1.00     11.4±0.03µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.1±0.25µs        ? ?/sec    1.00     54.9±0.13µs        ? ?/sec

Weird -- we get the same results. Maybe there is more code or something now so cache effects come into play on the test machine 🤔 I am inclined to go with this PR to get the improvements in general
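
For reference, the 1.13 and 1.15 figures in the rows quoted above are consistent with each branch's time divided by the fastest time in that row. A minimal sketch of that arithmetic (`relative_to_fastest` is a hypothetical helper, not part of critcmp):

```rust
/// Divide every time by the smallest one, so the fastest entry reads 1.00.
fn relative_to_fastest(times_us: &[f64]) -> Vec<f64> {
    let fastest = times_us.iter().copied().fold(f64::INFINITY, f64::min);
    times_us.iter().map(|t| t / fastest).collect()
}
```

With the 2^12 row, 63.1 µs against 54.9 µs gives roughly 1.15 and 1.00; with the 2^10 row, 12.9 µs against 11.4 µs gives roughly 1.13 and 1.00.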

@zhuqi-lucas
Contributor Author

sort i32 to indices 2^10                                1.13     12.9±0.02µs        ? ?/sec    1.00     11.4±0.03µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.1±0.25µs        ? ?/sec    1.00     54.9±0.13µs        ? ?/sec

Weird -- we get the same results. Maybe there is more code or something now so cache effects come into play on the test machine 🤔 I am inclined to go with this PR to get the improvements in general

Thank you @alamb. I switched to a Linux machine to try to reproduce it, but I can't reproduce the regression on either Mac or Linux:

sort i32 to indices 2^10
                        time:   [12.857 µs 12.871 µs 12.886 µs]
                        change: [−0.8566% −0.6047% −0.3347%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

sort i32 to indices 2^12
                        time:   [57.662 µs 57.817 µs 58.056 µs]
                        change: [−3.9800% −3.5589% −3.1125%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

@alamb
Contributor

alamb commented Jul 23, 2025

Maybe we need to crank up the test somehow -- trying to measure changes in usec may be too subject to noise 🤔

@zhuqi-lucas
Contributor Author

sort i32 to indices 2^10

Good point @alamb. I tried increasing the length of the i32 array, and there is still no regression for this PR:

sort i32 to indices 2^16
                        time:   [565.57 µs 566.31 µs 567.12 µs]
                        change: [−0.0297% +0.2001% +0.4033%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

sort i32 to indices 2^18
                        time:   [2.7443 ms 2.7497 ms 2.7554 ms]
                        change: [+0.1844% +0.4567% +0.7176%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

@alamb alamb changed the title from "Perf: Support partition_validity to use fast path for bit map scan" to "Perf: improve sort via partition_validity to use fast path for bit map scan (up to 30% faster)" on Jul 24, 2025
Contributor

@alamb alamb left a comment

Thank you @zhuqi-lucas

@jhorstmann / @Dandandan I wonder what your opinions on merging this PR are

}
}

impl<'a> Iterator for BitIndexU32Iterator<'a> {
Contributor

The conclusion from #7979 is that the u32-specific iterator is worth a 3-5% improvement: #7979 (comment)

Given that, this PR makes sense to me
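
For context, a minimal sketch of what a u32-yielding set-bit iterator with word-level (u64) scanning could look like. `SetBitsU32` is a hypothetical name and not the `BitIndexU32Iterator` from this PR; it assumes the bitmap is already packed LSB-first into `u64` words and that any padding bits past the logical length are zero:

```rust
/// Hypothetical iterator over the positions of set bits, yielded as u32.
struct SetBitsU32<'a> {
    words: &'a [u64],
    next_word: usize, // index of the next word to load
    current: u64,     // bits of the loaded word that are not yet yielded
    base: u32,        // bit index corresponding to bit 0 of `current`
}

impl<'a> SetBitsU32<'a> {
    fn new(words: &'a [u64]) -> Self {
        Self { words, next_word: 0, current: 0, base: 0 }
    }
}

impl<'a> Iterator for SetBitsU32<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        // Skip 64 bits at a time while the loaded word has nothing left.
        while self.current == 0 {
            let word = *self.words.get(self.next_word)?;
            self.base = (self.next_word as u32) * 64;
            self.next_word += 1;
            self.current = word;
        }
        // Position of the lowest set bit, then clear it for the next call.
        let bit = self.current.trailing_zeros();
        self.current &= self.current - 1;
        Some(self.base + bit)
    }
}
```

The work per call is proportional to the number of set bits plus the number of words, rather than one test per bit, and the indices come out as `u32` with no per-element `usize` cast at the call site.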

@alamb
Contributor

alamb commented Jul 28, 2025

I plan to merge this PR tomorrow unless anyone else would like additional time to review

@alamb alamb merged commit 625e6ee into apache:main Jul 29, 2025
26 checks passed
@alamb
Contributor

alamb commented Jul 29, 2025

Thanks again @zhuqi-lucas !

Labels
arrow Changes to the arrow crate
4 participants