Perf: improve sort via partition_validity to use fast path for bit map scan (up to 30% faster) #7962

Conversation
Performance result:

```
critcmp --filter "nulls to indices" fast_path_for_bit_map_scan main
group                                                 fast_path_for_bit_map_scan     main
-----                                                 --------------------------     ----
sort f32 nulls to indices 2^12                        1.00   17.3±0.12µs ? ?/sec     1.21   20.8±0.17µs ? ?/sec
sort i32 nulls to indices 2^10                        1.00    3.6±0.05µs ? ?/sec     1.28    4.6±0.03µs ? ?/sec
sort i32 nulls to indices 2^12                        1.00   14.9±0.10µs ? ?/sec     1.24   18.5±0.14µs ? ?/sec
sort string[0-100] nulls to indices 2^12              1.00   54.9±0.42µs ? ?/sec     1.08   59.4±1.11µs ? ?/sec
sort string[0-10] nulls to indices 2^12               1.00   71.8±4.21µs ? ?/sec     1.03   74.0±2.10µs ? ?/sec
sort string[0-400] nulls to indices 2^12              1.00   54.9±0.54µs ? ?/sec     1.06   58.4±0.50µs ? ?/sec
sort string[1000] nulls to indices 2^12               1.00   53.8±0.42µs ? ?/sec     1.07   57.7±0.43µs ? ?/sec
sort string[100] nulls to indices 2^12                1.00   53.1±0.55µs ? ?/sec     1.07   56.7±0.56µs ? ?/sec
sort string[10] dict nulls to indices 2^12            1.00   66.9±0.50µs ? ?/sec     1.06   70.7±0.62µs ? ?/sec
sort string[10] nulls to indices 2^12                 1.00   50.7±0.46µs ? ?/sec     1.07   54.3±0.48µs ? ?/sec
sort string_view[0-400] nulls to indices 2^12         1.00   27.1±0.23µs ? ?/sec     1.14   30.9±0.30µs ? ?/sec
sort string_view[10] nulls to indices 2^12            1.00   24.0±0.22µs ? ?/sec     1.16   27.8±0.23µs ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12  1.00   23.4±0.18µs ? ?/sec     1.15   27.0±0.20µs ? ?/sec
```
Force-pushed from dbf747c to c7a0ae9 (compare).
Thank you @zhuqi-lucas -- those are very impressive benchmark results.
I think for code like this we should add some more testing to cover all the corner cases -- I left some suggestions. Let me know what you think.
```rust
/// Scans the null bitmap and partitions valid/null indices efficiently.
/// Uses bit-level operations to extract bit positions.
/// This function is only called when nulls exist.
#[inline(always)]
```
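For context, the core idea behind this fast path can be sketched in plain Rust: process the validity bitmap one u64 word at a time and use `trailing_zeros` to jump directly between set bits, instead of branching on every bit. This is an illustrative stand-in, not the PR's actual code; `partition_valid_null` is a hypothetical name, and the sketch assumes a byte-aligned bitmap whose bit length is a multiple of 64.

```rust
// Hypothetical sketch of word-level bitmap partitioning. Assumes a
// byte-aligned bitmap whose bit length is a multiple of 64.
fn partition_valid_null(bitmap: &[u8]) -> (Vec<u32>, Vec<u32>) {
    let mut valid = Vec::new();
    let mut nulls = Vec::new();
    for (word_idx, chunk) in bitmap.chunks_exact(8).enumerate() {
        let base = (word_idx * 64) as u32;
        // Convert 8 bytes into a u64 word in little-endian order.
        let w = u64::from_le_bytes(chunk.try_into().unwrap());
        // Set bits are valid rows: jump between them via trailing_zeros.
        let mut v = w;
        while v != 0 {
            valid.push(base + v.trailing_zeros());
            v &= v - 1; // clear the lowest set bit
        }
        // Zero bits are nulls: invert the word and repeat.
        let mut z = !w;
        while z != 0 {
            nulls.push(base + z.trailing_zeros());
            z &= z - 1;
        }
    }
    (valid, nulls)
}

fn main() {
    // 64 bits of alternating 0/1: odd positions valid, even positions null.
    let (valid, nulls) = partition_valid_null(&[0b1010_1010; 8]);
    assert_eq!(valid.len(), 32);
    assert_eq!(nulls.len(), 32);
    assert_eq!(&valid[..4], &[1, 3, 5, 7]);
    assert_eq!(&nulls[..4], &[0, 2, 4, 6]);
    println!("partitioned {} valid / {} null", valid.len(), nulls.len());
}
```

The inner loops only iterate once per set bit, which is where the speedup over a branch-per-bit scan comes from when nulls are sparse or dense.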
this does look quite cool @zhuqi-lucas - if we want to go with this approach, I think we should add some more testing for this -- perhaps via a fuzz test that:
- Calls partition_validity_scan with random Boolean arrays
- Computes the expected results using a different algorithm (perhaps the original partition)
- Compares the results

We should also make sure that we test that this works when slicing the arrays (`array.slice(3, array.len() - 3)`, for example).
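A sketch of the fuzz loop described above, under the assumption that we compare a word-at-a-time partition against a naive per-bit loop. Both helpers here are simplified stand-ins written for this example (not the PR's actual functions), and the PRNG is a minimal xorshift so the sketch needs no external crates:

```rust
// Naive reference: test every bit with a branch.
fn naive_partition(bitmap: &[u8], len: usize) -> (Vec<u32>, Vec<u32>) {
    let (mut valid, mut nulls) = (Vec::new(), Vec::new());
    for i in 0..len {
        if bitmap[i / 8] & (1 << (i % 8)) != 0 {
            valid.push(i as u32);
        } else {
            nulls.push(i as u32);
        }
    }
    (valid, nulls)
}

// Fast path under test: scan 64 bits at a time via trailing_zeros.
fn word_partition(bitmap: &[u8], len: usize) -> (Vec<u32>, Vec<u32>) {
    let (mut valid, mut nulls) = (Vec::new(), Vec::new());
    let mut i = 0usize;
    while i < len {
        let bits = (len - i).min(64);
        // Assemble the next (up to) 64 bits into a word; `i` stays
        // byte-aligned because we always advance by 64.
        let start = i / 8;
        let end = ((i + bits + 7) / 8).min(bitmap.len());
        let mut w = 0u64;
        for (k, &byte) in bitmap[start..end].iter().enumerate() {
            w |= (byte as u64) << (8 * k);
        }
        let mut z = !w;
        if bits < 64 {
            let mask = (1u64 << bits) - 1;
            w &= mask;
            z &= mask;
        }
        let mut v = w;
        while v != 0 {
            valid.push(i as u32 + v.trailing_zeros());
            v &= v - 1; // clear the lowest set bit
        }
        while z != 0 {
            nulls.push(i as u32 + z.trailing_zeros());
            z &= z - 1;
        }
        i += 64;
    }
    (valid, nulls)
}

// Minimal xorshift PRNG so the fuzz loop needs no external crates.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    for _ in 0..200 {
        let len = (xorshift(&mut seed) % 300) as usize + 1;
        let bitmap: Vec<u8> = (0..(len + 7) / 8)
            .map(|_| xorshift(&mut seed) as u8)
            .collect();
        assert_eq!(word_partition(&bitmap, len), naive_partition(&bitmap, len));
    }
    println!("200 random bitmaps: fast path matches naive partition");
}
```

The real test would call the PR's function and the original partition instead of these stand-ins, and would additionally fuzz over slice offsets as discussed below.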
Thank you @alamb for the review; I added tests in the latest PR, and I am also using our existing logic to do this, so it will be safer.
arrow-ord/src/sort.rs (Outdated)

```rust
// Convert 8 bytes into a u64 word in little-endian order
let w = u64::from_le_bytes(buffer[start..start + 8].try_into().unwrap());

// Iterate over each set bit in `z` (null mask) using a bit-parallel
```
Isn't this the same as set_indices? Can we reuse it?
Thank you @Dandandan for the review and the good suggestion. I am using set_indices_32 in the latest PR because we need a Vec of u32 here; it is similar to set_indices, and I found that using set_indices_32 gives a 7% performance improvement compared to using set_indices and then converting to u32.
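For illustration, the difference being measured is a collector that writes u32 indices directly versus collecting usize and converting in a second pass; the extra map over the whole index vector is what the 7% covers. These helpers are simplified stand-ins, not arrow-rs's actual set_indices implementation:

```rust
// Illustrative stand-ins: collect set-bit positions directly as u32,
// versus collecting usize and converting afterwards.
fn set_indices_u32(bitmap: &[u8], len: usize) -> Vec<u32> {
    (0..len)
        .filter(|&i| bitmap[i / 8] & (1 << (i % 8)) != 0)
        .map(|i| i as u32)
        .collect()
}

fn set_indices_usize(bitmap: &[u8], len: usize) -> Vec<usize> {
    (0..len)
        .filter(|&i| bitmap[i / 8] & (1 << (i % 8)) != 0)
        .collect()
}

fn main() {
    let bitmap = [0b1100_0101u8, 0b0000_0010];
    // Direct u32 collection...
    let direct = set_indices_u32(&bitmap, 16);
    // ...versus the usize-then-convert path described in the comment,
    // which walks the resulting vector a second time.
    let converted: Vec<u32> = set_indices_usize(&bitmap, 16)
        .into_iter()
        .map(|i| i as u32)
        .collect();
    assert_eq!(direct, converted);
    assert_eq!(direct, vec![0, 2, 6, 7, 9]);
    println!("indices: {:?}", direct);
}
```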
arrow-ord/src/sort.rs (Outdated)

```rust
let n_words = buffer.len() / 8; // number of 64-bit chunks in bitmap

// Process the bitmap in 64-bit (8 byte) chunks for efficiency
for word_idx in 0..n_words {
```
I think this needs to use UnalignedBitChunk to deal with arbitrarily sliced arrays.
Fascinating results! I did not expect this to be faster since we still have to visit each bit once. Must be related to branch prediction, or just writing to one of the slices at a time.
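To illustrate why UnalignedBitChunk (or equivalent handling) is needed: after `array.slice(offset, len)` the validity bits start at an arbitrary bit offset, so a word-at-a-time scan has to stitch each 64-bit window together from two adjacent loads. A standalone sketch of that stitching (not arrow-rs's actual implementation; `read_u64_at_bit` is a name made up for this example):

```rust
// Read 64 bits starting at an arbitrary bit offset by combining the
// shifted current load with spill-over bits from the following byte.
fn read_u64_at_bit(bitmap: &[u8], bit_offset: usize) -> u64 {
    let byte = bit_offset / 8;
    let shift = (bit_offset % 8) as u32;
    // Load up to 8 bytes, zero-padding past the end of the buffer.
    let mut buf = [0u8; 8];
    let n = bitmap.len().saturating_sub(byte).min(8);
    buf[..n].copy_from_slice(&bitmap[byte..byte + n]);
    let mut w = u64::from_le_bytes(buf) >> shift;
    if shift != 0 && byte + 8 < bitmap.len() {
        // The top `shift` bits of the window come from the next byte.
        w |= (bitmap[byte + 8] as u64) << (64 - shift);
    }
    w
}

fn main() {
    // Viewing the same bitmap from bit 4: only the upper 4 bits of the
    // first byte remain in view, shifted down to the bottom of the word.
    let bitmap = [0xFFu8, 0, 0, 0, 0, 0, 0, 0, 0];
    assert_eq!(read_u64_at_bit(&bitmap, 0), 0xFF);
    assert_eq!(read_u64_at_bit(&bitmap, 4), 0x0F);
    println!("offset reads ok");
}
```

A loop that only iterates `buffer.len() / 8` byte-aligned words would silently ignore this offset, which is why sliced-array tests matter here.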
Thank you @jhorstmann for the review; I switched to using set_indices in the latest PR. It should be safer and clearer.
Thank you @alamb @Dandandan @jhorstmann for the reviews. I addressed the comments in the latest PR and also added rich tests. Thanks!
Latest result for the new implementation: it has a small regression, but is still a promising result, and the code is more readable and clear:

```
critcmp --filter "nulls to indices" fast_path_for_bit_map_scan main
group                                                 fast_path_for_bit_map_scan     main
-----                                                 --------------------------     ----
sort f32 nulls to indices 2^12                        1.00   17.7±0.16µs ? ?/sec     1.18   20.9±0.18µs ? ?/sec
sort i32 nulls to indices 2^10                        1.00    3.8±0.03µs ? ?/sec     1.20    4.5±0.04µs ? ?/sec
sort i32 nulls to indices 2^12                        1.00   15.3±0.18µs ? ?/sec     1.20   18.4±0.14µs ? ?/sec
sort string[0-100] nulls to indices 2^12              1.00   55.7±0.85µs ? ?/sec     1.06   59.0±0.77µs ? ?/sec
sort string[0-10] nulls to indices 2^12               1.00   70.3±2.88µs ? ?/sec     1.04   73.2±1.74µs ? ?/sec
sort string[0-400] nulls to indices 2^12              1.00   55.5±0.48µs ? ?/sec     1.05   58.1±0.68µs ? ?/sec
sort string[1000] nulls to indices 2^12               1.00   54.7±0.59µs ? ?/sec     1.05   57.5±0.62µs ? ?/sec
sort string[100] nulls to indices 2^12                1.00   54.0±0.64µs ? ?/sec     1.05   56.6±0.49µs ? ?/sec
sort string[10] dict nulls to indices 2^12            1.00   68.2±0.69µs ? ?/sec     1.05   71.5±0.76µs ? ?/sec
sort string[10] nulls to indices 2^12                 1.00   51.5±0.43µs ? ?/sec     1.05   54.4±0.59µs ? ?/sec
sort string_view[0-400] nulls to indices 2^12         1.00   27.9±0.27µs ? ?/sec     1.11   30.9±0.21µs ? ?/sec
sort string_view[10] nulls to indices 2^12            1.00   24.8±0.25µs ? ?/sec     1.11   27.6±0.26µs ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12  1.00   24.2±0.22µs ? ?/sec     1.12   27.1±0.20µs ? ?/sec
```
```
@@ -323,4 +380,110 @@ mod tests {
        let mask = &[223, 23];
        BitIterator::new(mask, 17, 0);
    }

    #[test]
    fn test_bit_index_u32_iterator_basic() {
```
These tests make sure the u32 index results are the same as the original usize results.
```
    }

    #[test]
    fn fuzz_partition_validity() {
```
These tests compare results to make sure our partition-null logic behaves the same as before.
🤖: Benchmark completed (details)
Looks like some pretty good improvement to me. Thank you @zhuqi-lucas. I need to spend some time studying this code in detail for correctness, and I hope to do so tomorrow.
Thank you @alamb!
```rust
    }
}

impl<'a> Iterator for BitIndexU32Iterator<'a> {
```
Is this implementation somehow more performant than using the existing BitIndexIterator and casting its items to u32? The only difference I see is in the masking of the lowest bit, `^= 1 << bit_pos` vs `&= self.curr - 1`, but I think llvm would know that those are equivalent. If it makes a difference, then we should adjust BitIndexIterator the same way.
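The equivalence claimed here is easy to check directly: when `bit_pos` is the index of the lowest set bit, XOR-ing with `1 << bit_pos` and AND-ing with `curr - 1` both clear exactly that bit. A quick standalone check:

```rust
fn main() {
    // For any nonzero word, clearing the lowest set bit via the XOR form
    // (`^= 1 << bit_pos`) or the AND form (`&= curr - 1`) is identical,
    // because bit_pos = trailing_zeros() names exactly that bit.
    for w in [1u64, 0b1011_0100, 0xAAAA, 0x8000_0000_0000_0000, u64::MAX] {
        let bit_pos = w.trailing_zeros();
        let xor_form = w ^ (1u64 << bit_pos);
        let and_form = w & (w - 1);
        assert_eq!(xor_form, and_form);
    }
    println!("both idioms clear the lowest set bit identically");
}
```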
Thank you @jhorstmann, good question. It actually is more performant than using the existing BitIndexIterator because we cast directly to u32. BitIndexIterator casts to usize, so when we use it we have to cast from usize to u32 afterwards, and in my testing that caused the slowness.
As for `^= 1 << bit_pos` vs `&= self.curr - 1`, the performance is almost the same; it does not show a difference, so I can use either of them.
I think I may change this to a macro, so it will look clearer.
I will do a test to compare the performance too
Update: made #7979 and I queued up benchmark runs
Conclusion from #7979 is that the u32 specific iterator is worth a 3-5% improvement: #7979 (comment)
Given that I think this PR makes sense to me
arrow-ord/src/sort.rs (Outdated)

```rust
// also test a sliced view
if len >= 4 {
    let slice = array.slice(2, len - 4);
```
I recommend picking a random slice, not always slicing by 2. Specifically, I think it is important to ensure we slice some arrays by more than 64, so that the scan has to skip an entire 64-bit word in addition to having an offset.
Thank you @alamb for the good suggestion! I will address it.
Addressed it in the latest PR.
🤖: Benchmark completed (details)
🤖: Benchmark completed (details)
It looks like there is one regression in the benchmark:

```
sort i32 to indices 2^10    1.13   12.9±0.02µs ? ?/sec    1.00   11.4±0.02µs ? ?/sec
sort i32 to indices 2^12    1.15   63.3±0.15µs ? ?/sec    1.00   55.0±0.15µs ? ?/sec
```

But I can't reproduce it locally, and the code does not change anything for the non-null cases.
I'll rerun -- it can potentially be something related to the test machine |
🤖: Benchmark completed (details)
Weird -- we get the same results. Maybe there is more code now, so cache effects come into play on the test machine 🤔 I am inclined to go with this PR to get the improvements in general.
Thank you @alamb. I switched to a Linux machine to try to reproduce it as well, but I cannot reproduce it on either Mac or Linux:

```
sort i32 to indices 2^10
                        time:   [12.857 µs 12.871 µs 12.886 µs]
                        change: [−0.8566% −0.6047% −0.3347%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

sort i32 to indices 2^12
                        time:   [57.662 µs 57.817 µs 58.056 µs]
                        change: [−3.9800% −3.5589% −3.1125%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
```
Maybe we need to crank up the test somehow -- trying to measure changes in
Good point @alamb. I tried increasing the length of the i32 arrays; there is still no regression for this PR:

```
sort i32 to indices 2^16
                        time:   [565.57 µs 566.31 µs 567.12 µs]
                        change: [−0.0297% +0.2001% +0.4033%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

sort i32 to indices 2^18
                        time:   [2.7443 ms 2.7497 ms 2.7554 ms]
                        change: [+0.1844% +0.4567% +0.7176%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outlier among 100 measurements (1.00%)
  1 (1.00%) high mild
```
Thank you @zhuqi-lucas.
@jhorstmann / @Dandandan, I wonder what your opinions on merging this PR are.
I plan to merge this PR tomorrow unless anyone else would like additional time to review.
Thanks again @zhuqi-lucas!
Which issue does this PR close?
This PR is a follow-up for #7937.
I want to experiment with the performance of using word-level (u64) bit scanning.
Details: #7937 (review)

Rationale for this change
Using word-level (u64) bit scanning.
We use set_indices to implement this, but we need a u32 index, so I also added set_indices_u32; it shows a 7% performance improvement compared to using set_indices and then casting to u32.

What changes are included in this PR?
Using word-level (u64) bit scanning.
We use set_indices to implement this, but we need a u32 index, so I also added set_indices_u32; it shows a 7% performance improvement compared to using set_indices and then casting to u32.

Are these changes tested?
Yes, unit tests and fuzz testing were added, and the existing sort fuzz test coverage also applies.

Are there any user-facing changes?
No