@mr-brobot mr-brobot commented Oct 4, 2025

Which issue does this PR close?

Rationale for this change

Parquet types are a subset of Arrow types, so the Arrow writer must coerce some Arrow types to Parquet types. In some cases, this changes the physical representation of the value. As a result, passing Arrow data directly to Sbbf::check can produce false negatives for the affected types. Correctness is only guaranteed when checking with the coerced Parquet value.

This issue affects some integer and decimal types. It can also affect Date64.
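
To illustrate the mismatch, here is a minimal sketch that does not use the parquet crate. `ToyFilter` is a hypothetical stand-in that stores the exact bytes it is given (a real Sbbf hashes them instead), and the sketch assumes an Int8 column is written as 4-byte little-endian INT32 values. Checking with the 1-byte Arrow value misses, while checking with the coerced 4-byte value hits.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a bloom filter: remembers the exact bytes inserted.
/// A real Sbbf hashes the bytes, but a byte-level mismatch has the same effect.
struct ToyFilter {
    seen: HashSet<Vec<u8>>,
}

impl ToyFilter {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    fn insert(&mut self, bytes: &[u8]) {
        self.seen.insert(bytes.to_vec());
    }

    fn check(&self, bytes: &[u8]) -> bool {
        self.seen.contains(bytes)
    }
}

/// Populate the filter the way a writer that coerces Int8 to INT32 would:
/// each i8 is widened to i32 before its bytes are inserted (assumption).
fn filter_from_i8_column(values: &[i8]) -> ToyFilter {
    let mut filter = ToyFilter::new();
    for &v in values {
        filter.insert(&i32::from(v).to_le_bytes());
    }
    filter
}

/// Probe with the raw Arrow representation (1 byte): misses a present value.
fn check_raw(filter: &ToyFilter, value: i8) -> bool {
    filter.check(&value.to_le_bytes())
}

/// Probe with the coerced Parquet representation (4 bytes): finds it.
fn check_coerced(filter: &ToyFilter, value: i8) -> bool {
    filter.check(&i32::from(value).to_le_bytes())
}
```

With a filter built from `[5, -3]`, `check_raw(&filter, 5)` returns `false` even though 5 was written, whereas `check_coerced(&filter, 5)` returns `true`.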

What changes are included in this PR?

Introduces ArrowSbbf as an Arrow-aware interface to the Parquet Sbbf. This coerces incoming data if necessary and calls Sbbf::check.

Currently, Date64 types can be written as either INT32 (days since epoch) or INT64 (milliseconds since epoch), depending on Arrow writer properties (coerce_types). Instead of requiring additional information to handle this special (non-default) case, this implementation instructs users to coerce Date64 to Date32 if the Parquet column type is INT32. I'm open to feedback on this decision.
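
As a rough sketch of the Date64 case, the conversion a caller would apply before checking an INT32 column can be written as below. The expression follows the PR's doc example; `PhysicalType` and `date64_probe_bytes` are hypothetical names used only for illustration, and little-endian byte order is an assumption.

```rust
/// Milliseconds in one day, matching the constant name used in the PR's doc example.
const MILLISECONDS_IN_DAY: i64 = 86_400_000;

/// Hypothetical marker for the Parquet physical type of the column being checked.
#[derive(Clone, Copy)]
enum PhysicalType {
    Int32,
    Int64,
}

/// Convert a Date64 value (milliseconds since epoch) to a Date32 value
/// (days since epoch). Integer division truncates toward zero, so a
/// timestamp that is not a whole number of days loses its time-of-day part.
fn date64_to_date32(date64_value: i64) -> i32 {
    (date64_value / MILLISECONDS_IN_DAY) as i32
}

/// Produce the bytes to probe the filter with, depending on how the
/// Date64 column was physically written (assumed little-endian).
fn date64_probe_bytes(date64_value: i64, physical: PhysicalType) -> Vec<u8> {
    match physical {
        // Column was coerced to days since epoch: probe with 4 bytes.
        PhysicalType::Int32 => date64_to_date32(date64_value).to_le_bytes().to_vec(),
        // Column kept milliseconds since epoch: probe with 8 bytes.
        PhysicalType::Int64 => date64_value.to_le_bytes().to_vec(),
    }
}
```

The two branches produce probes of different widths (4 versus 8 bytes), which is why the caller needs to know the column's physical type before checking.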

Are these changes tested?

There are tests for integer, float, decimal, and date types. The tests are not exhaustive, but they cover all cases where coercion is necessary.

Are there any user-facing changes?

There is a new ArrowSbbf struct that most Arrow users should prefer over using Sbbf directly. Also, the Sized constraint was relaxed on the Sbbf::check function to support slices. This is consistent with Sbbf::insert.
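
The effect of relaxing the Sized bound can be sketched in isolation. The `AsBytes` trait below is a hypothetical stand-in with the same name as the parquet crate's trait, and the two functions are illustrative only. A generic parameter is implicitly `Sized`, so a function declared with `T: AsBytes` cannot accept `&[u8]` (where `T` would be the unsized type `[u8]`), while `T: AsBytes + ?Sized` can.

```rust
/// Hypothetical stand-in for a trait that exposes a value's bytes.
trait AsBytes {
    fn as_bytes(&self) -> &[u8];
}

/// Unsized type: a byte slice is its own byte representation.
impl AsBytes for [u8] {
    fn as_bytes(&self) -> &[u8] {
        self
    }
}

/// Sized type: a Vec<u8> exposes its contents as a slice.
impl AsBytes for Vec<u8> {
    fn as_bytes(&self) -> &[u8] {
        self.as_slice()
    }
}

/// Implicit `T: Sized` bound: accepts `&Vec<u8>` but not `&[u8]`.
fn byte_len_sized<T: AsBytes>(value: &T) -> usize {
    value.as_bytes().len()
}

/// `?Sized` removes the implicit bound: accepts both `&Vec<u8>` and `&[u8]`.
fn byte_len_unsized<T: AsBytes + ?Sized>(value: &T) -> usize {
    value.as_bytes().len()
}

// Calling `byte_len_sized(&[1u8, 2, 3][..])` would fail to compile because
// `[u8]` does not have a size known at compile time.
```

Calling `byte_len_unsized(&[1u8, 2, 3][..])` compiles and returns 3, while the equivalent call to `byte_len_sized` is rejected by the compiler.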

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 4, 2025

@mr-brobot mr-brobot Oct 4, 2025

| Benchmark | Sbbf | ArrowSbbf | Delta |
| --- | --- | --- | --- |
| i8 | 1.51 ns | 7.38 ns | +5.87 ns |
| i32 | 3.86 ns | 7.15 ns | +3.29 ns |
| Decimal128(5,2) | 1.73 ns | 7.69 ns | +5.96 ns |
| Decimal128(15,2) | 1.73 ns | 8.20 ns | +6.48 ns |
| Decimal128(30,2) | 1.73 ns | 5.85 ns | +4.12 ns |

so this means that casting the bloom filter results is slower?

@alamb alamb left a comment

Thank you @mr-brobot -- this is a nice contribution. I left some comments. Let me know what you think

 /// Check if an [AsBytes] value is probably present or definitely absent in the filter
-pub fn check<T: AsBytes>(&self, value: &T) -> bool {
+pub fn check<T: AsBytes + ?Sized>(&self, value: &T) -> bool {

why is this needed?

Self { sbbf, arrow_type }
}

/// Check if a value might be present in the bloom filter

What is the expected format of the bytes? It appears to be the Arrow representation 🤔

This code looks slightly different than what is in DataFusion. Not sure if that is good/bad 🤔

https://github.com/apache/datafusion/blob/522403bb44780679109055abca6048d21add0d25/datafusion/datasource-parquet/src/row_group_filter.rs#L239-L298

//! match column_chunk.column_type() {
//! ParquetType::INT32 => {
//! // Date64 was coerced to Date32 - convert milliseconds to days
//! let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;

how do you envision a user getting this date32_value?

I would expect that for an Arrow use case they would have a Date32Array 🤔

I wonder if the API would more cleanly be expressed as an array kernel? Something like

let boolean_array = ArrowSbbf::check(&date32_array)?;

Though I suppose for the common case where there is a single (constant) value this may be overkill

Development

Successfully merging this pull request may close these issues.

Bloom filters for i8 and i16 always return false negatives