# Arrow-Parquet SBBF coercion #8551
## Conversation
| Benchmark | Sbbf | ArrowSbbf | Delta |
|---|---|---|---|
| `i8` | 1.51 ns | 7.38 ns | +5.87 ns |
| `i32` | 3.86 ns | 7.15 ns | +3.29 ns |
| `Decimal128(5,2)` | 1.73 ns | 7.69 ns | +5.96 ns |
| `Decimal128(15,2)` | 1.73 ns | 8.20 ns | +6.48 ns |
| `Decimal128(30,2)` | 1.73 ns | 5.85 ns | +4.12 ns |
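For reference, a minimal sketch of how such a comparison could be reproduced with `criterion` (the `build_sbbf` / `build_arrow_sbbf` helpers are hypothetical placeholders, not this PR's actual bench code):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_check(c: &mut Criterion) {
    // Hypothetical helpers: construct an Sbbf and an ArrowSbbf over i32 keys.
    let sbbf = build_sbbf();
    let arrow_sbbf = build_arrow_sbbf();

    c.bench_function("Sbbf i32", |b| b.iter(|| sbbf.check(black_box(&42_i32))));
    c.bench_function("ArrowSbbf i32", |b| {
        b.iter(|| arrow_sbbf.check(black_box(&42_i32)))
    });
}

criterion_group!(benches, bench_check);
criterion_main!(benches);
```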
so this means that casting the bloom filter results is slower?
Thank you @mr-brobot -- this is a nice contribution. I left some comments. Let me know what you think
```diff
 /// Check if an [AsBytes] value is probably present or definitely absent in the filter
-pub fn check<T: AsBytes>(&self, value: &T) -> bool {
+pub fn check<T: AsBytes + ?Sized>(&self, value: &T) -> bool {
```
why is this needed?
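(Per the PR description further down, the bound was relaxed so unsized values such as slices can be passed, matching `Sbbf::insert`. A minimal sketch of what `?Sized` permits, assuming a filter obtained elsewhere:)

```rust
use parquet::bloom_filter::Sbbf;

// With the old bound `T: AsBytes`, `T` was implicitly `Sized`, so callers
// could not pass unsized values like `[u8]` or `str` directly.
// `T: AsBytes + ?Sized` makes this compile, mirroring `Sbbf::insert`:
fn probe(filter: &Sbbf, key: &[u8]) -> bool {
    filter.check(key) // `T` is inferred as the unsized `[u8]`
}
```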
```rust
        Self { sbbf, arrow_type }
    }

    /// Check if a value might be present in the bloom filter
```
What is the expected format of the bytes? It appears to be the Arrow representation 🤔
This code looks slightly different than what is in DataFusion. Not sure if that is good/bad 🤔
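For what it's worth, the mismatch the new type is meant to hide can be seen without any filter at all (an illustrative sketch; Parquet has no 8-bit physical type, so an Arrow `Int8` column is stored as `INT32`):

```rust
// The writer stores an Arrow Int8 column as Parquet INT32, so the bloom
// filter was populated from 4-byte i32 keys rather than 1-byte i8 keys.
let arrow_value: i8 = 42;
let parquet_value: i32 = arrow_value as i32;

// Different bytes hash differently, so probing with the raw Arrow value
// can produce a false negative:
assert_ne!(
    arrow_value.to_le_bytes().as_slice(),
    parquet_value.to_le_bytes().as_slice()
);
```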
```rust
//! match column_chunk.column_type() {
//!     ParquetType::INT32 => {
//!         // Date64 was coerced to Date32 - convert milliseconds to days
//!         let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;
```
How do you envision a user getting this `date32_value`? I would expect that for an Arrow use case they would have a `Date32Array` 🤔

I wonder if the API would be more cleanly expressed as an array kernel? Something like

```rust
let boolean_array = ArrowSbbf::check(&date32_array)?;
```

Though I suppose for the common case where there is a single (constant) value this may be overkill.
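For the record, a rough sketch of what such a kernel might look like (entirely hypothetical; neither `check_array` nor an element-wise `ArrowSbbf::check` of this shape is part of the PR):

```rust
use arrow_array::{BooleanArray, Date32Array};
// (import for `ArrowSbbf`, the type this PR adds, omitted)

// Hypothetical kernel: probe each non-null element against the filter;
// nulls in the input stay null in the output BooleanArray.
fn check_array(filter: &ArrowSbbf, array: &Date32Array) -> BooleanArray {
    array
        .iter()
        .map(|value| value.map(|days| filter.check(&days)))
        .collect()
}
```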
## Which issue does this PR close?

## Rationale for this change

Parquet types are a subset of Arrow types, so the Arrow writer must coerce to Parquet types. In some cases, this changes the physical representation. Therefore, passing Arrow data directly to `Sbbf::check` will produce false negatives. Correctness is only guaranteed when checking with the coerced Parquet value. This issue affects some integer and decimal types, and it can also affect `Date64` (see the sketch below).
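For example (an illustrative sketch, not code from this PR): a low-precision decimal may be written as Parquet `INT32` (depending on writer settings), while its Arrow `Decimal128` form occupies 16 bytes, so the filter and the raw Arrow value hash different bytes:

```rust
// Arrow Decimal128(5, 2) stores 123.45 as the i128 value 12345; with
// precision 5 the value fits in a 4-byte INT32.
let arrow_value: i128 = 12345;
let parquet_value: i32 = arrow_value as i32;

assert_eq!(arrow_value.to_le_bytes().len(), 16); // bytes hashed if passed raw
assert_eq!(parquet_value.to_le_bytes().len(), 4); // bytes the filter was built from
```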
## What changes are included in this PR?
Introduces `ArrowSbbf` as an Arrow-aware interface to the Parquet `Sbbf`. This coerces incoming data if necessary and calls `Sbbf::check`.

Currently, `Date64` types can be written as either `INT32` (days since epoch) or `INT64` (milliseconds since epoch), depending on Arrow writer properties (`coerce_types`). Instead of requiring additional information to handle this special (non-default) case, this implementation instructs users to coerce `Date64` to `Date32` if the Parquet column type is `INT32`, as sketched below. I'm open to feedback on this decision.
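A sketch of that caller-side coercion, mirroring the snippet in the new documentation (`MILLISECONDS_IN_DAY` is a local constant here):

```rust
// When the Parquet column type is INT32, coerce the Date64 value
// (milliseconds since the epoch) to Date32 (days since the epoch):
const MILLISECONDS_IN_DAY: i64 = 24 * 60 * 60 * 1000;

let date64_value: i64 = 1_700_000_000_000;
let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;
```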
## Are these changes tested?

There are tests for integer, float, decimal, and date types. They are not exhaustive, but they cover all cases where coercion is necessary.
## Are there any user-facing changes?
There is a new `ArrowSbbf` struct that most Arrow users should prefer over using `Sbbf` directly. Also, the `Sized` constraint was relaxed on the `Sbbf::check` function to support slices. This is consistent with `Sbbf::insert`.