fix: bug when struct nullability determined from Dict<_, ByteArray> column
#8573
Conversation
```rust
self.def_levels_buffer = self.record_reader.consume_def_levels();
self.rep_levels_buffer = self.record_reader.consume_rep_levels();
self.record_reader.reset();
```
Thanks for the quick action on this.
While debugging the bug, I also came to the conclusion that advancing the rep and def level buffers would make sense, but I'm lacking familiarity with these internals.
The docs for reset state that it should be called after consuming data, yet it isn't called in the shortcut case now. I'm missing a bit of context here, and the docs of GenericRecordReader aren't helping.
During my debug session, in the error case, num_values() and num_records() were both already 0 at that point, so resetting isn't doing much.
But I am wondering whether we can get into a state where num_records > num_values.
Perhaps reset could also be called before the early exit.
```rust
#[test]
fn test_read_nullable_structs_with_binary_dict_as_first_child_column() {
```
I verified that this test fails without the fix in parquet/src/arrow/array_reader/byte_array_dictionary.rs
Thanks @albertlockett 🙏
@valkum can you please verify that this fix resolves the example file you have on #8404 (comment)?
Update: I verified. The test on #8404 passes with this PR:
```text
warning: `parquet` (lib) generated 10 warnings (run `cargo fix --lib -p parquet` to apply 4 suggestions)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/rust_playground`
Read 1 rows
```
And fails on main:
```text
thread 'main' panicked at src/main.rs:18:27:
called `Result::unwrap()` on an `Err` value: ParquetError("Parquet error: Failed to decode level data for struct array")
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/std/src/panicking.rs:697:5
```
alamb left a comment
This looks reasonable to me -- thank you @albertlockett and @valkum
I am not super familiar with this code, so maybe @tustvold could give it a look if he has some time.
Just tested this PR with a larger file that caused this error. Works with this fix 👍
I'll plan to merge this PR in the next day or two unless anyone else would like a chance to review.
```rust
self.def_levels_buffer = self.record_reader.consume_def_levels();
self.rep_levels_buffer = self.record_reader.consume_rep_levels();
```
I'm still trying to understand the issue and the fix. Is the following reasoning correct?
When the struct array reader calls consume_batch on its children, the level buffers are set to the levels for the current batch. A subsequent call to consume_batch sees that the number of values is 0 and returns an empty array, but the level buffers still contain the levels from the already-consumed batch. So the fix here ensures that, on a subsequent call, the level buffers will be set to the empty ones from the consume_xxx_levels calls?
Hey @etseidl exactly, that is the correct reasoning
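For illustration, the stale-buffer behaviour described in the reasoning above can be sketched with a hypothetical minimal model. This is not the real parquet code: `MockRecordReader`, `ChildReader`, and `simulate` are invented names, and the real readers hold `Buffer`s rather than `Vec`s; the sketch only shows why skipping the level-buffer refresh on the empty early return leaves a length mismatch behind.

```rust
// Hypothetical minimal model of the bug (invented types, not the parquet API).
struct MockRecordReader {
    values: Vec<i32>,
    def_levels: Vec<i16>,
}

struct ChildReader {
    record_reader: MockRecordReader,
    def_levels_buffer: Vec<i16>,
    fixed: bool,
}

impl ChildReader {
    fn consume_batch(&mut self) -> Vec<i32> {
        if self.fixed {
            // Fix: always refresh the level buffer, even for an empty batch.
            self.def_levels_buffer = std::mem::take(&mut self.record_reader.def_levels);
        }
        let values = std::mem::take(&mut self.record_reader.values);
        if values.is_empty() {
            // Buggy path: early return without touching def_levels_buffer,
            // which still holds the previous batch's levels.
            return values;
        }
        if !self.fixed {
            self.def_levels_buffer = std::mem::take(&mut self.record_reader.def_levels);
        }
        values
    }
}

// Returns (values_len, def_levels_len) after a second, empty consume_batch.
fn simulate(fixed: bool) -> (usize, usize) {
    let mut reader = ChildReader {
        record_reader: MockRecordReader {
            values: vec![1, 2, 3],
            def_levels: vec![1, 1, 1],
        },
        def_levels_buffer: Vec::new(),
        fixed,
    };
    reader.consume_batch(); // first call drains the record reader
    let values = reader.consume_batch(); // second call: nothing left
    (values.len(), reader.def_levels_buffer.len())
}

fn main() {
    // Buggy: 0 values but 3 stale def levels -- the mismatch the parent
    // struct reader turns into a decode error.
    assert_eq!(simulate(false), (0, 3));
    // Fixed: both empty, lengths agree.
    assert_eq!(simulate(true), (0, 0));
    println!("buggy: {:?}, fixed: {:?}", simulate(false), simulate(true));
}
```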
etseidl left a comment
Thanks @albertlockett!
Thanks @albertlockett @valkum @alamb!
Which issue does this PR close?

Rationale for this change

A regression was reported in issue #8404, which was introduced in #7585. This PR resolves the issue.

What changes are included in this PR?

The root cause of the issue was that the behaviour of `ByteArrayDictionaryReader` is to return a new empty values array if the record reader has already been consumed. The problem was that the repetition and definition level buffers were not being advanced in this early-return case.

arrow-rs/parquet/src/arrow/array_reader/byte_array_dictionary.rs, lines 167 to 183 in 521f219

The `StructArrayReader` reads the repetition and definition levels from its first child to determine the nullability of the struct array. When we returned the empty values buffer for the child without advancing the repetition and definition buffers, the `StructArrayReader` saw a length mismatch between the empty child array and the non-empty nullability bitmask, and this produces the error.

arrow-rs/parquet/src/arrow/array_reader/struct_array.rs, lines 137 to 170 in 521f219

The fix is simple: always have `ByteArrayDictionaryReader` advance the repetition and definition level buffers when `consume_next_batch` is called.

Are these changes tested?

Yes, a new unit test was added, `test_read_nullable_structs_with_binary_dict_as_first_child_column`, which replicates the issue without the changes introduced in this PR.

Are there any user-facing changes?

No
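To make the `StructArrayReader` side of the description above concrete, here is a hedged, simplified sketch (not the real parquet code; `build_struct_nulls` and its parameters are invented for illustration) of how deriving the struct's nullability from the first child's definition levels fails when the child hands back an empty values array alongside stale levels:

```rust
// Simplified model: a slot of the struct is valid when its definition level
// reaches the struct's definition level. The child array and the resulting
// validity mask must describe the same number of rows.
fn build_struct_nulls(
    child_values_len: usize,
    child_def_levels: &[i16],
    struct_def_level: i16,
) -> Result<Vec<bool>, String> {
    let nulls: Vec<bool> = child_def_levels
        .iter()
        .map(|&d| d >= struct_def_level)
        .collect();
    // Stale levels left over from an already-consumed batch break this check.
    if nulls.len() != child_values_len {
        return Err("Failed to decode level data for struct array".to_string());
    }
    Ok(nulls)
}

fn main() {
    // Consistent batch: 3 child values, 3 definition levels.
    assert_eq!(
        build_struct_nulls(3, &[1, 1, 0], 1).unwrap(),
        vec![true, true, false]
    );
    // Bug scenario: the child returned an empty array but left 3 stale
    // levels behind, so the mask and the array disagree on length.
    assert!(build_struct_nulls(0, &[1, 1, 0], 1).is_err());
    println!("length-mismatch check demonstrated");
}
```

The actual fix lives entirely on the child side: once the child always drains its level buffers, the lengths agree again and this mismatch can no longer occur.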