-
Notifications
You must be signed in to change notification settings - Fork 1k
[thrift-remodel] Use new Thrift encoder/decoder for Parquet page headers #8376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[thrift-remodel] Use new Thrift encoder/decoder for Parquet page headers #8376
Conversation
| } | ||
| } | ||
|
|
||
| /// only implements ReadThrift for the give IDL struct definition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These macros will eventually go away...they were an experiment
This is likely huge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that all the tests pass is pretty amazing and a good vote of confidence. Thank you @etseidl
| repetition_levels_byte_length: rep_levels_byte_len as i32, | ||
| is_compressed: Some(is_compressed), | ||
| statistics: crate::file::statistics::to_thrift(statistics.as_ref()), | ||
| statistics: page_stats_to_thrift(statistics.as_ref()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally find this much easier to understand / read now -- and it is nice to see us being able to avoid all the into()
| ); | ||
|
|
||
| // expose for benchmarking | ||
| pub(crate) fn bench_file_metadata(bytes: &bytes::Bytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't it be in the benchmark function itself? Maybe we should just benchmark end to end metadata decoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FileMetaData is private to this module, so I added this function. I like seeing how much of the end-to-end time is from the actual thrift decoding vs the time spent turning that into the final metadata objects. We can remove once the remodel is complete.
| ); | ||
|
|
||
| // Statistics for the page header. This is separate because of the differing lifetime requirements | ||
| // for page handling vs column chunk. Once we start writing column chunks this might need to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment -- page statistics are part of the PageIndex, right? Or maybe I have my structures confused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a thrift Statistics field on both the column metadata and the page header. For the former I can use the Statistics<'a> struct which uses slices for the min/max fields. The page header reader cannot use slices, so I need the same struct but with vecs for the min/max. I can try to make this explanation clearer.
Thankfully we can now skip reading this field altogether and not incur the allocation cost.
| ); | ||
|
|
||
| impl DataPageHeader { | ||
| // reader that skips decoding page statistics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
Thanks for the review @alamb. I have been pleasantly surprised by how smoothly this has been going. |
| ) -> Result<PageHeader> { | ||
| let mut prot = TCompactInputProtocol::new(input); | ||
| Ok(PageHeader::read_from_in_protocol(&mut prot)?) | ||
| let mut prot = ThriftReadInputProtocol::new(input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I was thinking about last night was "how would we implement only decoding statistics / metadata for a subset of columns and/or Row Groups"
This PR plumbs the flag for reading page statistics down, but I wonder if it would make sense to start collecting the decoder functions into a struct
pub struct ParquetThriftDecoder {
read_page_stats: bool,
// which columns to read detailed statistics for
read_column_statistics: Vec<bool>,
// ....
}It seems like SerializedPageReaderContext is kind of fills this roll, but it only applies to a subset of encoding 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we could short-circuit the footer parsing and exit right after decoding the schema. With that in hand, we could then jump back in, skipping the schema and then we'd be able to skip over row groups or columns that we don't want. This would still incur the some of the thrift overhead, but skipping objects is quite a bit faster than decoding them.
I know I've seen this idea kicked around before, but we could also do a fast indexing pass over the metadata where we save the starting offsets of each row group and column chunk. We could then just do random access into the footer and decode only those structs we need.
Which issue does this PR close?
Note: this targets a feature branch, not main
Rationale for this change
This continues the remodel, moving on to
PageHeadersupport.What changes are included in this PR?
Swaps out old
formatpage header structs for new ones. This also adds aReadbased implementation of the thrift compact protocol reader (the sizes of the Thrift encoded page headers are not knowable in advance, so we need a way to read them from the thrift input stream used by the page decoder).This PR also makes decoding of the
Statisticsin the page header optional (defaults tofalse). We do not use them, and the decode takes a good chunk of time.Are these changes tested?
These changes should be covered by existing tests
Are there any user-facing changes?
Yes, page level stats are no longer decoded by default