Conversation

@etseidl etseidl commented Sep 17, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

This continues the remodel, moving on to PageHeader support.

What changes are included in this PR?

Swaps out the old-format page header structs for new ones. This also adds a `Read`-based implementation of the Thrift compact protocol reader (the sizes of Thrift-encoded page headers are not knowable in advance, so we need a way to read them from the Thrift input stream used by the page decoder).
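To make the motivation concrete, here is a minimal sketch (not the PR's actual API; the helper name and shape are hypothetical) of reading one ULEB128 varint, the integer encoding used by the Thrift compact protocol, from any `std::io::Read` source. Pulling bytes one at a time is what lets a decoder consume a header whose encoded size is unknown up front:

```rust
use std::io::Read;

// Hypothetical helper: read one ULEB128-encoded varint, as used by the
// Thrift compact protocol, from any `Read` source. Each byte carries 7
// payload bits; the high bit signals continuation, so we never need to
// know the total encoded size in advance.
fn read_uleb128<R: Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result: u64 = 0;
    let mut shift = 0;
    loop {
        let mut byte = [0u8; 1];
        reader.read_exact(&mut byte)?;
        result |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

fn main() {
    // 300 encodes as [0xAC, 0x02]; the trailing 0xFF belongs to whatever
    // comes next in the stream and must be left unconsumed.
    let mut data: &[u8] = &[0xAC, 0x02, 0xFF];
    assert_eq!(read_uleb128(&mut data).unwrap(), 300);
    assert_eq!(data, [0xFFu8].as_slice());
    println!("ok");
}
```

Because the reader stops exactly at the end of the varint, the page decoder can hand the same stream straight to the next field.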

This PR also makes decoding of the Statistics in the page header optional (defaults to false). We do not use them, and decoding them takes a good chunk of time.
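The win from skipping statistics can be sketched as follows (hypothetical simplified code, not the PR's implementation): a Thrift compact `binary` field is a length followed by that many bytes, so decoding it allocates a `Vec`, while skipping it just advances past the payload:

```rust
use std::io::Read;

// Hypothetical sketch: a length-prefixed binary field, as in the Thrift
// compact protocol (a single-byte length is enough for this example).
fn read_len<R: Read>(reader: &mut R) -> std::io::Result<usize> {
    let mut b = [0u8; 1];
    reader.read_exact(&mut b)?;
    Ok(b[0] as usize)
}

// Decoding copies the payload into a freshly allocated Vec.
fn decode_binary<R: Read>(reader: &mut R) -> std::io::Result<Vec<u8>> {
    let len = read_len(reader)?;
    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

// Skipping discards the payload without keeping the bytes, so no
// allocation is made. Making page-header statistics opt-in lets the
// reader take this cheap path by default.
fn skip_binary<R: Read>(reader: &mut R) -> std::io::Result<()> {
    let len = read_len(reader)?;
    std::io::copy(&mut reader.by_ref().take(len as u64), &mut std::io::sink())?;
    Ok(())
}

fn main() {
    let payload = [3u8, b'm', b'i', b'n', 0x15]; // binary field, then another byte
    let mut r: &[u8] = &payload;
    skip_binary(&mut r).unwrap();
    assert_eq!(r, [0x15u8].as_slice()); // positioned right after the field
    let mut r2: &[u8] = &payload;
    assert_eq!(decode_binary(&mut r2).unwrap(), b"min".to_vec());
    println!("ok");
}
```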

Are these changes tested?

These changes should be covered by existing tests.

Are there any user-facing changes?

Yes, page-level stats are no longer decoded by default.

@github-actions bot added the `parquet` (Changes to the parquet crate) label Sep 17, 2025
}
}

/// only implements ReadThrift for the given IDL struct definition
@etseidl (author):

These macros will eventually go away...they were an experiment

@etseidl added the `api-change` (Changes to the arrow API) label Sep 17, 2025
@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 18, 2025
alamb commented Sep 19, 2025

> Yes, page level stats are no longer decoded by default

This is likely huge

@alamb left a comment:

The fact that all the tests pass is pretty amazing and a good vote of confidence. Thank you @etseidl

  repetition_levels_byte_length: rep_levels_byte_len as i32,
  is_compressed: Some(is_compressed),
- statistics: crate::file::statistics::to_thrift(statistics.as_ref()),
+ statistics: page_stats_to_thrift(statistics.as_ref()),
@alamb:

I personally find this much easier to understand / read now -- and it is nice to see us being able to avoid all the into() calls

);

// expose for benchmarking
pub(crate) fn bench_file_metadata(bytes: &bytes::Bytes) {
@alamb:
Why can't it be in the benchmark function itself? Maybe we should just benchmark end to end metadata decoding?

@etseidl (author):

FileMetaData is private to this module, so I added this function. I like seeing how much of the end-to-end time comes from the actual Thrift decoding vs the time spent turning that into the final metadata objects. We can remove it once the remodel is complete.

);

// Statistics for the page header. This is separate because of the differing lifetime requirements
// for page handling vs column chunk. Once we start writing column chunks this might need to be
@alamb:

I don't understand this comment -- page statistics are part of the PageIndex, right? Or maybe I have my structures confused

@etseidl (author):

There is a thrift Statistics field on both the column metadata and the page header. For the former I can use the Statistics<'a> struct, which uses slices for the min/max fields. The page header reader cannot use slices, so I need the same struct but with Vecs for the min/max. I can try to make this explanation clearer.

Thankfully we can now skip reading this field altogether and not incur the allocation cost.
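The borrowed-vs-owned split described above can be illustrated with a minimal sketch (hypothetical type definitions, not the crate's actual structs): column-chunk statistics can borrow min/max directly from the long-lived footer buffer, while a streaming page-header read has no such buffer to borrow from and must own its bytes:

```rust
// Hypothetical illustration of the lifetime split: borrowed statistics
// for footer decoding, owned statistics for streaming page headers.
#[derive(Debug, PartialEq)]
struct Statistics<'a> {
    min_value: Option<&'a [u8]>,
    max_value: Option<&'a [u8]>,
}

#[derive(Debug, PartialEq)]
struct PageStatistics {
    min_value: Option<Vec<u8>>,
    max_value: Option<Vec<u8>>,
}

impl<'a> From<Statistics<'a>> for PageStatistics {
    fn from(s: Statistics<'a>) -> Self {
        // Converting from borrowed to owned is where the allocation
        // happens -- exactly the cost that skipping the field avoids.
        PageStatistics {
            min_value: s.min_value.map(|v| v.to_vec()),
            max_value: s.max_value.map(|v| v.to_vec()),
        }
    }
}

fn main() {
    let footer = b"ace"; // stand-in for the footer byte buffer
    let borrowed = Statistics {
        min_value: Some(&footer[..1]),
        max_value: Some(&footer[2..]),
    };
    let owned: PageStatistics = borrowed.into();
    assert_eq!(owned.min_value.as_deref(), Some(b"a".as_slice()));
    assert_eq!(owned.max_value.as_deref(), Some(b"e".as_slice()));
    println!("ok");
}
```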

);

impl DataPageHeader {
// reader that skips decoding page statistics
@alamb:
🚀

etseidl commented Sep 20, 2025

> The fact that all the tests pass is pretty amazing and a good vote of confidence.

Thanks for the review @alamb. I have been pleasantly surprised by how smoothly this has been going.

  ) -> Result<PageHeader> {
-     let mut prot = TCompactInputProtocol::new(input);
-     Ok(PageHeader::read_from_in_protocol(&mut prot)?)
+     let mut prot = ThriftReadInputProtocol::new(input);
@alamb:

Something I was thinking about last night was "how would we implement only decoding statistics / metadata for a subset of columns and/or Row Groups"

This PR plumbs the flag for reading page statistics down, but I wonder if it would make sense to start collecting the decoder functions into a struct

pub struct ParquetThriftDecoder { 
  read_page_stats: bool,
  // which columns to read detailed statistics for
  read_column_statistics: Vec<bool>,
  // ....
}

It seems like SerializedPageReaderContext kind of fills this role, but it only applies to a subset of encoding 🤔
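One way the options-struct idea above might look in practice (all names hypothetical, not the crate's API) is a single value that travels with the decoder and answers "should this field be decoded or skipped?":

```rust
// Hypothetical decode-options struct: per-reader flags for page stats
// plus per-column flags for detailed column statistics.
#[derive(Default)]
struct ParquetThriftDecodeOptions {
    read_page_stats: bool,
    // which columns to decode detailed statistics for; missing = skip
    read_column_statistics: Vec<bool>,
}

impl ParquetThriftDecodeOptions {
    // Out-of-range columns default to skipping, so the cheap path is
    // also the safe default.
    fn decode_stats_for_column(&self, col: usize) -> bool {
        self.read_column_statistics.get(col).copied().unwrap_or(false)
    }
}

fn main() {
    let opts = ParquetThriftDecodeOptions {
        read_page_stats: false,
        read_column_statistics: vec![true, false],
    };
    assert!(opts.decode_stats_for_column(0));
    assert!(!opts.decode_stats_for_column(1));
    assert!(!opts.decode_stats_for_column(5)); // out of range -> skip
    assert!(!opts.read_page_stats);
    println!("ok");
}
```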

@etseidl (author):

I was thinking we could short-circuit the footer parsing and exit right after decoding the schema. With that in hand, we could then jump back in, skip the schema, and then skip over row groups or columns that we don't want. This would still incur some of the Thrift overhead, but skipping objects is quite a bit faster than decoding them.

I know I've seen this idea kicked around before, but we could also do a fast indexing pass over the metadata where we save the starting offsets of each row group and column chunk. We could then just do random access into the footer and decode only those structs we need.
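The indexing-pass idea can be sketched with a toy format (length-prefixed records standing in for real Thrift structs; everything here is hypothetical): a first cheap pass records each record's starting offset by skipping rather than decoding, and a later pass random-accesses only the records that are needed:

```rust
// Toy indexing pass: walk length-prefixed records and record where each
// one starts, without decoding any payload.
fn index_offsets(buf: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        offsets.push(pos);
        let len = buf[pos] as usize; // 1-byte length prefix for the sketch
        pos += 1 + len; // skip over the payload without decoding it
    }
    offsets
}

// "Decode" a single record via random access into the buffer.
fn decode_record(buf: &[u8], offset: usize) -> &[u8] {
    let len = buf[offset] as usize;
    &buf[offset + 1..offset + 1 + len]
}

fn main() {
    // three stand-in "row groups": "rg0", "rg1!", "x"
    let buf = [3, b'r', b'g', b'0', 4, b'r', b'g', b'1', b'!', 1, b'x'];
    let offsets = index_offsets(&buf);
    assert_eq!(offsets, vec![0, 4, 9]);
    // decode only the second record, skipping the rest entirely
    assert_eq!(decode_record(&buf, offsets[1]), b"rg1!".as_slice());
    println!("ok");
}
```

Real Thrift structs are not length-prefixed, so the indexing pass would still have to skip field-by-field, but skipping remains much cheaper than materializing the decoded objects.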

@etseidl etseidl merged commit 3dbd42e into apache:gh5854_thrift_remodel Sep 23, 2025
16 checks passed
@etseidl etseidl deleted the read_page_header branch October 10, 2025 14:35
