Conversation

@etseidl etseidl commented Sep 17, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

This continues the remodel, moving on to PageHeader support.

What changes are included in this PR?

Swaps out the old-format page header structs for new ones. This also adds a `Read`-based implementation of the Thrift compact protocol reader (the sizes of Thrift-encoded page headers are not knowable in advance, so we need a way to read them from the Thrift input stream used by the page decoder).
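To make the motivation concrete, here is a minimal sketch (not the PR's actual API; the helper name and shape are hypothetical) of reading one ULEB128 varint, the integer encoding used by the Thrift compact protocol, from any `std::io::Read` source. Pulling bytes one at a time is what lets a decoder consume a header whose encoded size is unknown up front:

```rust
use std::io::Read;

// Hypothetical helper: read one ULEB128-encoded varint, as used by the
// Thrift compact protocol, from any `Read` source. Each byte carries 7
// payload bits; the high bit signals continuation, so we never need to
// know the total encoded size in advance.
fn read_uleb128<R: Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result: u64 = 0;
    let mut shift = 0;
    loop {
        let mut byte = [0u8; 1];
        reader.read_exact(&mut byte)?;
        result |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

fn main() {
    // 300 encodes as [0xAC, 0x02]; the trailing 0xFF belongs to whatever
    // comes next in the stream and must be left unconsumed.
    let mut data: &[u8] = &[0xAC, 0x02, 0xFF];
    assert_eq!(read_uleb128(&mut data).unwrap(), 300);
    assert_eq!(data, [0xFFu8].as_slice());
    println!("ok");
}
```

Because the reader stops exactly at the end of the varint, the page decoder can hand the same stream straight to the next field.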

This PR also makes decoding of the Statistics in the page header optional (defaults to false). We do not use them, and decoding them takes a good chunk of time.
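The win from skipping statistics can be sketched as follows (hypothetical simplified code, not the PR's implementation): a Thrift compact `binary` field is a length followed by that many bytes, so decoding it allocates a `Vec`, while skipping it just advances past the payload:

```rust
use std::io::Read;

// Hypothetical sketch: a length-prefixed binary field, as in the Thrift
// compact protocol (a single-byte length is enough for this example).
fn read_len<R: Read>(reader: &mut R) -> std::io::Result<usize> {
    let mut b = [0u8; 1];
    reader.read_exact(&mut b)?;
    Ok(b[0] as usize)
}

// Decoding copies the payload into a freshly allocated Vec.
fn decode_binary<R: Read>(reader: &mut R) -> std::io::Result<Vec<u8>> {
    let len = read_len(reader)?;
    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

// Skipping discards the payload without keeping the bytes, so no
// allocation is made. Making page-header statistics opt-in lets the
// reader take this cheap path by default.
fn skip_binary<R: Read>(reader: &mut R) -> std::io::Result<()> {
    let len = read_len(reader)?;
    std::io::copy(&mut reader.by_ref().take(len as u64), &mut std::io::sink())?;
    Ok(())
}

fn main() {
    let payload = [3u8, b'm', b'i', b'n', 0x15]; // binary field, then another byte
    let mut r: &[u8] = &payload;
    skip_binary(&mut r).unwrap();
    assert_eq!(r, [0x15u8].as_slice()); // positioned right after the field
    let mut r2: &[u8] = &payload;
    assert_eq!(decode_binary(&mut r2).unwrap(), b"min".to_vec());
    println!("ok");
}
```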

Are these changes tested?

These changes should be covered by existing tests.

Are there any user-facing changes?

Yes, page-level stats are no longer decoded by default.

@github-actions bot added the `parquet` (Changes to the parquet crate) label Sep 17, 2025
}
}

/// only implements ReadThrift for the given IDL struct definition
@etseidl (author):

These macros will eventually go away...they were an experiment

@etseidl added the `api-change` (Changes to the arrow API) label Sep 17, 2025
@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 18, 2025
alamb commented Sep 19, 2025

> Yes, page level stats are no longer decoded by default

This is likely huge

@alamb left a comment:

The fact that all the tests pass is pretty amazing and a good vote of confidence. Thank you @etseidl

  repetition_levels_byte_length: rep_levels_byte_len as i32,
  is_compressed: Some(is_compressed),
- statistics: crate::file::statistics::to_thrift(statistics.as_ref()),
+ statistics: page_stats_to_thrift(statistics.as_ref()),
@alamb:

I personally find this much easier to understand / read now -- and it is nice to see us being able to avoid all the into() calls

);

// expose for benchmarking
pub(crate) fn bench_file_metadata(bytes: &bytes::Bytes) {
@alamb:
Why can't it be in the benchmark function itself? Maybe we should just benchmark end to end metadata decoding?

@etseidl (author):

FileMetaData is private to this module, so I added this function. I like seeing how much of the end-to-end time comes from the actual Thrift decoding vs the time spent turning that into the final metadata objects. We can remove it once the remodel is complete.

);

// Statistics for the page header. This is separate because of the differing lifetime requirements
// for page handling vs column chunk. Once we start writing column chunks this might need to be
@alamb:

I don't understand this comment -- page statistics are part of the PageIndex, right? Or maybe I have my structures confused

@etseidl (author):

There is a thrift Statistics field on both the column metadata and the page header. For the former I can use the Statistics<'a> struct, which uses slices for the min/max fields. The page header reader cannot use slices, so I need the same struct but with Vecs for the min/max. I can try to make this explanation clearer.

Thankfully we can now skip reading this field altogether and not incur the allocation cost.
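The borrowed-vs-owned split described above can be illustrated with a minimal sketch (hypothetical type definitions, not the crate's actual structs): column-chunk statistics can borrow min/max directly from the long-lived footer buffer, while a streaming page-header read has no such buffer to borrow from and must own its bytes:

```rust
// Hypothetical illustration of the lifetime split: borrowed statistics
// for footer decoding, owned statistics for streaming page headers.
#[derive(Debug, PartialEq)]
struct Statistics<'a> {
    min_value: Option<&'a [u8]>,
    max_value: Option<&'a [u8]>,
}

#[derive(Debug, PartialEq)]
struct PageStatistics {
    min_value: Option<Vec<u8>>,
    max_value: Option<Vec<u8>>,
}

impl<'a> From<Statistics<'a>> for PageStatistics {
    fn from(s: Statistics<'a>) -> Self {
        // Converting from borrowed to owned is where the allocation
        // happens -- exactly the cost that skipping the field avoids.
        PageStatistics {
            min_value: s.min_value.map(|v| v.to_vec()),
            max_value: s.max_value.map(|v| v.to_vec()),
        }
    }
}

fn main() {
    let footer = b"ace"; // stand-in for the footer byte buffer
    let borrowed = Statistics {
        min_value: Some(&footer[..1]),
        max_value: Some(&footer[2..]),
    };
    let owned: PageStatistics = borrowed.into();
    assert_eq!(owned.min_value.as_deref(), Some(b"a".as_slice()));
    assert_eq!(owned.max_value.as_deref(), Some(b"e".as_slice()));
    println!("ok");
}
```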

);

impl DataPageHeader {
// reader that skips decoding page statistics
@alamb:
🚀

etseidl commented Sep 20, 2025

> The fact that all the tests pass is pretty amazing and a good vote of confidence.

Thanks for the review @alamb. I have been pleasantly surprised by how smoothly this has been going.

  ) -> Result<PageHeader> {
-     let mut prot = TCompactInputProtocol::new(input);
-     Ok(PageHeader::read_from_in_protocol(&mut prot)?)
+     let mut prot = ThriftReadInputProtocol::new(input);
@alamb:

Something I was thinking about last night was "how would we implement only decoding statistics / metadata for a subset of columns and/or Row Groups"

This PR plumbs the flag for reading page statistics down, but I wonder if it would make sense to start collecting the decoder functions into a struct

pub struct ParquetThriftDecoder { 
  read_page_stats: bool,
  // which columns to read detailed statistics for
  read_column_statistics: Vec<bool>,
  // ....
}

It seems like SerializedPageReaderContext kind of fills this role, but it only applies to a subset of encoding 🤔
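One way the options-struct idea above might look in practice (all names hypothetical, not the crate's API) is a single value that travels with the decoder and answers "should this field be decoded or skipped?":

```rust
// Hypothetical decode-options struct: per-reader flags for page stats
// plus per-column flags for detailed column statistics.
#[derive(Default)]
struct ParquetThriftDecodeOptions {
    read_page_stats: bool,
    // which columns to decode detailed statistics for; missing = skip
    read_column_statistics: Vec<bool>,
}

impl ParquetThriftDecodeOptions {
    // Out-of-range columns default to skipping, so the cheap path is
    // also the safe default.
    fn decode_stats_for_column(&self, col: usize) -> bool {
        self.read_column_statistics.get(col).copied().unwrap_or(false)
    }
}

fn main() {
    let opts = ParquetThriftDecodeOptions {
        read_page_stats: false,
        read_column_statistics: vec![true, false],
    };
    assert!(opts.decode_stats_for_column(0));
    assert!(!opts.decode_stats_for_column(1));
    assert!(!opts.decode_stats_for_column(5)); // out of range -> skip
    assert!(!opts.read_page_stats);
    println!("ok");
}
```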

@etseidl (author):

I was thinking we could short-circuit the footer parsing and exit right after decoding the schema. With that in hand, we could then jump back in, skip the schema, and then skip over row groups or columns that we don't want. This would still incur some of the Thrift overhead, but skipping objects is quite a bit faster than decoding them.

I know I've seen this idea kicked around before, but we could also do a fast indexing pass over the metadata where we save the starting offsets of each row group and column chunk. We could then just do random access into the footer and decode only those structs we need.
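The indexing-pass idea can be sketched with a toy format (length-prefixed records standing in for real Thrift structs; everything here is hypothetical): a first cheap pass records each record's starting offset by skipping rather than decoding, and a later pass random-accesses only the records that are needed:

```rust
// Toy indexing pass: walk length-prefixed records and record where each
// one starts, without decoding any payload.
fn index_offsets(buf: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        offsets.push(pos);
        let len = buf[pos] as usize; // 1-byte length prefix for the sketch
        pos += 1 + len; // skip over the payload without decoding it
    }
    offsets
}

// "Decode" a single record via random access into the buffer.
fn decode_record(buf: &[u8], offset: usize) -> &[u8] {
    let len = buf[offset] as usize;
    &buf[offset + 1..offset + 1 + len]
}

fn main() {
    // three stand-in "row groups": "rg0", "rg1!", "x"
    let buf = [3, b'r', b'g', b'0', 4, b'r', b'g', b'1', b'!', 1, b'x'];
    let offsets = index_offsets(&buf);
    assert_eq!(offsets, vec![0, 4, 9]);
    // decode only the second record, skipping the rest entirely
    assert_eq!(decode_record(&buf, offsets[1]), b"rg1!".as_slice());
    println!("ok");
}
```

Real Thrift structs are not length-prefixed, so the indexing pass would still have to skip field-by-field, but skipping remains much cheaper than materializing the decoded objects.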

@etseidl etseidl merged commit 3dbd42e into apache:gh5854_thrift_remodel Sep 23, 2025
16 checks passed
@etseidl etseidl deleted the read_page_header branch October 10, 2025 14:35
