Use custom thrift decoder to improve speed of parsing parquet metadata #5854

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of #5853

Parsing the parquet metadata takes substantial time, and most of that time is spent decoding the thrift format (@XiangpengHao is quantifying this in #5770).

Describe the solution you'd like
Improve the speed of the thrift decoder used to parse the parquet metadata.

Describe alternatives you've considered
@jhorstmann reports on #5775 that he made a prototype of this:

I had an attack of "not invented here" syndrome the last few days 😅 and worked on an alternative code generator for thrift that would allow me to more easily try out some changes to the generated code. The repo can be found at <https://github.com/jhorstmann/compact-thrift/> and the output for `parquet.thrift` at <https://github.com/jhorstmann/compact-thrift/blob/main/src/main/rust/tests/parquet.rs>.

The current output is still doing allocations for string and binary, but running the benchmarks from https://github.com/tustvold/arrow-rs/tree/thrift-bench shows some nice improvements. This is the comparison with current arrow-rs code, so both versions should be doing the same amount of allocations:

decode metadata      time:   [32.592 ms 32.645 ms 32.702 ms]

decode metadata new  time:   [17.440 ms 17.476 ms 17.532 ms]

So, incidentally, that is very close to that 2x improvement (about 1.87x on the median times).

The main difference in the code should be avoiding most of the abstractions from TInputProtocol and avoiding stack moves by directly writing into default-initialized structs instead of moving from local variables.

Originally posted by @jhorstmann in #5775 (comment)
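
To make the two ideas in the quoted comment concrete, here is a minimal sketch, not the code from the compact-thrift repo or from arrow-rs. It assumes a hypothetical two-field struct `RowGroupStats` (not the real parquet type), handles only `i64` fields with short-form field headers of the thrift compact protocol, and reads straight from a byte slice instead of going through a generic protocol trait. `decode_via_locals` fills local `Option`s and moves them into the struct at the end; `decode_in_place` writes into a default-initialized struct. All names here are illustrative assumptions.

```rust
/// Compact-protocol type id for i64 fields.
const TYPE_I64: u8 = 6;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub struct DecodeError;

/// Hypothetical stand-in for a generated struct (field ids 1 and 2).
#[derive(Debug, Default, PartialEq)]
pub struct RowGroupStats {
    pub num_rows: i64,
    pub total_byte_size: i64,
}

/// Read an unsigned LEB128 varint directly from the slice, advancing it.
fn read_uvarint(buf: &mut &[u8]) -> Result<u64, DecodeError> {
    let mut value = 0u64;
    let mut shift = 0u32;
    while shift < 64 {
        let (&byte, rest) = buf.split_first().ok_or(DecodeError)?;
        *buf = rest;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Ok(value);
        }
        shift += 7;
    }
    Err(DecodeError) // more than 10 continuation bytes
}

/// Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, ...
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Read one short-form field header plus an i64 value.
/// Returns Ok(None) at the stop byte. Anything this sketch does not
/// handle (long-form ids, non-i64 types) is reported as an error.
fn next_i64_field(
    buf: &mut &[u8],
    last_id: &mut i16,
) -> Result<Option<(i16, i64)>, DecodeError> {
    let (&header, rest) = buf.split_first().ok_or(DecodeError)?;
    *buf = rest;
    if header == 0 {
        return Ok(None);
    }
    let (delta, ty) = ((header >> 4) as i16, header & 0x0f);
    if delta == 0 || ty != TYPE_I64 {
        return Err(DecodeError);
    }
    *last_id = last_id.checked_add(delta).ok_or(DecodeError)?;
    Ok(Some((*last_id, zigzag_decode(read_uvarint(buf)?))))
}

/// Style 1: collect fields in locals, then move them into the struct.
pub fn decode_via_locals(mut buf: &[u8]) -> Result<RowGroupStats, DecodeError> {
    let mut num_rows: Option<i64> = None;
    let mut total_byte_size: Option<i64> = None;
    let mut last_id = 0i16;
    while let Some((id, value)) = next_i64_field(&mut buf, &mut last_id)? {
        match id {
            1 => num_rows = Some(value),
            2 => total_byte_size = Some(value),
            _ => {} // unknown i64 field: ignore
        }
    }
    Ok(RowGroupStats {
        num_rows: num_rows.ok_or(DecodeError)?,
        total_byte_size: total_byte_size.ok_or(DecodeError)?,
    })
}

/// Style 2: write each field directly into a default-initialized struct.
pub fn decode_in_place(mut buf: &[u8]) -> Result<RowGroupStats, DecodeError> {
    let mut out = RowGroupStats::default();
    let mut last_id = 0i16;
    while let Some((id, value)) = next_i64_field(&mut buf, &mut last_id)? {
        match id {
            1 => out.num_rows = value,
            2 => out.total_byte_size = value,
            _ => {} // unknown i64 field: ignore
        }
    }
    Ok(out)
}
```

For example, the bytes `[0x16, 0xD8, 0x04, 0x16, 0x01, 0x00]` encode `num_rows = 300` and `total_byte_size = -1`, and both functions decode them to the same struct. One trade-off of the in-place style in this sketch: a missing field silently keeps its default value, whereas the locals-based version reports an error. For a struct this small the two styles are unlikely to differ measurably; the sketch only illustrates the shape of the generated code, not the benchmark numbers above.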

Additional context

Labels

enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate)
