Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns #5770

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several proposals for remedying perceived issues with Parquet, generally by introducing new formats, for example Lance V2 and Nimble.

One of the technical challenges raised about Parquet is that the metadata is encoded such that the entire footer must be read and decoded prior to reading any data.

As the number of columns increases, the argument goes, the size of the Parquet metadata grows beyond the ~8MB sweet spot for a single object store request and requires substantial CPU to decode.

However, my theory is that the reason Parquet metadata is typically so large for schemas with many columns is the embedded min/max statistics for columns / pages.
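One way to sanity-check this theory on an existing file is a minimal sketch along these lines, using the parquet crate's `SerializedFileReader`: decode the footer and count how many column chunks actually carry statistics. The `data.parquet` path is just a placeholder, not a file referenced by this issue.

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata();

    let num_columns = meta.file_metadata().schema_descr().num_columns();
    let num_row_groups = meta.num_row_groups();

    // The footer stores per-column-chunk metadata (offsets, encodings, and
    // optionally min/max statistics) for every row group, so it grows roughly
    // with columns x row groups.
    let chunks_with_stats = meta
        .row_groups()
        .iter()
        .flat_map(|rg| rg.columns())
        .filter(|col| col.statistics().is_some())
        .count();

    println!(
        "{num_columns} columns x {num_row_groups} row groups; \
         {chunks_with_stats} column chunks have statistics"
    );
    Ok(())
}
```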

Describe the solution you'd like
I would like to gather data on parquet footer metadata size as a function of:

  1. The number of columns
  2. The number of row groups
  3. Whether statistics are enabled or disabled

And then report this in a blog with some sort of conclusion about how well Parquet can handle large schemas.

Bonus points if we can also measure the in-memory size (though this will of course vary from implementation to implementation). A rough sketch of the kind of measurement harness I have in mind is included below.
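As a rough sketch of what the data gathering could look like (assuming the Rust arrow and parquet crates; the column counts, row counts, row group size, and output paths below are illustrative placeholders, not the final benchmark parameters): write files with a varying number of Int64 columns and statistics enabled or disabled, then read the 4-byte little-endian footer length stored just before the trailing `PAR1` magic to get the serialized metadata size.

```rust
use std::fs::File;
use std::io::Read;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Size in bytes of the serialized footer metadata: the 4-byte little-endian
/// length stored immediately before the trailing "PAR1" magic.
fn footer_len(path: &str) -> std::io::Result<u32> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;
    let n = bytes.len();
    Ok(u32::from_le_bytes(bytes[n - 8..n - 4].try_into().unwrap()))
}

/// Write a file with `num_columns` Int64 columns and the given statistics level.
fn write_file(
    path: &str,
    num_columns: usize,
    stats: EnabledStatistics,
) -> Result<(), Box<dyn std::error::Error>> {
    let fields: Vec<Field> = (0..num_columns)
        .map(|i| Field::new(format!("col_{i}"), DataType::Int64, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));

    let columns: Vec<ArrayRef> = (0..num_columns)
        .map(|_| Arc::new(Int64Array::from_iter_values(0..1024)) as ArrayRef)
        .collect();
    let batch = RecordBatch::try_new(schema.clone(), columns)?;

    let props = WriterProperties::builder()
        .set_statistics_enabled(stats)
        .set_max_row_group_size(256) // small value to force several row groups
        .build();

    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for &num_columns in &[10usize, 100, 1000] {
        for (label, stats) in [
            ("page", EnabledStatistics::Page),
            ("none", EnabledStatistics::None),
        ] {
            let path = format!("/tmp/meta_{num_columns}_{label}.parquet");
            write_file(&path, num_columns, stats)?;
            println!(
                "{num_columns:>5} columns, stats={label}: footer = {} bytes",
                footer_len(&path)?
            );
        }
    }
    Ok(())
}
```

Varying `set_max_row_group_size` (or the number of batches written) would cover the row-group dimension, and `EnabledStatistics::Chunk` could be added as a middle setting between `Page` and `None`.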

Describe alternatives you've considered

Additional context
Related discussion with @wesm on twitter: https://twitter.com/wesmckinn/status/1790884370603024826

Cited issue apache/arrow#39676

Labels

arrow (Changes to the arrow crate), documentation (Improvements or additions to documentation), enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)
