Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several proposals for remedying perceived issues with Parquet, generally by introducing new formats, for example Lance V2 and Nimble.
One of the technical challenges raised about Parquet is that the metadata is encoded such that the entire footer must be read and decoded prior to reading any data.
As the number of columns increases, the argument goes, the size of the Parquet metadata grows beyond the ~8MB sweet spot for a single object store request and requires substantial CPU to decode.
However, my theory is that the reason Parquet metadata is typically so large for schemas with many columns is the embedded min/max statistics values for columns / pages.
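One quick way to sanity-check this theory on an existing file is to inspect which column chunks carry statistics in the footer. A minimal sketch using the Rust `parquet` crate (the path `data.parquet` is just a placeholder):

```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    // Placeholder path: point this at whatever file is being examined
    let file = File::open("data.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();

    // Count how many column chunks carry min/max statistics in the footer
    let mut with_stats = 0;
    let mut total = 0;
    for rg in metadata.row_groups() {
        for col in rg.columns() {
            total += 1;
            if col.statistics().is_some() {
                with_stats += 1;
            }
        }
    }
    println!(
        "{} row groups, {}/{} column chunks carry statistics",
        metadata.num_row_groups(),
        with_stats,
        total
    );
}
```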
Describe the solution you'd like
I would like to gather data on parquet footer metadata size as a function of:
- The number of columns
- The number of row groups
- Whether statistics are enabled or disabled
And then report this in a blog post with some sort of conclusion about how well Parquet can handle large schemas.
Bonus points if we can also measure the in-memory size (though this will of course vary from implementation to implementation).
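Here is a minimal sketch of how the measurement could be run, using the Rust `parquet` and `arrow` crates. The `footer_size` helper and the specific column counts / row group counts are hypothetical; the idea is just to write files with varying shapes and read the 4-byte footer length that precedes the trailing `PAR1` magic:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Write an in-memory file with `num_columns` Int64 columns split into
/// `num_row_groups` row groups, and return the serialized footer size in bytes.
fn footer_size(num_columns: usize, num_row_groups: usize, stats: EnabledStatistics) -> usize {
    let fields: Vec<Field> = (0..num_columns)
        .map(|i| Field::new(format!("col_{i}"), DataType::Int64, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));

    // One batch per row group, each exactly `rows_per_group` rows
    let rows_per_group = 1024;
    let columns: Vec<ArrayRef> = (0..num_columns)
        .map(|_| Arc::new(Int64Array::from_iter_values(0..rows_per_group as i64)) as ArrayRef)
        .collect();
    let batch = RecordBatch::try_new(schema.clone(), columns).unwrap();

    let props = WriterProperties::builder()
        .set_statistics_enabled(stats)
        .set_max_row_group_size(rows_per_group)
        .build();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema, Some(props)).unwrap();
    for _ in 0..num_row_groups {
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();

    // A Parquet file ends with a 4-byte little-endian footer length followed by
    // the "PAR1" magic, so the 4 bytes at len-8..len-4 give the size of the
    // Thrift-encoded footer metadata.
    let len = buffer.len();
    let footer_len = u32::from_le_bytes(buffer[len - 8..len - 4].try_into().unwrap());
    footer_len as usize
}

fn main() {
    for &cols in &[10, 100, 1000] {
        for &groups in &[1, 10] {
            for stats in [
                EnabledStatistics::None,
                EnabledStatistics::Chunk,
                EnabledStatistics::Page,
            ] {
                let size = footer_size(cols, groups, stats);
                println!("columns={cols:5} row_groups={groups:3} stats={stats:?} footer_bytes={size}");
            }
        }
    }
}
```

This only varies column count, row group count, and the statistics level; real-world footers will also depend on column name lengths, physical types, and string min/max sizes, so those axes would need to be added for the blog post.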
Describe alternatives you've considered
Additional context
Related discussion with @wesm on twitter: https://twitter.com/wesmckinn/status/1790884370603024826
Cited issue apache/arrow#39676