Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several proposals for remedying perceived issues with Parquet, generally by introducing new formats, for example Lance V2 and Nimble.
One of the technical challenges raised about Parquet is that the metadata is encoded such that the entire footer must be read and decoded prior to reading any data.
As the number of columns increases, the argument goes, the size of the Parquet metadata grows beyond the ~8MB sweet spot for a single object store request and requires substantial CPU to decode.
However, my theory is that the reason Parquet metadata is typically so large for schemas with many columns is the embedded min/max statistics values for columns / pages.
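One quick way to sanity-check this theory on an existing file is to inspect which column chunks carry statistics in the footer. A minimal sketch using the Rust `parquet` crate (the path `data.parquet` is just a placeholder):

```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    // Placeholder path: point this at whatever file is being examined
    let file = File::open("data.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();

    // Count how many column chunks carry min/max statistics in the footer
    let mut with_stats = 0;
    let mut total = 0;
    for rg in metadata.row_groups() {
        for col in rg.columns() {
            total += 1;
            if col.statistics().is_some() {
                with_stats += 1;
            }
        }
    }
    println!(
        "{} row groups, {}/{} column chunks carry statistics",
        metadata.num_row_groups(),
        with_stats,
        total
    );
}
```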
Describe the solution you'd like
I would like to gather data on parquet footer metadata size as a function of:
- The number of columns
- The number of row groups
- Whether statistics are enabled or disabled
And then report this in a blog post with some sort of conclusion about how well Parquet can handle large schemas.
Bonus points if we can also measure the in-memory size (though this will of course vary from implementation to implementation).
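Here is a minimal sketch of how the measurement could be run, using the Rust `parquet` and `arrow` crates. The `footer_size` helper and the specific column counts / row group counts are hypothetical; the idea is just to write files with varying shapes and read the 4-byte footer length that precedes the trailing `PAR1` magic:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Write an in-memory file with `num_columns` Int64 columns split into
/// `num_row_groups` row groups, and return the serialized footer size in bytes.
fn footer_size(num_columns: usize, num_row_groups: usize, stats: EnabledStatistics) -> usize {
    let fields: Vec<Field> = (0..num_columns)
        .map(|i| Field::new(format!("col_{i}"), DataType::Int64, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));

    // One batch per row group, each exactly `rows_per_group` rows
    let rows_per_group = 1024;
    let columns: Vec<ArrayRef> = (0..num_columns)
        .map(|_| Arc::new(Int64Array::from_iter_values(0..rows_per_group as i64)) as ArrayRef)
        .collect();
    let batch = RecordBatch::try_new(schema.clone(), columns).unwrap();

    let props = WriterProperties::builder()
        .set_statistics_enabled(stats)
        .set_max_row_group_size(rows_per_group)
        .build();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema, Some(props)).unwrap();
    for _ in 0..num_row_groups {
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();

    // A Parquet file ends with a 4-byte little-endian footer length followed by
    // the "PAR1" magic, so the 4 bytes at len-8..len-4 give the size of the
    // Thrift-encoded footer metadata.
    let len = buffer.len();
    let footer_len = u32::from_le_bytes(buffer[len - 8..len - 4].try_into().unwrap());
    footer_len as usize
}

fn main() {
    for &cols in &[10, 100, 1000] {
        for &groups in &[1, 10] {
            for stats in [
                EnabledStatistics::None,
                EnabledStatistics::Chunk,
                EnabledStatistics::Page,
            ] {
                let size = footer_size(cols, groups, stats);
                println!("columns={cols:5} row_groups={groups:3} stats={stats:?} footer_bytes={size}");
            }
        }
    }
}
```

This only varies column count, row group count, and the statistics level; real-world footers will also depend on column name lengths, physical types, and string min/max sizes, so those axes would need to be added for the blog post.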
Describe alternatives you've considered
Additional context
Related discussion with @wesm on twitter: https://twitter.com/wesmckinn/status/1790884370603024826
Cited issue apache/arrow#39676