Motivation
Suppose a single Parquet file is 4 GB in size and contains 10 row groups. If the file could be split at row-group boundaries, it would yield 10 splits that can be read in parallel, which would significantly improve read concurrency.
A rough implementation path
- Expand the DataSplit interface to include support for a file-level start offset and length.
- Then, implement the ability to split files by offset and length to generate corresponding data splits (for instance, Parquet files can be split further by row groups).
- Finally, ensure that the corresponding readers, such as the Parquet, ORC, and Avro readers, can read only the rows that fall within a given file path, start offset, and length.
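
To make the steps above more concrete, here is a minimal sketch of how one file could be turned into per-row-group ranges. All names in it (`RowGroupSplitSketch`, `FileRange`, `splitByRowGroups`, `owns`) are hypothetical and are not part of any existing interface. It assumes the caller has already read the row-group start offsets and sizes from the file footer, and the ownership rule on the reader side is just one possible convention.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from an existing API. */
class RowGroupSplitSketch {

    /** A byte range of one data file: [offset, offset + length). */
    static final class FileRange {
        final String path;
        final long offset;
        final long length;

        FileRange(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return path + "@" + offset + "+" + length;
        }
    }

    /**
     * Produces one range per row group. rowGroupOffsets[i] is the first byte of
     * row group i and rowGroupLengths[i] is its size in bytes; both are assumed
     * to have been read from the file footer by the caller.
     */
    static List<FileRange> splitByRowGroups(
            String path, long[] rowGroupOffsets, long[] rowGroupLengths) {
        if (rowGroupOffsets.length != rowGroupLengths.length) {
            throw new IllegalArgumentException("offsets and lengths must have the same size");
        }
        List<FileRange> ranges = new ArrayList<>();
        for (int i = 0; i < rowGroupOffsets.length; i++) {
            ranges.add(new FileRange(path, rowGroupOffsets[i], rowGroupLengths[i]));
        }
        return ranges;
    }

    /**
     * Reader-side check (an assumed convention): a range owns a row group
     * when the row group's first byte lies inside the range.
     */
    static boolean owns(FileRange range, long rowGroupStart) {
        return rowGroupStart >= range.offset && rowGroupStart < range.offset + range.length;
    }

    public static void main(String[] args) {
        // Made-up offsets and sizes, purely for illustration.
        long[] offsets = {4L, 1004L, 2004L};
        long[] lengths = {1000L, 1000L, 1000L};
        for (FileRange range : splitByRowGroups("data-0.parquet", offsets, lengths)) {
            System.out.println(range);
        }
    }
}
```

With a rule like `owns`, each reader can skip every row group that does not start inside its range, so no row is read twice and none is missed as long as the ranges cover the file without overlapping. Formats without row groups would need their own unit of splitting; this sketch does not cover that.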