[Feature] Support generating splits with finer granularity than file level #5012

@Zouxxyy

Description

Motivation

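Splits are currently generated at file granularity, so one large file is read by a single task. For example, a single 4 GB Parquet file that contains 10 row groups could instead be divided into 10 splits, one per row group. This would significantly increase read concurrency.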

A rough implementation path

  • Extend the DataSplit interface so that a split can carry a file-level start offset and length.
  • Then, implement splitting of files by offset and length to generate the corresponding data splits (for instance, Parquet files can be split further by row groups).
  • Finally, make sure the corresponding readers (Parquet, ORC, Avro) can read only the rows that fall within a given file path, start offset, and length.
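
The steps above could be sketched roughly as follows. This is only an illustration of the idea, not a proposed API: `FileRangeSplit`, `RangeSplitSketch`, `splitByRowGroups`, and `shouldRead` are hypothetical names that do not exist in the codebase, and the row group start offsets are assumed to have already been read from the Parquet footer. The ownership rule used here (a row group belongs to the split whose byte range contains its start offset) is one possible convention; a real implementation might pick a different one, as long as every row group is read by exactly one split.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type: a byte range of one data file (not an existing API). */
final class FileRangeSplit {
    final String filePath;
    final long offset; // start offset in bytes within the file
    final long length; // number of bytes covered by this split

    FileRangeSplit(String filePath, long offset, long length) {
        this.filePath = filePath;
        this.offset = offset;
        this.length = length;
    }

    /** True if the byte position falls inside [offset, offset + length). */
    boolean contains(long position) {
        return position >= offset && position < offset + length;
    }
}

/** Hypothetical helper showing split generation and the reader-side check. */
final class RangeSplitSketch {

    /**
     * Generates one split per row group.
     *
     * @param rowGroupOffsets start offset of each row group, sorted ascending
     *                        (assumed to come from the file footer)
     */
    static List<FileRangeSplit> splitByRowGroups(
            String filePath, long fileSize, long[] rowGroupOffsets) {
        List<FileRangeSplit> splits = new ArrayList<>();
        for (int i = 0; i < rowGroupOffsets.length; i++) {
            long start = rowGroupOffsets[i];
            // A split ends where the next row group starts; the last one ends at EOF.
            long end = (i + 1 < rowGroupOffsets.length) ? rowGroupOffsets[i + 1] : fileSize;
            splits.add(new FileRangeSplit(filePath, start, end - start));
        }
        return splits;
    }

    /** Reader side: read a row group only if this split contains its start offset. */
    static boolean shouldRead(FileRangeSplit split, long rowGroupStart) {
        return split.contains(rowGroupStart);
    }
}
```

For a file of 3000 bytes with row groups starting at offsets 4, 1000, and 2000, `splitByRowGroups` yields three splits covering `[4, 1000)`, `[1000, 2000)`, and `[2000, 3000)`. Because the ranges do not overlap, `shouldRead` returns true for exactly one split per row group. ORC stripes or Avro blocks could be handled with the same range type but a format-specific way of finding boundaries.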

Labels: enhancement (New feature or request)
