Runtime-adaptive data representation #12720

Description

@findepi

Originally posted by @andygrove in #11513 (comment)

We are running into the issue of RecordBatches with the same logical type but different physical types in DataFusion Comet. For a single table, a column may be dictionary-encoded in some Parquet files and not in others, so we are forced to cast them all to the same type, which introduces unnecessary dictionary encoding (or decoding) overhead.
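
To make that cast overhead concrete, here is a minimal sketch using the arrow crate's cast kernel; the column values are made up, and the two encodings mirror the dictionary-encoded vs. plain case described above:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // The same logical column (a string "city") arriving with two physical encodings.
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["NYC", "SFO", "NYC"]));
    let dict: ArrayRef = Arc::new(
        vec!["NYC", "SFO", "NYC"]
            .into_iter()
            .collect::<DictionaryArray<Int32Type>>(),
    );

    // Because the physical plan fixes a single DataType for the column, one of the
    // batches has to be cast, paying the decode (or encode) cost described above.
    let unified = cast(&dict, &DataType::Utf8)?;
    assert_eq!(unified.data_type(), plain.data_type());
    Ok(())
}
```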

DataFusion's physical planning result mandates a particular Arrow type (DataType) for each processed column.
This does not reflect the reality of modern systems, though:

  • source data may be naturally representable in different Arrow types (DataTypes), and forcing a single common representation is not efficient
  • adaptive execution of certain operations (like pre-aggregation) would benefit from being able to adjust data processing in response to incoming data characteristics observed at runtime

Example 1:
A plain table scan reading Parquet files. The same column may be represented differently in individual files (plain array vs. RLE/REE vs. dictionary), and forcing a particular data layout on the output of the table scan is not optimal.
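
For illustration, a small sketch of three Arrow DataTypes that could all carry the same logical string column; the column is hypothetical, and the run_ends/values field names follow Arrow's run-end-encoding convention:

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field};

/// Candidate physical representations for one logical string column
/// coming out of a Parquet scan.
fn candidate_types() -> Vec<DataType> {
    vec![
        // Plain variable-length strings.
        DataType::Utf8,
        // Dictionary-encoded strings, e.g. from dictionary-encoded Parquet pages.
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        // Run-end encoded strings (Arrow's REE layout).
        DataType::RunEndEncoded(
            Arc::new(Field::new("run_ends", DataType::Int32, false)),
            Arc::new(Field::new("values", DataType::Utf8, true)),
        ),
    ]
}

fn main() {
    for dt in candidate_types() {
        println!("{dt:?}");
    }
}
```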

Example 2:
A UNION ALL query may combine data from multiple sources, which can naturally produce data in different data types.
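
A minimal sketch of this case against the DataFusion API, assuming two hypothetical in-memory tables t_plain and t_dict that expose the same logical column with different physical types; today the planner settles on a single output DataType for the union and casts the non-matching input:

```rust
use std::sync::Arc;

use arrow::array::{DictionaryArray, StringArray};
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    // Source 1: "city" stored as plain Utf8.
    let plain_schema = Arc::new(Schema::new(vec![Field::new("city", DataType::Utf8, false)]));
    let plain_batch = RecordBatch::try_new(
        plain_schema.clone(),
        vec![Arc::new(StringArray::from(vec!["NYC", "SFO"]))],
    )?;

    // Source 2: the same logical column stored as Dictionary(Int32, Utf8).
    let dict_type = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let dict_schema = Arc::new(Schema::new(vec![Field::new("city", dict_type, false)]));
    let dict_array: DictionaryArray<Int32Type> = vec!["NYC", "SFO"].into_iter().collect();
    let dict_batch = RecordBatch::try_new(dict_schema.clone(), vec![Arc::new(dict_array)])?;

    let ctx = SessionContext::new();
    ctx.register_table(
        "t_plain",
        Arc::new(MemTable::try_new(plain_schema, vec![vec![plain_batch]])?),
    )?;
    ctx.register_table(
        "t_dict",
        Arc::new(MemTable::try_new(dict_schema, vec![vec![dict_batch]])?),
    )?;

    // The planner picks one DataType for "city" across both inputs, inserting a
    // cast on the side that does not match (exact coercion depends on the version).
    let df = ctx
        .sql("SELECT city FROM t_plain UNION ALL SELECT city FROM t_dict")
        .await?;
    df.show().await?;
    Ok(())
}
```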
