Performance issue with range queries over a partitioned table. #811

@ryzhyk

I ran into a performance issue querying an Iceberg table in S3 via the datafusion provider. The table was created using pyiceberg with the following schema:

schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "name", StringType(), required=False),
    NestedField(3, "b", BooleanType(), required=True),
    NestedField(4, "ts", TimestampType(), required=True),
    NestedField(5, "dt", DateType(), required=True),
)

The table is partitioned by date extracted from the ts column:

partition_spec = PartitionSpec(
    PartitionField(
        source_id=4, field_id=1000, transform=DayTransform(), name="date"
    )
)

There are 10,000,000 records in the table spread evenly across ~200 partitions for dates between 2023-01-01 and 2023-08-02.
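For reference, here is a rough sketch of how those numbers work out. It assumes every calendar day in the range has data and that the day transform encodes a timestamp as the number of days since 1970-01-01 (as the Iceberg spec describes). The helper name `day_partition` is made up for illustration and is not a pyiceberg API.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)

def day_partition(ts: datetime) -> int:
    """Hypothetical helper: days since the Unix epoch for a timestamp,
    mirroring what a day transform would produce as the partition value."""
    return (ts.date() - EPOCH).days

# First and last day of data mentioned above.
first = day_partition(datetime(2023, 1, 1))
last = day_partition(datetime(2023, 8, 2, 23, 59, 59))

# Assumption: one partition per calendar day, with no gaps.
num_partitions = last - first + 1
rows_per_partition = 10_000_000 // num_partitions

print(num_partitions, rows_per_partition)
```

Under those assumptions a single partition holds well under 1% of the table, which is why I would expect a one-day range query to be much cheaper than a full scan.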

I query the table with iceberg-rust through its datafusion table provider, using range queries of the form:

select * from my_table where ts >= timestamp '2023-01-05T00:00:00' and ts < timestamp '2023-01-06T00:00:00'

I expect this query to be very efficient, since it only needs to read one partition. In reality, it takes about as long as scanning the entire table with select * from my_table (approximately 10 seconds). It looks like predicate pushdown doesn't work here for some reason.
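To spell out why I expect only one partition to be read, here is a minimal sketch of projecting the ts range onto day partition values. It is not the iceberg-rust implementation. It only illustrates the inclusive projection I would expect from the partition spec, and the names `day_transform` and `project_range` are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def day_transform(ts: datetime) -> int:
    """Days since the Unix epoch (assumes ts is not before 1970)."""
    return (ts - EPOCH).days

def project_range(lower_inclusive: datetime, upper_exclusive: datetime):
    """Map `ts >= lower AND ts < upper` to an inclusive range of day values.

    The exclusive upper bound is moved back by one microsecond so that a
    bound falling exactly on midnight does not pull in the following day.
    """
    lo = day_transform(lower_inclusive)
    hi = day_transform(upper_exclusive - timedelta(microseconds=1))
    return lo, hi

# The range from the query above.
lo, hi = project_range(datetime(2023, 1, 5), datetime(2023, 1, 6))
partitions_to_read = hi - lo + 1

print(lo, hi, partitions_to_read)
```

With this projection the query covers a single day value, so a scan that prunes on the partition field should only need the data files from that one partition.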

Questions:

  • Is this a performance issue in iceberg-rust or am I doing something wrong?
  • Is there a better way to perform this query efficiently?

I am using the latest main branch of this repo.

Thanks in advance!
