Description
I ran into a performance issue querying an Iceberg table in S3 via the datafusion provider. The table was created using pyiceberg with the following schema:
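    from pyiceberg.schema import Schema
    from pyiceberg.types import (
        BooleanType, DateType, LongType, NestedField, StringType, TimestampType,
    )

    # Field IDs 1-5; "ts" (field ID 4) is the source column for the partition spec.
    schema = Schema(
        NestedField(1, "id", LongType(), required=True),
        NestedField(2, "name", StringType(), required=False),
        NestedField(3, "b", BooleanType(), required=True),
        NestedField(4, "ts", TimestampType(), required=True),
        NestedField(5, "dt", DateType(), required=True),
    )

The table is partitioned by the date extracted from the ts column: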
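    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.transforms import DayTransform

    # source_id=4 refers to the "ts" field in the schema.
    partition_spec = PartitionSpec(
        PartitionField(
            source_id=4, field_id=1000, transform=DayTransform(), name="date"
        )
    )

There are 10,000,000 records in the table, spread evenly across ~200 partitions for dates between 2023-01-01 and 2023-08-02.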
I query the table with iceberg-rust through the datafusion table provider, using range queries of the form:
    select * from my_table where ts >= timestamp '2023-01-05T00:00:00' and ts < timestamp '2023-01-06T00:00:00'

I expect this query to be very efficient, since it only needs to read one partition. In practice, it takes about as long as scanning the entire table with select * from my_table (approximately 10 seconds). It looks like predicate pushdown doesn't work here for some reason.
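To make the "one partition" expectation concrete, here is a rough sketch of how the ts range maps onto day partition values. It assumes the day transform yields whole days since 1970-01-01, as the Iceberg spec describes. The helper names day_transform and matching_day_partitions are hypothetical, written only for illustration. This is not code from iceberg-rust or pyiceberg.

```python
from datetime import date, datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01 for a timestamp without time zone (hypothetical helper)."""
    return (ts - EPOCH).days


def matching_day_partitions(lower_inclusive: datetime, upper_exclusive: datetime) -> list[int]:
    """Day partition values that can contain rows with lower <= ts < upper (hypothetical helper)."""
    first = day_transform(lower_inclusive)
    # The upper bound is exclusive, so step back by the smallest timestamp unit (1 microsecond).
    last = day_transform(upper_exclusive - timedelta(microseconds=1))
    return list(range(first, last + 1))


days = matching_day_partitions(datetime(2023, 1, 5), datetime(2023, 1, 6))
partition_dates = [date(1970, 1, 1) + timedelta(days=d) for d in days]
print(partition_dates)
```

Under these assumptions the range covers a single day value, so a scan that prunes on the partition spec would only need the data files for 2023-01-05.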
Questions:
- Is this a performance issue in iceberg-rust, or am I doing something wrong?
- Is there a better way to perform this query efficiently?
I am using the latest main branch of this repo.
Thanks in advance!