refactor: bump crates arrow* and parquet to version 56 #18997

dantengsky · 2025-11-20T09:00:50Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

refactor: bump arrow*/parquet to 56

- upgrade arrow* + parquet crates to v56
- patch iceberg to use arrow 56 as well (refactor: bump arrow to 56, databendlabs/iceberg-rust#4)
- patch orc-rust to use arrow 56 as well (refactor: bump arrow to version 56, datafuse-extras/orc-rust#1)
- patch arrow-udf-runtime to use arrow 56 as well (refactor: bump arrow to 56, datafuse-extras/arrow-udf#1)
- bump arrow* from 55 to 56
- bump tonic from 0.12 to 0.13
- bump pyo3 from 0.24.1 to 0.25
- bump pyo3-build-config from 0.24 to 0.25
- bump pyo3-build-config from 0.24.2 to 0.25
- bump pyo3 from 0.24 to 0.25

misc:

The metadata string that records the Parquet writer version is now
'parquet-rs version 56.2.0', so some test cases that expect an exact
Parquet file size had to be adjusted.

Benchmark Performance Comparison Report

Test Scenario

Item	Details
Deployment	Single-node Databend on AWS r7i.8xlarge with disk cache fully cached
Dataset	TPC-H SF1000 `lineitem` table with dictionary encoding enabled
Methodology	Warm one run, record two hot runs, and average their timings
Workload	Scan each column using `select col from lineitem ignore_result`
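
The methodology above (one warmup run, two recorded hot runs, averaged) can be sketched as a small timing loop. This is a minimal illustration and not the script used for this report: `run_query` is a hypothetical callable standing in for whatever client executes the SQL against Databend, and `hot_average` is an illustrative name.

```python
import time
from statistics import mean
from typing import Callable


def hot_average(run_query: Callable[[str], None], sql: str,
                warmups: int = 1, hot_runs: int = 2) -> float:
    """Run `sql` `warmups` times untimed, then return the mean wall-clock
    time in seconds over `hot_runs` timed executions."""
    for _ in range(warmups):
        run_query(sql)  # warmup: timing is discarded

    timings = []
    for _ in range(hot_runs):
        start = time.perf_counter()
        run_query(sql)
        timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in executor (assumption): sleeps instead of talking to a server.
    def fake_run_query(sql: str) -> None:
        time.sleep(0.01)

    avg = hot_average(fake_run_query,
                      "select l_orderkey from lineitem ignore_result")
    print(f"average hot time: {avg:.3f}s")
```

With only two hot runs per query the average carries no variance estimate, which is why small per-query deltas in this report are best read as indicative rather than conclusive.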

Executive Summary

arrow55 and arrow56 finish the workload in 60.616s vs 60.665s (0.049s difference, 0.08%), which is within run-to-run noise on this setup, so treat them as a statistical tie.
Per-query winners are split (9 vs 7), and every delta stays below 10%.
No measurable regression observed when upgrading to arrow v56.

Test Configuration Comparison

Setting	arrow55	arrow56
Code base	main branch (arrow v55)	PR branch (arrow v56)
Host:Port	127.0.0.1:3307	127.0.0.1:3307
Database	tpch_1000_dict	tpch_1000_dict
SQL File	col_scan.sql	col_scan.sql
Drop Cache	✗	✗
Warmup	✓ (single warmup run)	✓ (single warmup run)
Hot runs used	2 (averaged)	2 (averaged)

Performance Summary

Metric	arrow55	arrow56
Total Queries	16	16
Successful Queries	16	16
Failed Queries	0	0
Total Time (s)	60.616	60.665
Total Time (m)	1.01	1.01
Average Time (s)	3.789	3.792
Success Rate (%)	100.0	100.0

Times reflect the average of the two hot runs taken after the warmup.

Detailed Query Comparison

Query	arrow55 (s)	arrow56 (s)	Difference
Query 1	2.693	2.764	+0.071s (+2.6%)
Query 2	5.502	5.532	+0.030s (+0.5%)
Query 3	4.751	4.737	-0.014s (-0.3%)
Query 4	1.091	1.093	+0.002s (+0.2%)
Query 5	3.120	3.101	-0.019s (-0.6%)
Query 6	6.169	6.146	-0.023s (-0.4%)
Query 7	3.116	3.063	-0.053s (-1.7%)
Query 8	3.038	2.996	-0.042s (-1.4%)
Query 9	2.768	2.890	+0.122s (+4.4%)
Query 10	2.739	2.781	+0.042s (+1.5%)
Query 11	0.868	0.948	+0.080s (+9.2%)
Query 12	0.877	0.961	+0.084s (+9.6%)
Query 13	0.891	0.964	+0.073s (+8.2%)
Query 14	3.425	3.344	-0.081s (-2.4%)
Query 15	2.721	2.480	-0.241s (-8.9%)
Query 16	16.847	16.866	+0.019s (+0.1%)
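
The Difference column can be reproduced from the two timing columns. The sketch below assumes the percentage is taken relative to the arrow55 baseline, which matches the values in the table; `format_delta` is an illustrative helper name, not part of any benchmark tooling.

```python
def format_delta(base_s: float, new_s: float) -> str:
    """Return a 'Difference' cell: absolute delta in seconds followed by
    the delta relative to the baseline, both with an explicit sign."""
    delta = new_s - base_s
    pct = delta / base_s * 100.0
    return f"{delta:+.3f}s ({pct:+.1f}%)"


if __name__ == "__main__":
    print(format_delta(2.693, 2.764))  # Query 1  -> +0.071s (+2.6%)
    print(format_delta(2.721, 2.480))  # Query 15 -> -0.241s (-8.9%)
```

A positive value means arrow56 was slower than arrow55 on that query; a negative value means it was faster.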

Performance Analysis

Overall Performance

arrow55 completes the run in 60.616s while arrow56 records 60.665s, a 0.049s (0.08%) difference that falls within expected measurement noise for this methodology, so treat the builds as statistically tied.
Based on these results there is no measurable regression when moving to arrow v56, so the upgrade is safe to proceed.

Per-query Highlights

arrow55 wins 9 queries, with the largest gaps on Query 12 (+9.6%), Query 11 (+9.2%), and Query 13 (+8.2%).
arrow56 wins 7 queries, with the largest gaps on Query 15 (-8.9%), Query 14 (-2.4%), and Query 7 (-1.7%).
All deltas stay under 10%, and repeating the benchmark may reorder individual query winners.
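
The win split and the largest gaps can be re-derived from the per-query table. The following is a minimal sketch using the rounded timings as published; names such as `TIMINGS` and `pct_change` are illustrative. Note that summing the rounded arrow56 values gives 60.666s rather than the reported 60.665s, presumably because the report totals the unrounded timings.

```python
# (arrow55_s, arrow56_s) per query, copied from the Detailed Query Comparison table.
TIMINGS = {
    1: (2.693, 2.764), 2: (5.502, 5.532), 3: (4.751, 4.737), 4: (1.091, 1.093),
    5: (3.120, 3.101), 6: (6.169, 6.146), 7: (3.116, 3.063), 8: (3.038, 2.996),
    9: (2.768, 2.890), 10: (2.739, 2.781), 11: (0.868, 0.948), 12: (0.877, 0.961),
    13: (0.891, 0.964), 14: (3.425, 3.344), 15: (2.721, 2.480), 16: (16.847, 16.866),
}


def pct_change(base_s: float, new_s: float) -> float:
    """Percentage change of new_s relative to base_s."""
    return (new_s - base_s) / base_s * 100.0


changes = {q: pct_change(a55, a56) for q, (a55, a56) in TIMINGS.items()}

# A positive change means arrow56 was slower, so arrow55 "wins" that query.
arrow55_wins = sorted((q for q, c in changes.items() if c > 0),
                      key=lambda q: changes[q], reverse=True)
arrow56_wins = sorted((q for q, c in changes.items() if c < 0),
                      key=lambda q: changes[q])

total55 = round(sum(a55 for a55, _ in TIMINGS.values()), 3)
total56 = round(sum(a56 for _, a56 in TIMINGS.values()), 3)

if __name__ == "__main__":
    print("arrow55 wins:", len(arrow55_wins), "largest gaps:", arrow55_wins[:3])
    print("arrow56 wins:", len(arrow56_wins), "largest gaps:", arrow56_wins[:3])
    print("totals (s):", total55, total56)
    print("max |delta| %:", round(max(abs(c) for c in changes.values()), 1))
```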

Appendix: SQL Queries Used

col_scan.sql

-- Column scan queries for lineitem table performance testing
-- Each query scans a single column to test column-wise performance

-- Query 1: l_orderkey
select l_orderkey from lineitem ignore_result;

-- Query 2: l_partkey
select l_partkey from lineitem ignore_result;

-- Query 3: l_suppkey
select l_suppkey from lineitem ignore_result;

-- Query 4: l_linenumber
select l_linenumber from lineitem ignore_result;

-- Query 5: l_quantity
select l_quantity from lineitem ignore_result;

-- Query 6: l_extendedprice
select l_extendedprice from lineitem ignore_result;

-- Query 7: l_discount
select l_discount from lineitem ignore_result;

-- Query 8: l_tax
select l_tax from lineitem ignore_result;

-- Query 9: l_returnflag
select l_returnflag from lineitem ignore_result;

-- Query 10: l_linestatus
select l_linestatus from lineitem ignore_result;

-- Query 11: l_shipdate
select l_shipdate from lineitem ignore_result;

-- Query 12: l_commitdate
select l_commitdate from lineitem ignore_result;

-- Query 13: l_receiptdate
select l_receiptdate from lineitem ignore_result;

-- Query 14: l_shipinstruct
select l_shipinstruct from lineitem ignore_result;

-- Query 15: l_shipmode
select l_shipmode from lineitem ignore_result;

-- Query 16: l_comment
select l_comment from lineitem ignore_result;

fixes: Upgrade arrow*/parquet to ≥56 (where Decimal64 is supported) and bump related dependencies (orc, iceberg, deltalake, …), verifying that performance does not regress. #18992

Tests

Unit Test
Logic Test
Benchmark Test
No Test - covered by existing tests

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

dantengsky · 2025-11-21T01:10:00Z

@codex review

chatgpt-codex-connector · 2025-11-21T01:16:21Z

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) Nov 20, 2025

dantengsky force-pushed the refactor/arrow56 branch from 1903b9b to 0726e79 Compare November 20, 2025 09:12

dantengsky force-pushed the refactor/arrow56 branch from 0726e79 to 718fd06 Compare November 20, 2025 09:34

dantengsky added 4 commits November 20, 2025 18:21

fix: tweak logic tests

7b485ae

Meta data that marks the parquet writer version has been changed to 'parquet-rs version 56.2.0', thus some cases that expect exact parquet file size need to be tweaked.

fix: tweak logic tests

0b448aa

tweak logic tests

a4ae319

fix: tweak logic tests

6e9b848

dantengsky marked this pull request as ready for review November 21, 2025 01:08

dantengsky requested review from SkyFan2002, sundy-li and zhyass November 21, 2025 01:08

SkyFan2002 approved these changes Nov 21, 2025

sundy-li approved these changes Nov 21, 2025

BohuTANG merged commit 70b0760 into databendlabs:main Nov 21, 2025
174 of 176 checks passed
