@dantengsky (Member) commented Nov 20, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

refactor: bump arrow*/parquet to 56

misc:

The metadata that records the parquet writer version is now
'parquet-rs version 56.2.0', so some test cases that expect an exact
parquet file size had to be adjusted.

Benchmark Performance Comparison Report

Test Scenario

| Item | Details |
| --- | --- |
| Deployment | Single-node Databend on AWS r7i.8xlarge with the disk cache fully populated |
| Dataset | TPC-H SF1000 lineitem table with dictionary encoding enabled |
| Methodology | One warmup run, then two hot runs; the two hot-run timings are averaged |
| Workload | Scan each column using `select col from lineitem ignore_result` |
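The methodology above (one discarded warmup, two timed hot runs, averaged) can be sketched roughly as follows. This is not the harness used for this report: `hot_run_average`, `run_query`, and the injectable `clock` parameter are hypothetical names chosen for illustration, and the sketch assumes each call to `run_query` executes one column-scan statement to completion.

```python
import time
from statistics import mean
from typing import Callable


def hot_run_average(
    run_query: Callable[[], None],
    warmups: int = 1,
    hot_runs: int = 2,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the mean wall-clock seconds of the hot runs.

    Warmup runs are executed but their timings are discarded, so that
    caches are populated before measurement starts.
    """
    for _ in range(warmups):
        run_query()

    timings = []
    for _ in range(hot_runs):
        start = clock()
        run_query()
        timings.append(clock() - start)
    return mean(timings)
```

In practice `run_query` would wrap whatever client submits a statement such as `select l_orderkey from lineitem ignore_result;` to the server. The `clock` parameter exists only so the function can be exercised with a deterministic fake clock.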

Executive Summary

  • arrow55 and arrow56 finish the workload in 60.616s vs 60.665s (0.049s difference, 0.08%), which is within run-to-run noise on this setup, so treat them as a statistical tie.
  • Per-query winners are split (9 vs 7), and every delta stays below 10%.
  • No measurable regression observed when upgrading to arrow v56.

Test Configuration Comparison

| Setting | arrow55 | arrow56 |
| --- | --- | --- |
| Code base | main branch (arrow v55) | PR branch (arrow v56) |
| Host:Port | 127.0.0.1:3307 | 127.0.0.1:3307 |
| Database | tpch_1000_dict | tpch_1000_dict |
| SQL File | col_scan.sql | col_scan.sql |
| Drop Cache | | |
| Warmup | ✓ (single warmup run) | ✓ (single warmup run) |
| Hot runs used | 2 (averaged) | 2 (averaged) |

Performance Summary

| Metric | arrow55 | arrow56 |
| --- | --- | --- |
| Total Queries | 16 | 16 |
| Successful Queries | 16 | 16 |
| Failed Queries | 0 | 0 |
| Total Time (s) | 60.616 | 60.665 |
| Total Time (m) | 1.01 | 1.01 |
| Average Time (s) | 3.789 | 3.792 |
| Success Rate (%) | 100.0 | 100.0 |

Times reflect the average of the two hot runs taken after the warmup.

Detailed Query Comparison

| Query | arrow55 (s) | arrow56 (s) | Difference |
| --- | --- | --- | --- |
| Query 1 | 2.693 | 2.764 | +0.071s (+2.6%) |
| Query 2 | 5.502 | 5.532 | +0.030s (+0.5%) |
| Query 3 | 4.751 | 4.737 | -0.014s (-0.3%) |
| Query 4 | 1.091 | 1.093 | +0.002s (+0.2%) |
| Query 5 | 3.120 | 3.101 | -0.019s (-0.6%) |
| Query 6 | 6.169 | 6.146 | -0.023s (-0.4%) |
| Query 7 | 3.116 | 3.063 | -0.053s (-1.7%) |
| Query 8 | 3.038 | 2.996 | -0.042s (-1.4%) |
| Query 9 | 2.768 | 2.890 | +0.122s (+4.4%) |
| Query 10 | 2.739 | 2.781 | +0.042s (+1.5%) |
| Query 11 | 0.868 | 0.948 | +0.080s (+9.2%) |
| Query 12 | 0.877 | 0.961 | +0.084s (+9.6%) |
| Query 13 | 0.891 | 0.964 | +0.073s (+8.2%) |
| Query 14 | 3.425 | 3.344 | -0.081s (-2.4%) |
| Query 15 | 2.721 | 2.480 | -0.241s (-8.9%) |
| Query 16 | 16.847 | 16.866 | +0.019s (+0.1%) |
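The Difference column appears to be the arrow56 time minus the arrow55 time, with the percentage taken relative to the arrow55 baseline. A minimal sketch of that arithmetic, where `format_delta` is a hypothetical helper name and the choice of baseline is an assumption inferred from the figures in the table:

```python
def format_delta(arrow55_s: float, arrow56_s: float) -> str:
    """Format the timing change as '+0.071s (+2.6%)'.

    Positive values mean arrow56 was slower than arrow55.
    The percentage is relative to the arrow55 timing (assumed baseline).
    """
    diff = arrow56_s - arrow55_s
    pct = diff / arrow55_s * 100
    return f"{diff:+.3f}s ({pct:+.1f}%)"


print(format_delta(2.693, 2.764))  # Query 1:  +0.071s (+2.6%)
print(format_delta(2.721, 2.480))  # Query 15: -0.241s (-8.9%)
```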

Performance Analysis

Overall Performance

  • arrow55 completes the run in 60.616s while arrow56 records 60.665s, a 0.049s (0.08%) difference that falls within expected measurement noise for this methodology, so treat the builds as statistically tied.
  • Based on these results there is no measurable regression when moving to arrow v56, so the upgrade is safe to proceed.
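The overall figures can be reproduced from the two reported totals. The sketch below also includes a simple tolerance check; the 1% `tolerance_pct` default is a hypothetical threshold chosen for illustration, since the report does not state a numeric noise bound.

```python
ARROW55_TOTAL_S = 60.616  # reported total for the main branch
ARROW56_TOTAL_S = 60.665  # reported total for the PR branch


def relative_difference_pct(baseline_s: float, candidate_s: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate_s - baseline_s) / baseline_s * 100


def within_noise(baseline_s: float, candidate_s: float,
                 tolerance_pct: float = 1.0) -> bool:
    """True if the absolute relative difference is below tolerance_pct.

    tolerance_pct is an assumed threshold, not a value from the report.
    """
    return abs(relative_difference_pct(baseline_s, candidate_s)) < tolerance_pct


diff_s = ARROW56_TOTAL_S - ARROW55_TOTAL_S
pct = relative_difference_pct(ARROW55_TOTAL_S, ARROW56_TOTAL_S)
print(f"{diff_s:.3f}s ({pct:.2f}%)")  # 0.049s (0.08%)
```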

Per-query Highlights

  • arrow55 wins 9 queries, with the largest gaps on Query 12 (+9.6%), Query 11 (+9.2%), and Query 13 (+8.2%).
  • arrow56 wins 7 queries, with the largest gaps on Query 15 (−8.9%), Query 14 (−2.4%), and Query 7 (−1.7%).
  • All deltas stay under 10%, and repeating the benchmark may reorder individual query winners.
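The win counts and largest gaps above can be re-derived from the per-query timings. A minimal sketch, with the timings copied from the Detailed Query Comparison table and percentages assumed to be relative to the arrow55 timing:

```python
# Query number -> (arrow55 seconds, arrow56 seconds), from the table above.
TIMINGS = {
    1: (2.693, 2.764), 2: (5.502, 5.532), 3: (4.751, 4.737),
    4: (1.091, 1.093), 5: (3.120, 3.101), 6: (6.169, 6.146),
    7: (3.116, 3.063), 8: (3.038, 2.996), 9: (2.768, 2.890),
    10: (2.739, 2.781), 11: (0.868, 0.948), 12: (0.877, 0.961),
    13: (0.891, 0.964), 14: (3.425, 3.344), 15: (2.721, 2.480),
    16: (16.847, 16.866),
}


def pct_change(query: int) -> float:
    """Percentage change of arrow56 relative to arrow55 for one query."""
    old, new = TIMINGS[query]
    return (new - old) / old * 100


# A build "wins" a query when its timing is lower.
arrow55_wins = [q for q, (old, new) in TIMINGS.items() if new > old]
arrow56_wins = [q for q, (old, new) in TIMINGS.items() if new < old]

# Largest slowdowns (arrow55 ahead) and largest speedups (arrow56 ahead).
largest_slowdowns = sorted(TIMINGS, key=pct_change, reverse=True)[:3]
largest_speedups = sorted(TIMINGS, key=pct_change)[:3]
max_abs_delta = max(abs(pct_change(q)) for q in TIMINGS)

print(len(arrow55_wins), len(arrow56_wins))  # 9 7
print(largest_slowdowns)                     # [12, 11, 13]
print(largest_speedups)                      # [15, 14, 7]
```

The three largest slowdowns are all date columns (l_commitdate, l_shipdate, l_receiptdate), each with sub-second timings, so a small absolute change produces a comparatively large percentage.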

Appendix: SQL Queries Used

col_scan.sql

-- Column scan queries for lineitem table performance testing
-- Each query scans a single column to test column-wise performance

-- Query 1: l_orderkey
select l_orderkey from lineitem ignore_result;

-- Query 2: l_partkey
select l_partkey from lineitem ignore_result;

-- Query 3: l_suppkey
select l_suppkey from lineitem ignore_result;

-- Query 4: l_linenumber
select l_linenumber from lineitem ignore_result;

-- Query 5: l_quantity
select l_quantity from lineitem ignore_result;

-- Query 6: l_extendedprice
select l_extendedprice from lineitem ignore_result;

-- Query 7: l_discount
select l_discount from lineitem ignore_result;

-- Query 8: l_tax
select l_tax from lineitem ignore_result;

-- Query 9: l_returnflag
select l_returnflag from lineitem ignore_result;

-- Query 10: l_linestatus
select l_linestatus from lineitem ignore_result;

-- Query 11: l_shipdate
select l_shipdate from lineitem ignore_result;

-- Query 12: l_commitdate
select l_commitdate from lineitem ignore_result;

-- Query 13: l_receiptdate
select l_receiptdate from lineitem ignore_result;

-- Query 14: l_shipinstruct
select l_shipinstruct from lineitem ignore_result;

-- Query 15: l_shipmode
select l_shipmode from lineitem ignore_result;

-- Query 16: l_comment
select l_comment from lineitem ignore_result;

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - covered by existing tests

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) Nov 20, 2025
- upgrade arrow* + parquet crates to v56
- patch iceberg to use arrow 56 as well
  databendlabs/iceberg-rust#4
- patch orc-rust to use arrow 56 as well
  datafuse-extras/orc-rust#1
- patch arrow-udf-runtime to use arrow 56 as well
  datafuse-extras/arrow-udf#1
  - bump arrow* from 55 to 56
  - bump tonic from 0.12 to 0.13
  - bump pyo3 from 0.24.1 to 0.25
  - bump pyo3-build-config from 0.24 to 0.25
- bump pyo3-build-config from 0.24.2 to 0.25
- bump pyo3 from 0.24 to 0.25
The metadata that records the parquet writer version is now
'parquet-rs version 56.2.0', so some test cases that expect an exact
parquet file size had to be adjusted.
@dantengsky dantengsky marked this pull request as ready for review November 21, 2025 01:08
@dantengsky (Member Author)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Already looking forward to the next diff.


@BohuTANG BohuTANG merged commit 70b0760 into databendlabs:main Nov 21, 2025
174 of 176 checks passed

Labels

pr-refactor this PR changes the code base without new features or bugfix

Projects

None yet

4 participants