Commit e8384fb

chore: Add TPCDS benchmarks (#19138)
## Which issue does this PR close?

- Closes #.

This PR replaces #18985, which was accidentally rebased.

Also:
- fixes Q30, which referenced a non-existent column
- adds TPCH and TPCDS scripts to compare results between branches, plus documentation

## Rationale for this change

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
1 parent 4c3e3c1 commit e8384fb

11 files changed, +618 -18 lines changed

benchmarks/README.md

Lines changed: 58 additions & 13 deletions
@@ -119,7 +119,6 @@ You can also invoke the helper directly if you need to customise arguments furth
 ./benchmarks/compile_profile.py --profiles dev release --data /path/to/tpch_sf1
 ```
 
-
 ## Benchmark with modified configurations
 
 ### Select join algorithm
@@ -147,6 +146,19 @@ To verify that datafusion picked up your configuration, run the benchmarks with
 
 ## Comparing performance of main and a branch
 
+For TPC-H:
+```shell
+./benchmarks/compare_tpch.sh main mybranch
+```
+
+For TPC-DS:
+To get data in `DATA_DIR` for TPC-DS, follow the instructions printed by `./benchmarks/bench.sh data tpcds`
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/compare_tpcds.sh main mybranch
+```
+
+Alternatively, you can compare manually, following the example below
+
 ```shell
 git checkout main
 
@@ -299,7 +311,6 @@ This will produce output like:
 └──────────────┴──────────────┴──────────────┴───────────────┘
 ```
 
-
 # Benchmark Runner
 
 The `dfbench` program contains subcommands to run the various
@@ -339,24 +350,28 @@ FLAGS:
 ```
 
 # Profiling Memory Stats for each benchmark query
+
 The `mem_profile` program wraps benchmark execution to measure memory usage statistics, such as peak RSS. It runs each benchmark query in a separate subprocess, capturing the child process’s stdout to print structured output.
 
 Subcommands supported by mem_profile are the subset of those in `dfbench`.
-Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, Tpch
+Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, Tpch, TPCDS
 
 Before running benchmarks, `mem_profile` automatically compiles the benchmark binary (`dfbench`) using `cargo build`. Note that the build profile used for `dfbench` is not tied to the profile used for running `mem_profile` itself. We can explicitly specify the desired build profile using the `--bench-profile` option (e.g. release-nonlto). By prebuilding the binary and running each query in a separate process, we can ensure accurate memory statistics.
 
 Currently, `mem_profile` only supports `mimalloc` as the memory allocator, since it relies on `mimalloc`'s API to collect memory statistics.
 
-Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ is also located.
+Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ is also located.
+
+The benchmark subcommand (e.g., `tpch`) and all following arguments are passed directly to `dfbench`. Be sure to specify `--bench-profile` before the benchmark subcommand.
 
-The benchmark subcommand (e.g., `tpch`) and all following arguments are passed directly to `dfbench`. Be sure to specify `--bench-profile` before the benchmark subcommand.
+Example:
 
-Example:
 ```shell
 datafusion$ cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet
 ```
+
 Example Output:
+
 ```
 Query Time (ms) Peak RSS Peak Commit Major Page Faults
 ----------------------------------------------------------------
@@ -385,19 +400,21 @@ Query Time (ms) Peak RSS Peak Commit Major Page Faults
 ```
 
 ## Reported Metrics
+
 When running benchmarks, `mem_profile` collects several memory-related statistics using the mimalloc API:
 
-- Peak RSS (Resident Set Size):
-The maximum amount of physical memory used by the process.
-This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.
+- Peak RSS (Resident Set Size):
+The maximum amount of physical memory used by the process.
+This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.
 
 - Peak Commit:
-The peak amount of memory committed by the allocator (i.e., total virtual memory reserved).
-This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.
+The peak amount of memory committed by the allocator (i.e., total virtual memory reserved).
+This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.
 
 - Major Page Faults:
-The number of major page faults triggered during execution.
-This metric is obtained from the operating system and is not mimalloc-specific.
+The number of major page faults triggered during execution.
+This metric is obtained from the operating system and is not mimalloc-specific.
+
 # Writing a new benchmark
 
 ## Creating or downloading data outside of the benchmark
@@ -586,6 +603,34 @@ This benchmarks is derived from the [TPC-H][1] version
 [2]: https://github.com/databricks/tpch-dbgen.git,
 [2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
 
+## TPCDS
+
+Run the tpcds benchmark.
+
+For data, clone the `datafusion-benchmarks` repo, which contains pre-generated Parquet data at SF1.
+
+```shell
+git clone https://github.com/apache/datafusion-benchmarks
+```
+
+Then run the benchmark with the following command:
+
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds
+```
+
+Alternatively, benchmark a specific query:
+
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds 30
+```
+
+More help:
+
+```shell
+cargo run --release --bin dfbench -- tpcds --help
+```
+
 ## External Aggregation
 
 Run the benchmark for aggregations with limited memory.

benchmarks/bench.sh

Lines changed: 59 additions & 0 deletions
@@ -87,6 +87,9 @@ tpch10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB),
 tpch_csv10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single csv file per table, hash join
 tpch_mem10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
 
+# TPC-DS Benchmarks
+tpcds: TPCDS inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
+
 # Extended TPC-H Benchmarks
 sort_tpch: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=1)
 sort_tpch10: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
@@ -220,6 +223,9 @@ main() {
         tpch_csv10)
             data_tpch "10" "csv"
             ;;
+        tpcds)
+            data_tpcds
+            ;;
         clickbench_1)
             data_clickbench_1
             ;;
@@ -388,6 +394,7 @@ main() {
             run_external_aggr
             run_nlj
             run_hj
+            run_tpcds
             ;;
         tpch)
             run_tpch "1" "parquet"
@@ -407,6 +414,9 @@ main() {
         tpch_mem10)
             run_tpch_mem "10"
             ;;
+        tpcds)
+            run_tpcds
+            ;;
         cancellation)
             run_cancellation
             ;;
@@ -601,6 +611,24 @@ data_tpch() {
     exit 1
 }
 
+# Points to TPCDS data generation instructions
+data_tpcds() {
+    TPCDS_DIR="${DATA_DIR}"
+
+    # Check if TPCDS data directory exists
+    if [ ! -d "${TPCDS_DIR}" ]; then
+        echo ""
+        echo "For TPC-DS data generation, please clone the datafusion-benchmarks repository:"
+        echo " git clone https://github.com/apache/datafusion-benchmarks"
+        echo ""
+        return 1
+    fi
+
+    echo ""
+    echo "TPC-DS data already exists in ${TPCDS_DIR}"
+    echo ""
+}
+
 # Runs the tpch benchmark
 run_tpch() {
     SCALE_FACTOR=$1
@@ -634,6 +662,37 @@ run_tpch_mem() {
     debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG}
 }
 
+# Runs the tpcds benchmark
+run_tpcds() {
+    TPCDS_DIR="${DATA_DIR}"
+
+    # Check if TPCDS data directory exists
+    if [ ! -d "${TPCDS_DIR}" ]; then
+        echo "Error: TPC-DS data directory does not exist: ${TPCDS_DIR}" >&2
+        echo "" >&2
+        echo "Please prepare TPC-DS data first by following instructions:" >&2
+        echo " ./bench.sh data tpcds" >&2
+        echo "" >&2
+        exit 1
+    fi
+
+    # Check if directory contains parquet files
+    if ! find "${TPCDS_DIR}" -name "*.parquet" -print -quit | grep -q .; then
+        echo "Error: TPC-DS data directory exists but contains no parquet files: ${TPCDS_DIR}" >&2
+        echo "" >&2
+        echo "Please prepare TPC-DS data first by following instructions:" >&2
+        echo " ./bench.sh data tpcds" >&2
+        echo "" >&2
+        exit 1
+    fi
+
+    RESULTS_FILE="${RESULTS_DIR}/tpcds_sf1.json"
+    echo "RESULTS_FILE: ${RESULTS_FILE}"
+    echo "Running tpcds benchmark..."
+
+    debug_run $CARGO_COMMAND --bin dfbench -- tpcds --iterations 5 --path "${TPCDS_DIR}" --query_path "../datafusion/core/tests/tpc-ds" --prefer_hash_join "${PREFER_HASH_JOIN}" -o "${RESULTS_FILE}" ${QUERY_ARG}
+}
+
 # Runs the compile profile benchmark helper
 run_compile_profile() {
     local profiles=("$@")
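
The parquet check added to `run_tpcds` above relies on a small idiom: `find ... -print -quit` stops at the first match, and `grep -q .` succeeds only if something was printed. A minimal standalone sketch of that idiom follows. The helper name `has_parquet_files` and the throwaway directory are assumptions for illustration and are not part of `bench.sh`; `-quit` is available in GNU and BSD `find` but is not POSIX.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bench.sh): succeeds if the given
# directory contains at least one *.parquet file anywhere below it.
has_parquet_files() {
    local dir="$1"
    # -print -quit makes find stop after the first match;
    # grep -q . succeeds only if find printed at least one character.
    find "${dir}" -name "*.parquet" -print -quit | grep -q .
}

# Demonstration on a throwaway directory
demo_dir=$(mktemp -d)
has_parquet_files "${demo_dir}" && echo "found" || echo "empty"
touch "${demo_dir}/store_sales.parquet"
has_parquet_files "${demo_dir}" && echo "found" || echo "empty"
rm -rf "${demo_dir}"
```

The first call reports `empty` and the second reports `found`. Stopping at the first match keeps the check cheap even when the data directory holds many files.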

benchmarks/compare_tpcds.sh

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Compare TPC-DS benchmarks between two branches
+
+set -e
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+usage() {
+    echo "Usage: $0 <branch1> <branch2>"
+    echo ""
+    echo "Example: $0 main dev2"
+    echo ""
+    echo "Note: set DATA_DIR to the TPC-DS data directory before running"
+    exit 1
+}
+
+BRANCH1=${1:-""}
+BRANCH2=${2:-""}
+
+if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
+    usage
+fi
+
+# Store current branch
+CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
+
+echo "Comparing TPC-DS benchmarks: ${BRANCH1} vs ${BRANCH2}"
+
+# Run benchmark on first branch
+git checkout "$BRANCH1"
+./benchmarks/bench.sh run tpcds
+
+# Run benchmark on second branch
+git checkout "$BRANCH2"
+./benchmarks/bench.sh run tpcds
+
+# Compare results
+./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"
+
+# Return to original branch
+git checkout "$CURRENT_BRANCH"
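
Because the script above runs under `set -e`, a failing benchmark run ends the script before the final `git checkout "$CURRENT_BRANCH"`, leaving the working tree on one of the compared branches. One possible way to guard against that is an `EXIT` trap. The sketch below is an illustration under stated assumptions, not what the commit does: the function name `compare_branches` and its calling convention (two branches followed by the command to run on each) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run a command on two branches and always return to the
# starting branch, even when the command fails.
# `compare_branches` is a hypothetical name, not part of the commit.
compare_branches() (
    branch1="$1"
    branch2="$2"
    shift 2
    original=$(git rev-parse --abbrev-ref HEAD) || exit 1
    # The body runs in a subshell, so this EXIT trap fires when the
    # subshell ends, whether it finished normally or bailed out early.
    trap 'git checkout --quiet "${original}"' EXIT
    for branch in "${branch1}" "${branch2}"; do
        git checkout --quiet "${branch}" || exit 1
        # Remaining arguments are the command to run on each branch,
        # e.g. ./benchmarks/bench.sh run tpcds
        "$@" || exit 1
    done
)

# Example invocation (commented out so the sketch has no side effects):
# compare_branches main mybranch ./benchmarks/bench.sh run tpcds
```

The parenthesised function body keeps the trap and variables local to one call, so the caller's own traps are left untouched. Like the original script, this records the branch name with `git rev-parse --abbrev-ref HEAD`, so it does not restore a detached `HEAD` to its exact commit.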

benchmarks/compare_tpch.sh

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Compare TPC-H benchmarks between two branches
+
+set -e
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+usage() {
+    echo "Usage: $0 <branch1> <branch2>"
+    echo ""
+    echo "Example: $0 main dev2"
+    exit 1
+}
+
+BRANCH1=${1:-""}
+BRANCH2=${2:-""}
+
+if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
+    usage
+fi
+
+# Store current branch
+CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
+
+echo "Comparing TPC-H benchmarks: ${BRANCH1} vs ${BRANCH2}"
+
+# Run benchmark on first branch
+git checkout "$BRANCH1"
+./benchmarks/bench.sh run tpch
+
+# Run benchmark on second branch
+git checkout "$BRANCH2"
+./benchmarks/bench.sh run tpch
+
+# Compare results
+./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"
+
+# Return to original branch
+git checkout "$CURRENT_BRANCH"

benchmarks/src/bin/dfbench.rs

Lines changed: 3 additions & 1 deletion
@@ -34,7 +34,7 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
 static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;
 
 use datafusion_benchmarks::{
-    cancellation, clickbench, h2o, hj, imdb, nlj, sort_tpch, tpch,
+    cancellation, clickbench, h2o, hj, imdb, nlj, sort_tpch, tpcds, tpch,
 };
 
 #[derive(Debug, StructOpt)]
@@ -48,6 +48,7 @@ enum Options {
     Nlj(nlj::RunOpt),
     SortTpch(sort_tpch::RunOpt),
     Tpch(tpch::RunOpt),
+    Tpcds(tpcds::RunOpt),
 }
 
 // Main benchmark runner entrypoint
@@ -64,5 +65,6 @@ pub async fn main() -> Result<()> {
         Options::Nlj(opt) => opt.run().await,
         Options::SortTpch(opt) => opt.run().await,
         Options::Tpch(opt) => Box::pin(opt.run()).await,
+        Options::Tpcds(opt) => Box::pin(opt.run()).await,
     }
 }

benchmarks/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -23,5 +23,6 @@ pub mod hj;
 pub mod imdb;
 pub mod nlj;
 pub mod sort_tpch;
+pub mod tpcds;
 pub mod tpch;
 pub mod util;
