-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Describe the bug
The join order chosen for TPCH query 17 is bad making datafusion take 50% longer to execute the query.
To Reproduce
Step 1: Create data:
cd arrow-datafusion/benchmarks
./bench.sh data tpch10
Step 2: Run query with datafusion-cli:
cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem, part where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );"
Takes 7.52 seconds
+-------------------+
| avg_yearly |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 7.525 seconds.
However, if we change the query slightly (swap the table order) it is much faster (4.5 seconds)
cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select sum(l_extendedprice) / 7.0 as avg_yearly from part,lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );"
DataFusion CLI v32.0.0
+-------------------+
| avg_yearly |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 4.560 seconds.
Here is the difference:
select
sum(l_extendedprice) / 7.0 as avg_yearly from
- lineitem, part
+ part, lineitem
where
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container = 'MED BOX'
and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );
Expected behavior
DataFusion should pick the correct join order
Analysis
What is going on? The answer is in the order of the joins. Here is the plan DataFusion makes for Q17, annotated from output row counts (I used the output of EXPLAIN ANALYZE
):
DataFusion TPCH SF10 Q17 Plan
annotated with output row counts
1 Row
┌────────────────┐
│ Aggregate │
│ │
This join is building a 60M row hash │ │
table and probing 2M rows, rather └────────────────┘
than building a 2M row hash table │
and probing a 60M row table ▼ 2M Rows
│ ┌────────────────┐
│ HashJoin (1) │
│ │ p_partkey = │
│ l_partkey │
│ 2044 Rows └───┬────────┬───┘
│ │ 2M Rows
│ ┌────────────────┘ └──────────────┐
│ │
│ ▼ ▼
┌────────────────┐ ┌────────────────┐
│ │ HashJoin (2) │ │ Aggregate │
─ ─ ─ ▶│ p_partkey = │ │ │
│ l_partkey │ │ │
└───┬────────┬───┘ └────────────────┘
60M Rows │ │ 2M Rows │
┌─────────┘ └──────────┐ │
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│Scan: lineitem │ │Scan: part │ │Scan: part │
│(no filters) │ │p_brand = .. │ │(no filters) │
│ │ │p_container = .. │ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Background: DataFusion Joins (I will also add this as documentation to datafusion)
Why does the order matter so much? To understand it fully, we need to understand how Hash Joins in DataFusion work.
The HashJoin operator in DataFusion takes two inputs:
┌───────────┐
│ HashJoin │
│ │
└───────────┘
│ │
┌─────┘ └─────┐
▼ ▼
┌────────────┐ ┌─────────────┐
│ Input │ │ Input │
│ [0] │ │ [1] │
└────────────┘ └─────────────┘
"build side" "probe side"
Execution proceeds in 2 stages:
┌────────────────┐ ┌────────────────┐
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Hash │ │ │ │ Hash │ │
│ │ Table │ │ │ │ Table │ │
│ │(keys are│ │ │ │(keys are│ │
│ │equi join│ │ │ │equi join│ │ Stage 2: batches from
Stage 1: the │ │columns) │ │ │ │columns) │ │ the probe side are
*entire* build │ │ │ │ │ │ │ │ streamed through, and
side is read │ └─────────┘ │ │ └─────────┘ │ checked against the
into the hash │ ▲ │ │ ▲ │ contents of the hash
table │ HashJoin │ │ HashJoin │ table
└──────┼─────────┘ └──────────┼─────┘
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
│ │
│ │
┌────────────┐ ┌────────────┐
│RecordBatch │ │RecordBatch │
└────────────┘ └────────────┘
┌────────────┐ ┌────────────┐
│RecordBatch │ │RecordBatch │
└────────────┘ └────────────┘
... ...
┌────────────┐ ┌────────────┐
│RecordBatch │ │RecordBatch │
└────────────┘ └────────────┘
build side probe side
This asymmetry in behavior has the important consequence that it is very important that the smaller side is hashed
So this means in a classic "Star Schema Query", the optimal plan will be a "Right Deep Tree" , where there is one large table and several smaller "dimension" tables, with predicates. The optimal DataFusion will put this large table as the probe side on the lowest join:
┌───────────┐
│ HashJoin │
│ │
└───────────┘
│ │
┌───────┘ └──────────┐
▼ ▼
┌───────────────┐ ┌───────────┐
│ small table 3 │ │ HashJoin │
│ "dimension" │ │ │
└───────────────┘ └───┬───┬───┘
┌──────────┘ └───────┐
│ │
▼ ▼
┌───────────────┐ ┌───────────┐
│ small table 2 │ │ HashJoin │
│ "dimension" │ │ │
└───────────────┘ └───┬───┬───┘
┌────────┘ └────────┐
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ small table 1 │ │ large table │
│ "dimension" │ │ "fact" │
└───────────────┘ └───────────────┘
Additional context
This is likely one of the root causes of #5646