Skip to content

Bad Join Order for TPCH Q17 results in slow performance #7949

@alamb

Description

@alamb

Describe the bug

The join order chosen for TPCH query 17 is bad making datafusion take 50% longer to execute the query.

To Reproduce

Step 1: Create data:

cd arrow-datafusion/benchmarks
./bench.sh  data tpch10

Step 2: Run query with datafusion-cli:

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	sum(l_extendedprice) / 7.0 as avg_yearly	from	lineitem,	part	where	p_partkey = l_partkey	and p_brand = 'Brand#23'	and p_container = 'MED BOX'	and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem	where	l_partkey = p_partkey	);"

Takes 7.52 seconds

+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 7.525 seconds.

However, if we change the query slightly (swap the table order) it is much faster (4.5 seconds)

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select   sum(l_extendedprice) / 7.0 as avg_yearly        from    part,lineitem   where   p_partkey = l_partkey   and p_brand = 'Brand#23'        and p_container = 'MED BOX' and l_quantity < (      select  0.2 * avg(l_quantity)   from    lineitem        where   l_partkey = p_partkey   );"
DataFusion CLI v32.0.0
+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 4.560 seconds.

Here is the difference:

select
sum(l_extendedprice) / 7.0 as avg_yearly from
- lineitem, part
+ part, lineitem
where
  p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem where	l_partkey = p_partkey	);

Expected behavior

DataFusion should pick the correct join order

Analysis

What is going on? The answer is in the order of the joins. Here is the plan DataFusion makes for Q17, annotated from output row counts (I used the output of EXPLAIN ANALYZE):

                                       DataFusion TPCH SF10 Q17 Plan                  
                                      annotated with output row counts                
                                                                                      
                                                                                      
                                                                                      
                                                            1 Row                     
                                                                                      
                                                ┌────────────────┐                    
                                                │   Aggregate    │                    
                                                │                │                    
This join is building a 60M row hash            │                │                    
 table and probing 2M rows, rather              └────────────────┘                    
 than building a 2M row hash table                       │                            
    and probing a 60M row table                          ▼        2M Rows             
                  │                             ┌────────────────┐                    
                                                │  HashJoin (1)  │                    
                  │                             │   p_partkey =  │                    
                                                │   l_partkey    │                    
                  │         2044 Rows           └───┬────────┬───┘                    
                                                    │        │            2M Rows     
                  │                ┌────────────────┘        └──────────────┐         
                                   │                                        │         
                  │                ▼                                        ▼         
                          ┌────────────────┐                       ┌────────────────┐ 
                  │       │  HashJoin (2)  │                       │   Aggregate    │ 
                   ─ ─ ─ ▶│   p_partkey =  │                       │                │ 
                          │   l_partkey    │                       │                │ 
                          └───┬────────┬───┘                       └────────────────┘ 
           60M Rows           │        │        2M Rows                     │         
                    ┌─────────┘        └──────────┐                         │         
                    │                             │                         │         
                    ▼                             ▼                         ▼         
          ┌──────────────────┐          ┌──────────────────┐      ┌──────────────────┐
          │Scan: lineitem    │          │Scan: part        │      │Scan: part        │
          │(no filters)      │          │p_brand = ..      │      │(no filters)      │
          │                  │          │p_container = ..  │      │                  │
          └──────────────────┘          └──────────────────┘      └──────────────────┘

Background: DataFusion Joins (I will also add this as documentation to datafusion)

Why does the order matter so much? To understand it fully, we need to understand how Hash Joins in DataFusion work.

The HashJoin operator in DataFusion takes two inputs:

         ┌───────────┐         
         │ HashJoin  │         
         │           │         
         └───────────┘         
             │   │             
       ┌─────┘   └─────┐       
       ▼               ▼       
┌────────────┐  ┌─────────────┐
│   Input    │  │    Input    │
│    [0]     │  │     [1]     │
└────────────┘  └─────────────┘
                               
 "build side"    "probe side"  

Execution proceeds in 2 stages:

                ┌────────────────┐          ┌────────────────┐                        
                │ ┌─────────┐    │          │ ┌─────────┐    │                        
                │ │  Hash   │    │          │ │  Hash   │    │                        
                │ │  Table  │    │          │ │  Table  │    │                        
                │ │(keys are│    │          │ │(keys are│    │                        
                │ │equi join│    │          │ │equi join│    │  Stage 2: batches from 
 Stage 1: the   │ │columns) │    │          │ │columns) │    │    the probe side are  
*entire* build  │ │         │    │          │ │         │    │  streamed through, and 
 side is read   │ └─────────┘    │          │ └─────────┘    │   checked against the  
into the hash   │      ▲         │          │          ▲     │   contents of the hash 
    table       │       HashJoin │          │  HashJoin      │          table         
                └──────┼─────────┘          └──────────┼─────┘                        
            ─ ─ ─ ─ ─ ─                                 ─ ─ ─ ─ ─ ─ ─                 
           │                                                         │                
                                                                                      
           │                                                         │                
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
          ...                                                       ...               
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
                                                                                      
       build side                                                probe side           
                                                                                      

This asymmetry in behavior has the important consequence that it is very important that the smaller side is hashed

So this means in a classic "Star Schema Query", the optimal plan will be a "Right Deep Tree" , where there is one large table and several smaller "dimension" tables, with predicates. The optimal DataFusion will put this large table as the probe side on the lowest join:

            ┌───────────┐                                    
            │ HashJoin  │                                    
            │           │                                    
            └───────────┘                                    
                │   │                                        
        ┌───────┘   └──────────┐                             
        ▼                      ▼                             
┌───────────────┐        ┌───────────┐                       
│ small table 3 │        │ HashJoin  │                       
│  "dimension"  │        │           │                       
└───────────────┘        └───┬───┬───┘                       
                  ┌──────────┘   └───────┐                   
                  │                      │                   
                  ▼                      ▼                   
          ┌───────────────┐        ┌───────────┐             
          │ small table 2 │        │ HashJoin  │             
          │  "dimension"  │        │           │             
          └───────────────┘        └───┬───┬───┘             
                              ┌────────┘   └────────┐        
                              │                     │        
                              ▼                     ▼        
                      ┌───────────────┐     ┌───────────────┐
                      │ small table 1 │     │  large table  │
                      │  "dimension"  │     │    "fact"     │
                      └───────────────┘     └───────────────┘

Additional context

This is likely one of the root causes of #5646

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions