Bad Join Order for TPCH Q17 results in slow performance

### Describe the bug

The join order chosen for TPCH query 17 is bad making datafusion take 50% longer to execute the query. 



### To Reproduce


#### Step 1: Create data:

```shell
cd arrow-datafusion/benchmarks
./bench.sh  data tpch10
```

#### Step 2: Run query with datafusion-cli:

```shell
cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	sum(l_extendedprice) / 7.0 as avg_yearly	from	lineitem,	part	where	p_partkey = l_partkey	and p_brand = 'Brand#23'	and p_container = 'MED BOX'	and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem	where	l_partkey = p_partkey	);"
```

Takes 7.52 seconds
```
+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 7.525 seconds.
```

However, if we change the query slightly (swap the table order) it is much faster (4.5 seconds)

```shell
cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select   sum(l_extendedprice) / 7.0 as avg_yearly        from    part,lineitem   where   p_partkey = l_partkey   and p_brand = 'Brand#23'        and p_container = 'MED BOX' and l_quantity < (      select  0.2 * avg(l_quantity)   from    lineitem        where   l_partkey = p_partkey   );"
```

```
DataFusion CLI v32.0.0
+-------------------+
| avg_yearly        |
+-------------------+
| 3295493.512857143 |
+-------------------+
1 row in set. Query took 4.560 seconds.
```

Here is the difference:
```diff
select
sum(l_extendedprice) / 7.0 as avg_yearly from
- lineitem, part
+ part, lineitem
where
  p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (	select	0.2 * avg(l_quantity)	from	lineitem where	l_partkey = p_partkey	);
```


### Expected behavior

DataFusion should pick the correct join order

### Analysis

What is going on? The answer is in the order of the joins. Here is the plan DataFusion makes for Q17, annotated from output row counts (I used the output of `EXPLAIN ANALYZE`):

```
                                       DataFusion TPCH SF10 Q17 Plan                  
                                      annotated with output row counts                
                                                                                      
                                                                                      
                                                                                      
                                                            1 Row                     
                                                                                      
                                                ┌────────────────┐                    
                                                │   Aggregate    │                    
                                                │                │                    
This join is building a 60M row hash            │                │                    
 table and probing 2M rows, rather              └────────────────┘                    
 than building a 2M row hash table                       │                            
    and probing a 60M row table                          ▼        2M Rows             
                  │                             ┌────────────────┐                    
                                                │  HashJoin (1)  │                    
                  │                             │   p_partkey =  │                    
                                                │   l_partkey    │                    
                  │         2044 Rows           └───┬────────┬───┘                    
                                                    │        │            2M Rows     
                  │                ┌────────────────┘        └──────────────┐         
                                   │                                        │         
                  │                ▼                                        ▼         
                          ┌────────────────┐                       ┌────────────────┐ 
                  │       │  HashJoin (2)  │                       │   Aggregate    │ 
                   ─ ─ ─ ▶│   p_partkey =  │                       │                │ 
                          │   l_partkey    │                       │                │ 
                          └───┬────────┬───┘                       └────────────────┘ 
           60M Rows           │        │        2M Rows                     │         
                    ┌─────────┘        └──────────┐                         │         
                    │                             │                         │         
                    ▼                             ▼                         ▼         
          ┌──────────────────┐          ┌──────────────────┐      ┌──────────────────┐
          │Scan: lineitem    │          │Scan: part        │      │Scan: part        │
          │(no filters)      │          │p_brand = ..      │      │(no filters)      │
          │                  │          │p_container = ..  │      │                  │
          └──────────────────┘          └──────────────────┘      └──────────────────┘
```


#### Background: DataFusion Joins (I will also add this as documentation to datafusion)

Why does the order matter so much? To understand it fully, we need to understand how Hash Joins in DataFusion work. 


The HashJoin operator in DataFusion takes two inputs:

```text
         ┌───────────┐         
         │ HashJoin  │         
         │           │         
         └───────────┘         
             │   │             
       ┌─────┘   └─────┐       
       ▼               ▼       
┌────────────┐  ┌─────────────┐
│   Input    │  │    Input    │
│    [0]     │  │     [1]     │
└────────────┘  └─────────────┘
                               
 "build side"    "probe side"  
```

Execution proceeds in 2 stages:

```text
                ┌────────────────┐          ┌────────────────┐                        
                │ ┌─────────┐    │          │ ┌─────────┐    │                        
                │ │  Hash   │    │          │ │  Hash   │    │                        
                │ │  Table  │    │          │ │  Table  │    │                        
                │ │(keys are│    │          │ │(keys are│    │                        
                │ │equi join│    │          │ │equi join│    │  Stage 2: batches from 
 Stage 1: the   │ │columns) │    │          │ │columns) │    │    the probe side are  
*entire* build  │ │         │    │          │ │         │    │  streamed through, and 
 side is read   │ └─────────┘    │          │ └─────────┘    │   checked against the  
into the hash   │      ▲         │          │          ▲     │   contents of the hash 
    table       │       HashJoin │          │  HashJoin      │          table         
                └──────┼─────────┘          └──────────┼─────┘                        
            ─ ─ ─ ─ ─ ─                                 ─ ─ ─ ─ ─ ─ ─                 
           │                                                         │                
                                                                                      
           │                                                         │                
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
          ...                                                       ...               
    ┌────────────┐                                            ┌────────────┐          
    │RecordBatch │                                            │RecordBatch │          
    └────────────┘                                            └────────────┘          
                                                                                      
       build side                                                probe side           
                                                                                      
```

This asymmetry in behavior has the important consequence that  it is very important that the *smaller* side is hashed

So this means in a classic "Star Schema Query", the optimal plan will be a **"Right Deep Tree"** , where there is one large table and several smaller "dimension" tables, with predicates. The optimal DataFusion will put this large table as the probe side on the lowest join:

```text
            ┌───────────┐                                    
            │ HashJoin  │                                    
            │           │                                    
            └───────────┘                                    
                │   │                                        
        ┌───────┘   └──────────┐                             
        ▼                      ▼                             
┌───────────────┐        ┌───────────┐                       
│ small table 3 │        │ HashJoin  │                       
│  "dimension"  │        │           │                       
└───────────────┘        └───┬───┬───┘                       
                  ┌──────────┘   └───────┐                   
                  │                      │                   
                  ▼                      ▼                   
          ┌───────────────┐        ┌───────────┐             
          │ small table 2 │        │ HashJoin  │             
          │  "dimension"  │        │           │             
          └───────────────┘        └───┬───┬───┘             
                              ┌────────┘   └────────┐        
                              │                     │        
                              ▼                     ▼        
                      ┌───────────────┐     ┌───────────────┐
                      │ small table 1 │     │  large table  │
                      │  "dimension"  │     │    "fact"     │
                      └───────────────┘     └───────────────┘
```



### Additional context

This is likely one of the root causes of https://github.com/apache/arrow-datafusion/issues/5646

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bad Join Order for TPCH Q17 results in slow performance #7949

Describe the bug

To Reproduce

Step 1: Create data:

Step 2: Run query with datafusion-cli:

Expected behavior

Analysis

Background: DataFusion Joins (I will also add this as documentation to datafusion)

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bad Join Order for TPCH Q17 results in slow performance #7949

Description

Describe the bug

To Reproduce

Step 1: Create data:

Step 2: Run query with datafusion-cli:

Expected behavior

Analysis

Background: DataFusion Joins (I will also add this as documentation to datafusion)

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions