-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
TLDR
- This is feedback from CMU on how to make DataFusion more compelling for Academic work.
- Does anyone have suggestions about additional DataFusion projects to suggest (see below)?
Background
CMU's DB group is running CMU-799 OSpecial Topics in Databases: Query Optimization this Spring 2025.
This course ... covers the classical and state-of-the-art methods and algorithms for converting SQL statements into physical query plans. Additional topics include cost models, feedback mechanisms, and adaptive query optimization. All class projects will be in the context of an open-source query optimizer service using real-world queries. The course is appropriate for graduate students in software systems and advanced undergraduates with nasty programming skills that are pursuing a database-centric lifestyle.
I would very much like to encourage a virtuous cycle of research/academic work on DataFusion --> contributing back and thus want to make DataFusion a compelling target for this type of class. I think the more use of DataFusion the more contributions we will attract and thus the better it will get for everyone.
Andy Pavlo connected me with @lmwnshn, who is helping organize the class, and provided the following feedback
Class Projects
@lmwnshn noted that the class includes a project (see Syllabus here):
The main component of this course will be the group programming project. Students will organize into groups and to implement a large system / prototype. The projects are designed to be (1) relevant to the materials discussed in class and (2) require a significant programming effort from all team members.
- Release Date: Feb 17, 2025
- Due Date: May 01, 2025
They were interested in working on some projects in DataFusion. Here were some ideas I had that I think might be good but would love to hear what other people think is important. This is also a great opportunity for others to get exposed to advanced techniques and learn how to work on large open source projects like DataFusion (obviously also how great our community is):
- Dynamic Filter Pushdown: Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) #7955
- LATERAL JOINs: Feature request: Support for lateral joins #10048
- Range Joins / ASOF joins: ASOF join support / Specialize Range Joins #318
Ask: Does anyone else know of other interesting potential projects we can suggest??
Improving the Ease of Student Onboarding
@lmwnshn mentioned a few reasons he chose DuckDB over DataFusion for the first project(link), that were related to easier student onboarding
"Speed to running TPCH"
TPCH is an important benchmark for analytic systems. The ease of getting to the point of running TPCH is important
The DuckDB Experience:
- Download the duckdb binary
- Run commands
INSTALL tpch; LOAD tpch; CALL dbgen(sf = 1);
The DataFusion Experience:
- install Rust,
- install datafusion-cli,
- give them background knowledge about TPC-H
- point them to bench.sh (~750 LOC)
Ideas to improve the DataFusion experience:
- Add built in
dbgen
function todatafusion-cli
: Make it easier to run TPCH queries with datafusion-cli #14608 - Add more Pre-built binaries for
datafusion-cli
(in addition to homebrew) - Improve getting started instructions to make the process easier / more copy/paste
Easier / Better Optimizer Configuration
DuckDB has a dedicated optimizer section, easy to selectively disable optimization, see the duckdb_optimizers() function
DataFusion has many prefixed (datafusion.optimizer
settings)[https://datafusion.apache.org/user-guide/configs.html but it is not clear how these can enable/disable optimizer passes.
Ideas to improve the DataFusion experience:
- Make it easier to disable just the optimizers from datafusion-cli (I think this would require making a better distinction between passes required to run like
EnforceDistribution
and optimizations like `PushDownProjection). I will file a ticket / find one
Better explain plan
EXPLAIN visualization. I think DuckDB's is easier for students to look at.
Compare: DuckDB
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
...
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ k │
│ length(Referer) │
│ Referer │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (Referer != '') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 99997497 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Referer │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 99997497 │
└───────────────────────────┘
DataFusion
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: skip=0, fetch=25 |
| | Sort: l DESC NULLS FIRST, fetch=25 |
| | Projection: regexp_replace(hits.parquet.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1")) AS k, AVG(character_length(hits.parquet.Referer)) AS l, COUNT(*) AS c, MIN(hits.parquet.Referer) |
| | Filter: COUNT(*) > Int64(100000) |
| | Aggregate: groupBy=[[regexp_replace(hits.parquet.Referer, Utf8("^https?://(?:www\.)?([^/]+)/.*$"), Utf8("\1"))]], aggr=[[AVG(CAST(character_length(hits.parquet.Referer) AS Float64)), COUNT(UInt8(1)) AS COUNT(*), MIN(hits.parquet.Referer)]] |
| | Filter: hits.parquet.Referer != Utf8("") |
| | TableScan: hits.parquet projection=[Referer], partial_filters=[hits.parquet.Referer != Utf8("")] |
...
We have discussed this previously:
I also left a note with a suggested way to get started implementing this: #9371 (comment)
enable_ident_normalization
doesn't work / paper cut
There appears to (still) be some non trivial confusion about identifier normalization
For example, #9399 (comment)
I have corrected the issue #9399 (reply in thread) but clearly there is more confusion
Ideas to improve DataFusion:
- Consolidate examples
- Improve documentation and
- Examples in
datafusion-python
Substait
@lmwnshn reported that for their first project(link), they tried running substrait plans created by Apache Calcite using DataFusion. The idea was to
- Use Calcite as frontend SQL --> substrait
- Take the optimized queries --> run on the engine
- turn on / off optimzer passes
The good news is that DataFusion works well compared to the other system they tried, DuckDB. DuckDB could run only 1 query and DataFusion could run 15 of 21 queries (:clappy: for @vbarua, @BlizzaraB and @Lordworms!). However, 6 queries still failed so they reverted to a SQL based approach using DuckDB (see above)
He also pointed out the the official substrait consumer-testing repo simply checks that that Isthmus -> DataFusion throws an exception.:
Learnings
Apparently DuckDb can read subtrait json, this was easier for students to understand than the protobuf format / BLOB format.
I think DataFusion can read this format too (thanks @Lordworms!)
- For example: https://github.com/apache/datafusion/blob/0edc3d99ad63399696135d3bb3fc387b38803d1f/datafusion/substrait/tests/testdata/tpch_substrait_plans/query_01_plan.json (read by
async fn tpch_plan_to_string(query_id: i32) -> Result<String> {
However, it is not super clear how to run such plans with datafusion-cli
Ideas to improve DataFusion:
- Consider adding some way to use substrait plans via
datafusion-cli
(follow the duckdb syntax) - It might make this more apparent if
datafusion-cli
allowed running substrait plans via the command line (perhaps a table function?) - Implement / update the
substrait-consumer
repo to ensure / show DataFusion substrait working