Description
UPDATE: Final paper: https://dl.acm.org/doi/10.1145/3626246.3653368
Task List for SIGMOD Paper:
Per #6782 (comment), here is a list of TODO items:
- Initial draft of Introduction
- Initial draft of Background
- Initial draft of Design @Dandandan adding info about join
- Initial draft of Optimizations @Dandandan initial work on adding some contents
- Initial draft of Extensibility
- Initial draft of Related Work
- Initial draft of Conclusion
- Create Intro diagram ("teaser")
- Create Architecture diagram
- Take a first pass to add references
- Create CMT account on https://cmt3.research.microsoft.com/SIGMODIndustry2024
- Iterate / Polish
Issues Blocking Full Performance Results
- Error: There isn't a common type to coerce Binary and Utf8 in LIKE expression #7342
- Invalid argument error: the data type binary has no natural order #7343
- Internal error: The "character_length" function can only accept strings #7344
- Internal error: The "regex_replace" function can only accept strings #7345
Issues that would make the results more compelling
- (TO FILE or Find): Ensure performance doesn't degrade with larger core counts on some queries
Is your feature request related to a problem or challenge?
I would like to increase awareness of DataFusion in the broader technical community. One way to build mindshare is to get a paper or talk published at a prestigious conference such as VLDB or SIGMOD.
Writing a paper is a good way to show the strengths of Arrow and DataFusion.
Through the paper, more teachers, students, and researchers may get involved and contribute to the project.
Describe the solution you'd like
I would like to write a paper that explains DataFusion.
Thesis: "You don't need a tightly integrated execution system to get good performance"
These blog posts have some good material for the introduction:
https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/
https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0/
Then we would compare and contrast the approaches of tightly integrated systems such as pola.rs and DuckDB with DataFusion's approach.
We would then describe the architecture of DataFusion and its many extension points (DataFrame, functions, aggregates, window functions, sinks, etc.).
Performance:
Show that DataFusion is in the same ballpark as DuckDB for aggregation, grouping, etc. (e.g. TPC-H).
We already have this for querying Parquet.
Describe alternatives you've considered
VLDB: https://vldb.org/2024/?call-for-industrial-track
Milestone | Date |
---|---|
Submissions open | December 6, 2023 |
Short abstracts deadline | February 16, 2024 |
Full papers or extended abstracts deadline | March 1, 2024 |
Notifications | May 8, 2024 |
Camera-ready | June 15, 2024 |
SIGMOD: https://2024.sigmod.org/calls_papers_important_dates.shtml
Industrial track: https://2024.sigmod.org/comingsoon.shtml (TBD)
Research paper submission round 4 (All Deadlines are 11:59 PM Pacific Time)
October 15, 2023: Paper submission
November 26-28, 2023: Author feedback phase
December 20, 2023: Notification of accept/reject/review again
January 20, 2024: Revised paper submission
February 23, 2024: Final notification of accept/reject
ICDE:
Industrial Track: https://icde2024.github.io/CFP_industry.html
All deadlines below are 5 PM Pacific Time.
Paper submission: Monday, November 20, 2023
Notification of accept/reject: Wednesday, January 31, 2024
Camera-ready deadline: Thursday, March 28, 2024