Write DataFusion paper for (SIGMOD / VLDB / ICDE) #6782

@alamb

UPDATE: Final paper: https://dl.acm.org/doi/10.1145/3626246.3653368

Task List for SIGMOD Paper:

Per #6782 (comment), here is a list of TODO items:

  • Initial draft of Introduction
  • Initial draft of Background
  • Initial draft of Design @Dandandan adding info about join
  • Initial draft of Optimizations @Dandandan initial work on adding some contents
  • Initial draft of Extensibility
  • Initial draft of Related Work
  • Initial draft of Conclusion
  • Create Intro diagram ("teaser")
  • Create Architecture diagram
  • Take a first pass to add references
  • Create CMT account on https://cmt3.research.microsoft.com/SIGMODIndustry2024
  • Iterate / Polish

Issues Blocking Full Performance Results

Issues that would make the results more compelling

  • (TO FILE or Find): Ensure performance doesn't degrade with larger core counts on some queries

Is your feature request related to a problem or challenge?

I would like to increase awareness of DataFusion in the broader technical community. One way to build mindshare is to get a paper / talk published at a prestigious conference like VLDB or SIGMOD.

Writing a paper is a good way to show the strengths of arrow/datafusion.
Through the paper, more teachers, students, and researchers may get involved and contribute to the project.

Describe the solution you'd like

I would like to write a paper that explains DataFusion.
Thesis: "You don't need a tightly integrated execution system to get good performance"

These blog posts have some good material in the introduction:
https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/
https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0/

Then we would compare and contrast DataFusion's approach with that of tightly integrated systems such as pola.rs and DuckDB.

We would then describe the architecture of DataFusion and its many extension points (DataFrame, functions, aggregates, window functions, sinks, etc.).

Performance:
Show that DataFusion is in the same ballpark as DuckDB for aggregation, grouping, etc. (e.g. TPC-H).
We already have this for querying Parquet.

Describe alternatives you've considered

VLDB: https://vldb.org/2024/?call-for-industrial-track

Submissions open December 6, 2023
Short abstracts deadline February 16, 2024
Full papers or extended abstracts deadline March 1, 2024
Notifications May 8, 2024
Camera-ready June 15, 2024

SIGMOD: https://2024.sigmod.org/calls_papers_important_dates.shtml

Industrial track: https://2024.sigmod.org/comingsoon.shtml (TBD)

Research paper submission round 4 (All Deadlines are 11:59 PM Pacific Time)

October 15, 2023: Paper submission
November 26-28, 2023: Author feedback phase
December 20, 2023: Notification of accept/reject/review again
January 20, 2024: Revised paper submission
February 23, 2024: Final notification of accept/reject

ICDE:

Industrial Track: https://icde2024.github.io/CFP_industry.html

All deadlines below are 5 PM Pacific Time.

Paper submission: Monday, November 20, 2023
Notification of accept/reject: Wednesday, January 31, 2024
Camera-ready deadline: Thursday, March 28, 2024

Additional context

No response
