Description
As part of the 16.0.0 release, I would like to write a blog post about DataFusion for https://arrow.apache.org/blog/ (source at https://github.com/apache/arrow-site).
I am thinking of a basic theme along the lines of "DataFusion is leading the charge to bring advanced OLAP technology everywhere."
I would like to highlight the trend summarized by Andy Pavlo in https://ottertune.com/blog/2022-databases-retrospective/:
The long-term trend to watch is the proliferation of frameworks like Velox, DataFusion, and Polars. Along with projects like Substrait, the commoditization of these query execution components means that all OLAP DBMSs will be roughly equivalent in the next five years. Instead of building a new DBMS entirely from scratch or hard forking an existing system (e.g., how Firebolt forked Clickhouse), people are better off using an extensible framework like Velox. This means that every DBMS will have the same vectorized execution capabilities that were unique to Snowflake ten years ago. And since in the cloud, the storage layer is the same for everyone (e.g., Amazon controls EBS/S3), the critical differentiator between DBMS offerings will be things that are difficult to quantify, like UI/UX stuff and query optimization.
Some supporting evidence:
- Several new databases built on DataFusion (synnada.ai, GreptimeDB, probably others)
- GA of InfluxDB IOx
New features:
- Advanced window functions (such as unbounded windows)
- Join support (TODO gather more details)
- Optimizer advancements
Future directions:
1. Improved grouping / sorting performance
2. RLE (Run End Encoding) support
etc.
Here is the most recent blog post about DataFusion that I know of: https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0 (source at https://github.com/apache/arrow-site/blob/master/_posts/2022-10-25-datafusion-13.0.0.md)
Please leave comments with your suggestions / ideas!