[Epic] Make DataFusion a reliable foundation for building query engines

from https://datafusion.apache.org/

> DataFusion is an extensible query engine written in [Rust](http://rustlang.org/) that uses [Apache Arrow](https://arrow.apache.org/) as its in-memory format. DataFusion’s target users are developers building fast and feature rich database and analytic systems, customized to particular workloads. See [use cases](https://datafusion.apache.org/user-guide/introduction.html#use-cases) for examples.
> 
> “Out of the box,” DataFusion offers [SQL](https://datafusion.apache.org/user-guide/sql/index.html) and [Dataframe](https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html) APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. [Python Bindings](https://github.com/apache/datafusion-python) are also available.

ibid.

> DataFusion features a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and partitioned data sources. You can customize DataFusion at almost all points including additional data sources, query languages, functions, custom operators and more. See the [Architecture](https://datafusion.apache.org/contributor-guide/architecture.html) section for more details.

The two passages indicate dual nature of DataFusion

 First, it's a complete query engine, with which users can interact e.g. using datafusion-cli (or `dfdb` proposed in https://github.com/apache/datafusion/issues/11979) or DataFusion's SQL and DataFrames frontends.

Second is what @alamb often refers to as DataFusion being "LLVM for query engines", a building block for other applications, where components are re-usable. 

(See also https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz, https://datafusion.apache.org/contributor-guide/architecture.html and https://docs.rs/datafusion/latest/datafusion/index.html#architecture)

A query engine may implement a dialect of SQL that is identical with DataFusion SQL, or different. 
DataFusion doesn't need to know the details of the query engine being implemented (it is extensible rather than being union of all the query languages). DataFusion needs to allow expressing different query languages, providing a reliable and dialect-agnostic foundation for applications building on top of DataFusion.

A query engine and a query language have certain attributes

- language: syntax
- language: scope resolution rules
- language: type system, including coercion rules
- language: function repertoire
- engine: data sources
- engine: access control
- engine: observability

Internally, such a query engine transforms user queries (according to syntax, scoping, typing rules) into relational algebra operations, optimizes and executes them. Sounds simple, but in reality this is super complex and this is where DataFusion really shines.

This epic issue is a collection of tasks important for achieving this goal. Its description should be expected to be a living document.

On the high level, it aims at separation of concerns. The two roles DataFusion has -- an implementation of DataFusion SQL, a particular SQL dialect along _and_ being reusable building block -- they should be clearly separated so that dialect-specific behavior isn't implicitly inherited by components needed to be dialect-agnostic.

### Goals and Overview

As a result DataFusion should have the following layers

#### Frontend

DataFusion frontend includes DataFusion SQL - DataFusion's implementation of SQL.  DataFusion frontend also includes the DataFrame API.

It provides the following functionality

- SQL frontend, with the syntax supported being a subset of syntax supported by the `sqlparser` crate that DataFusion SQL uses
- the type system (currently Arrow data types (`DataType`) but that should change in https://github.com/apache/datafusion/issues/11513)
- a repertoire of builtin functions
- coercion rules required  to make the language useufl (i.e. the coercion rules that exists today)
- performs function resolution (taking function repertoire, their signatures and coercion rules into account)
- the frontend remains extensible e.g. by adding more functions or providing table sources.
- it may support extending the type system (https://github.com/apache/datafusion/issues/12644)

#### DataFusion main

DataFusion "main" or "core" represents a dialect- and syntax-agnostic execution query engine library, for building query engines.
It serves as an API for all query engine implementors that decide to build on top of DataFusion.

It provides the following functionality

- type system: a simplified version of Arrow `DataType` (https://github.com/apache/datafusion/issues/11513)
  - it is currently assume that allowing extension types (https://github.com/apache/datafusion/issues/12644), or totally different data types, is not necessary at this layer. The applications building on top of DataFusion will have to translate their type systems to that of DataFusion main
  - it should support annotating the DataFusion types with additional attributes or traits, so that custom optimizer rules can reason about them, or DataFusion type system gets reacher
     - for example, for a language with a constrained varchar type, a correct implementation of `UnwrapCastInComparison` for input `CAST(a_number AS varchar(4)) > '1234'` should check whether a_number can safely be represented in 4 characters (would the cast fail?). it clearly can't fail for eg int8 type, and clearly can fail for int32 type
- plan IR (intermediate representation) with strict typing rules and no implicit coercions (e.g. for `a = b` types of `a` and `b` match exactly)
- does not perform type coercions
- does not use function signatures (the formal signatures of functions inserted into the plan must match exactly)
- does not have a function repertoire (doesn't need it, since function resolution already happened), but _may_ provide implementations of functions that are reusable by the Frontend layer, or by applications building on top of DataFusion
- has simple (stripped down) expression language. For example is doesn't have to discern between `IS NULL` and `IS UNDEFINED` or have a Wildcard expression, since those things are handled by the frontend layer.
- optimizers which take the the IR plan and produce an equivalent better IR plan 
- since it doesn't do scope resolution, it doesn't need a concept of aliases, but needs to preserve output field names externally assigned (e.g. by the frontend layer)
  - see also https://github.com/apache/datafusion/issues/6543
  - see also https://github.com/apache/datafusion/issues/8379
  - see also https://github.com/apache/datafusion/issues/1468
- the DataFusion main remains extensible e.g. by adding custom optimizer rules, or custom plan nodes

#### DataFusion execution

It provides the following functionality

- physical planning (`ExecutionPlan`, `PhysicalExpr`)
- type system: a simplified version of Arrow `DataType` (https://github.com/apache/datafusion/issues/11513)
  - using Arrow `DataType` directly is also an option, but would prevent runtime-adaptive execution, see https://github.com/apache/datafusion/issues/12720, so simplified types would be strongly preferred, https://github.com/apache/datafusion/issues/11513
- has no coercion rules
- executes queries and produces results 

### Tasks

(Tasks to be added here once they are discovered and defined.)

- [ ] as part of https://github.com/apache/datafusion/issues/12357, clearly document project goals (Work “out of the box” + Customizable everything)
- [ ] introduce new simple DataFusion types: https://github.com/apache/datafusion/issues/11513
- [ ] allow functions to accept types according to new simple DataFusion types (relates to https://github.com/apache/datafusion/issues/12635, but that issue goes far beyond that)
- [ ] remove `sqlparser` dependency from all crates except `datafusion-sql` and `datafusion-cli` (it is OK to use for tests)
- [ ] move analyzer out of optimizer and into the SQL frontend - analyzer plays supportive role towards a particular SQL dialect implementation and as such is part of the SQL frontend. It should not live inside the `optimizer` crate
- [ ] separate implicitly typed expressions from explicitly typed expressions. The DataFrame API creates expressions directly and by its very frontendish nature they are not strictly typed, i.e. coercions are necessary. The IR layer needs expressions that have no implicit typing rules. The best approach would be to separate those expression hierarchies into two separate trees see also https://github.com/apache/datafusion/issues/12604.
- [ ] split `SessionState` into "frontend SessionState" and "core SessionState": the layers build on each other, so every layer is concerned with runtime-relevant attributes such as RuntimeEnv, but only the frontend needs to know the function repertoire for example (https://github.com/apache/datafusion/issues/12550 seems related)
- [ ] triage existing issues, cataloging them into frontend and core/main
- [ ] split `datafusion/core` into frontend part and reusable part. the public crate name is `datafusion` which is perfect from a ready to consume frontend component, so maybe we could introduce core as `datafusion-core` crate
- [ ] extend DataFusion core type system to better support building on top of it (see `UnwrapCastInComparison` example above) -- this is clearly vague and needs further specification
- ... surely much more ...



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Epic] Make DataFusion a reliable foundation for building query engines #12723

Goals and Overview

Frontend

DataFusion main

DataFusion execution

Tasks

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Epic] Make DataFusion a reliable foundation for building query engines #12723

Description

Goals and Overview

Frontend

DataFusion main

DataFusion execution

Tasks

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions