Skip to content

[Epic] Make DataFusion a reliable foundation for building query engines #12723

@findepi

Description

@findepi

from https://datafusion.apache.org/

DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. DataFusion’s target users are developers building fast and feature rich database and analytic systems, customized to particular workloads. See use cases for examples.

“Out of the box,” DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python Bindings are also available.

ibid.

DataFusion features a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and partitioned data sources. You can customize DataFusion at almost all points including additional data sources, query languages, functions, custom operators and more. See the Architecture section for more details.

The two passages indicate dual nature of DataFusion

First, it's a complete query engine, with which users can interact e.g. using datafusion-cli (or dfdb proposed in #11979) or DataFusion's SQL and DataFrames frontends.

Second is what @alamb often refers to as DataFusion being "LLVM for query engines", a building block for other applications, where components are re-usable.

(See also https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz, https://datafusion.apache.org/contributor-guide/architecture.html and https://docs.rs/datafusion/latest/datafusion/index.html#architecture)

A query engine may implement a dialect of SQL that is identical with DataFusion SQL, or different.
DataFusion doesn't need to know the details of the query engine being implemented (it is extensible rather than being union of all the query languages). DataFusion needs to allow expressing different query languages, providing a reliable and dialect-agnostic foundation for applications building on top of DataFusion.

A query engine and a query language have certain attributes

  • language: syntax
  • language: scope resolution rules
  • language: type system, including coercion rules
  • language: function repertoire
  • engine: data sources
  • engine: access control
  • engine: observability

Internally, such a query engine transforms user queries (according to syntax, scoping, typing rules) into relational algebra operations, optimizes and executes them. Sounds simple, but in reality this is super complex and this is where DataFusion really shines.

This epic issue is a collection of tasks important for achieving this goal. Its description should be expected to be a living document.

On the high level, it aims at separation of concerns. The two roles DataFusion has -- an implementation of DataFusion SQL, a particular SQL dialect along and being reusable building block -- they should be clearly separated so that dialect-specific behavior isn't implicitly inherited by components needed to be dialect-agnostic.

Goals and Overview

As a result DataFusion should have the following layers

Frontend

DataFusion frontend includes DataFusion SQL - DataFusion's implementation of SQL. DataFusion frontend also includes the DataFrame API.

It provides the following functionality

  • SQL frontend, with the syntax supported being a subset of syntax supported by the sqlparser crate that DataFusion SQL uses
  • the type system (currently Arrow data types (DataType) but that should change in [Proposal] Decouple logical from physical types #11513)
  • a repertoire of builtin functions
  • coercion rules required to make the language useufl (i.e. the coercion rules that exists today)
  • performs function resolution (taking function repertoire, their signatures and coercion rules into account)
  • the frontend remains extensible e.g. by adding more functions or providing table sources.
  • it may support extending the type system (Support Extension Types / User Defined Types in DataFusion #12644)

DataFusion main

DataFusion "main" or "core" represents a dialect- and syntax-agnostic execution query engine library, for building query engines.
It serves as an API for all query engine implementors that decide to build on top of DataFusion.

It provides the following functionality

  • type system: a simplified version of Arrow DataType ([Proposal] Decouple logical from physical types #11513)
    • it is currently assume that allowing extension types (Support Extension Types / User Defined Types in DataFusion #12644), or totally different data types, is not necessary at this layer. The applications building on top of DataFusion will have to translate their type systems to that of DataFusion main
    • it should support annotating the DataFusion types with additional attributes or traits, so that custom optimizer rules can reason about them, or DataFusion type system gets reacher
      • for example, for a language with a constrained varchar type, a correct implementation of UnwrapCastInComparison for input CAST(a_number AS varchar(4)) > '1234' should check whether a_number can safely be represented in 4 characters (would the cast fail?). it clearly can't fail for eg int8 type, and clearly can fail for int32 type
  • plan IR (intermediate representation) with strict typing rules and no implicit coercions (e.g. for a = b types of a and b match exactly)
  • does not perform type coercions
  • does not use function signatures (the formal signatures of functions inserted into the plan must match exactly)
  • does not have a function repertoire (doesn't need it, since function resolution already happened), but may provide implementations of functions that are reusable by the Frontend layer, or by applications building on top of DataFusion
  • has simple (stripped down) expression language. For example is doesn't have to discern between IS NULL and IS UNDEFINED or have a Wildcard expression, since those things are handled by the frontend layer.
  • optimizers which take the the IR plan and produce an equivalent better IR plan
  • since it doesn't do scope resolution, it doesn't need a concept of aliases, but needs to preserve output field names externally assigned (e.g. by the frontend layer)
  • the DataFusion main remains extensible e.g. by adding custom optimizer rules, or custom plan nodes

DataFusion execution

It provides the following functionality

Tasks

(Tasks to be added here once they are discovered and defined.)

  • as part of [DISCUSS] Document criteria for adding new features / what belongs in core DataFusion (e.g. sql syntax, functions, etc) #12357, clearly document project goals (Work “out of the box” + Customizable everything)
  • introduce new simple DataFusion types: [Proposal] Decouple logical from physical types #11513
  • allow functions to accept types according to new simple DataFusion types (relates to Simple Functions  #12635, but that issue goes far beyond that)
  • remove sqlparser dependency from all crates except datafusion-sql and datafusion-cli (it is OK to use for tests)
  • move analyzer out of optimizer and into the SQL frontend - analyzer plays supportive role towards a particular SQL dialect implementation and as such is part of the SQL frontend. It should not live inside the optimizer crate
  • separate implicitly typed expressions from explicitly typed expressions. The DataFrame API creates expressions directly and by its very frontendish nature they are not strictly typed, i.e. coercions are necessary. The IR layer needs expressions that have no implicit typing rules. The best approach would be to separate those expression hierarchies into two separate trees see also Proposal: introduced typed expressions, separate AST and IR #12604.
  • split SessionState into "frontend SessionState" and "core SessionState": the layers build on each other, so every layer is concerned with runtime-relevant attributes such as RuntimeEnv, but only the frontend needs to know the function repertoire for example ([EPIC] Easier extension configuration SessionState / SessionConfig #12550 seems related)
  • triage existing issues, cataloging them into frontend and core/main
  • split datafusion/core into frontend part and reusable part. the public crate name is datafusion which is perfect from a ready to consume frontend component, so maybe we could introduce core as datafusion-core crate
  • extend DataFusion core type system to better support building on top of it (see UnwrapCastInComparison example above) -- this is clearly vague and needs further specification
  • ... surely much more ...

Metadata

Metadata

Assignees

No one assigned

    Labels

    PROPOSAL EPICA proposal being discussed that is not yet fully underway

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions