-
Couldn't load subscription status.
- Fork 1.7k
docs: Add Expr library developer page
#7359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
ac44a59
docs: fill out expr page
tshauck 242470c
fix: fixup a few issues
tshauck d70c2e2
docs: links to examples
tshauck 1fa3028
fix: impv variable name
tshauck d716c82
Apply suggestions from code review
tshauck 4d87440
fix: feedback updates
tshauck d8c9add
docs: update w/ feedback
tshauck File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,4 +19,166 @@ | |
|
|
||
| # Working with Exprs | ||
|
|
||
| Coming Soon | ||
| <!-- https://github.com/apache/arrow-datafusion/issues/7304 --> | ||
|
|
||
| `Expr` is short for "expression". It is a core abstraction in DataFusion for representing a computation, and follows the standard "expression tree" abstraction found in most compilers and databases. | ||
|
|
||
| For example, the SQL expression `a + b` would be represented as an `Expr` with a `BinaryExpr` variant. A `BinaryExpr` has a left and right `Expr` and an operator. | ||
|
|
||
| As another example, the SQL expression `a + b * c` would be represented as an `Expr` with a `BinaryExpr` variant. The left `Expr` would be `a` and the right `Expr` would be another `BinaryExpr` with a left `Expr` of `b` and a right `Expr` of `c`. As a classic expression tree, this would look like: | ||
|
|
||
| ```text | ||
| ┌────────────────────┐ | ||
| │ BinaryExpr │ | ||
| │ op: + │ | ||
| └────────────────────┘ | ||
| ▲ ▲ | ||
| ┌───────┘ └────────────────┐ | ||
| │ │ | ||
| ┌────────────────────┐ ┌────────────────────┐ | ||
| │ Expr::Col │ │ BinaryExpr │ | ||
| │ col: a │ │ op: * │ | ||
| └────────────────────┘ └────────────────────┘ | ||
| ▲ ▲ | ||
| ┌────────┘ └─────────┐ | ||
| │ │ | ||
| ┌────────────────────┐ ┌────────────────────┐ | ||
| │ Expr::Col │ │ Expr::Col │ | ||
| │ col: b │ │ col: c │ | ||
| └────────────────────┘ └────────────────────┘ | ||
| ``` | ||
|
|
||
| As the writer of a library, you may want to use or create `Expr`s to represent computations that you want to perform. This guide will walk you through how to make your own scalar UDF as an `Expr` and how to rewrite `Expr`s to inline the simple UDF. | ||
|
|
||
| There are also executable examples for working with `Expr`s: | ||
|
|
||
| - [rewrite_expr.rs](../../../datafusion-examples/examples/catalog.rs) | ||
| - [expr_api.rs](../../../datafusion-examples/examples/expr_api.rs) | ||
|
|
||
| ## A Scalar UDF Example | ||
|
|
||
| We'll use a `ScalarUDF` expression as our example. This necessitates implementing an actual UDF, and for ease we'll use the same example from the [adding UDFs](./adding-udfs.md) guide. | ||
|
|
||
| So assuming you've written that function, you can use it to create an `Expr`: | ||
|
|
||
| ```rust | ||
| let add_one_udf = create_udf( | ||
| "add_one", | ||
| vec![DataType::Int64], | ||
| Arc::new(DataType::Int64), | ||
| Volatility::Immutable, | ||
| make_scalar_function(add_one), // <-- the function we wrote | ||
| ); | ||
|
|
||
| // make the expr `add_one(5)` | ||
| let expr = add_one_udf.call(vec![lit(5)]); | ||
|
|
||
| // make the expr `add_one(my_column)` | ||
| let expr = add_one_udf.call(vec![col("my_column")]); | ||
| ``` | ||
|
|
||
| If you'd like to learn more about `Expr`s, before we get into the details of creating and rewriting them, you can read the [expression user-guide](./../user-guide/expressions.md). | ||
|
|
||
| ## Rewriting Exprs | ||
|
|
||
| Rewriting Expressions is the process of taking an `Expr` and transforming it into another `Expr`. This is useful for a number of reasons, including: | ||
|
|
||
| - Simplifying `Expr`s to make them easier to evaluate | ||
| - Optimizing `Expr`s to make them faster to evaluate | ||
| - Converting `Expr`s to other forms, e.g. converting a `BinaryExpr` to a `CastExpr` | ||
|
|
||
| In our example, we'll use rewriting to update our `add_one` UDF, to be rewritten as a `BinaryExpr` with a `Literal` of 1. We're effectively inlining the UDF. | ||
|
|
||
| ### Rewriting with `transform` | ||
|
|
||
| To implement the inlining, we'll need to write a function that takes an `Expr` and returns a `Result<Expr>`. If the expression is _not_ to be rewritten `Transformed::No` is used to wrap the original `Expr`. If the expression _is_ to be rewritten, `Transformed::Yes` is used to wrap the new `Expr`. | ||
|
|
||
| ```rust | ||
| fn rewrite_add_one(expr: Expr) -> Result<Expr> { | ||
| expr.transform(&|expr| { | ||
| Ok(match expr { | ||
| Expr::ScalarUDF(scalar_fun) if scalar_fun.fun.name == "add_one" => { | ||
| let input_arg = scalar_fun.args[0].clone(); | ||
| let new_expression = input_arg + lit(1i64); | ||
|
|
||
| Transformed::Yes(new_expression) | ||
| } | ||
| _ => Transformed::No(expr), | ||
| }) | ||
| }) | ||
| } | ||
| ``` | ||
|
|
||
| ### Creating an `OptimizerRule` | ||
|
|
||
| In DataFusion, an `OptimizerRule` is a trait that supports rewriting`Expr`s that appear in various parts of the `LogicalPlan`. It follows DataFusion's general mantra of trait implementations to drive behavior. | ||
|
|
||
| We'll call our rule `AddOneInliner` and implement the `OptimizerRule` trait. The `OptimizerRule` trait has two methods: | ||
|
|
||
| - `name` - returns the name of the rule | ||
| - `try_optimize` - takes a `LogicalPlan` and returns an `Option<LogicalPlan>`. If the rule is able to optimize the plan, it returns `Some(LogicalPlan)` with the optimized plan. If the rule is not able to optimize the plan, it returns `None`. | ||
|
|
||
| ```rust | ||
| struct AddOneInliner {} | ||
|
|
||
| impl OptimizerRule for AddOneInliner { | ||
| fn name(&self) -> &str { | ||
| "add_one_inliner" | ||
| } | ||
|
|
||
| fn try_optimize( | ||
| &self, | ||
| plan: &LogicalPlan, | ||
| config: &dyn OptimizerConfig, | ||
| ) -> Result<Option<LogicalPlan>> { | ||
| // Map over the expressions and rewrite them | ||
| let new_expressions = plan | ||
| .expressions() | ||
| .into_iter() | ||
| .map(|expr| rewrite_add_one(expr)) | ||
| .collect::<Result<Vec<_>>>()?; | ||
|
|
||
| let inputs = plan.inputs().into_iter().cloned().collect::<Vec<_>>(); | ||
|
|
||
| let plan = plan.with_new_exprs(&new_expressions, &inputs); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 |
||
|
|
||
| plan.map(Some) | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| Note the use of `rewrite_add_one` which is mapped over `plan.expressions()` to rewrite the expressions, then `plan.with_new_exprs` is used to create a new `LogicalPlan` with the rewritten expressions. | ||
|
|
||
| We're almost there. Let's just test our rule works properly. | ||
|
|
||
| ## Testing the Rule | ||
|
|
||
| Testing the rule is fairly simple, we can create a SessionState with our rule and then create a DataFrame and run a query. The logical plan will be optimized by our rule. | ||
|
|
||
| ```rust | ||
| use datafusion::prelude::*; | ||
|
|
||
| let rules = Arc::new(AddOneInliner {}); | ||
| let state = ctx.state().with_optimizer_rules(vec![rules]); | ||
|
|
||
| let ctx = SessionContext::with_state(state); | ||
| ctx.register_udf(add_one); | ||
|
|
||
| let sql = "SELECT add_one(1) AS added_one"; | ||
| let plan = ctx.sql(sql).await?.logical_plan(); | ||
|
|
||
| println!("{:?}", plan); | ||
| ``` | ||
|
|
||
| This results in the following output: | ||
|
|
||
| ```text | ||
| Projection: Int64(1) + Int64(1) AS added_one | ||
| EmptyRelation | ||
| ``` | ||
|
|
||
| I.e. the `add_one` UDF has been inlined into the projection. | ||
|
|
||
| ## Conclusion | ||
|
|
||
| In this guide, we've seen how to create `Expr`s programmatically and how to rewrite them. This is useful for simplifying and optimizing `Expr`s. We've also seen how to test our rule to ensure it works properly. | ||
tshauck marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.