-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35564][SQL] Improve subexpression elimination #41677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-35564][SQL] Improve subexpression elimination #41677
Conversation
e7bc27a to
7a405a9
Compare
|
This PR is still WIP because I want to add more tests, but @Kimahriman, @cloud-fan, @rednaxelafx, @ulysses-you, @viirya, @wankunde you might be interrested... |
|
This hurts my brain thinking about probabilistic conditional evaluations, and I feel like the subexpression elimination logic is already overly complicated. If I wanted to just create subexpressions for anything that is definitely executed once and maybe executed one other time (regardless of how nested inside a CaseWhen or Coalesce operation), what do I even set the new setting to? |
Got it, thanks for your feedback. If Currently the new config is used as |
|
If you just have |
No, I mean although I removed |
| // But we can continue the previous logic further because if `w2` is evaluated, then based | ||
| // on the result of `w2` either `t2` or `w3` is also evaluated. | ||
| // So eventually the local equivalence map can be calculated as | ||
| // `W1 | (T1 & W2 | T2 & (W3 | T3 & ... & (Wn | Tn & E)))`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a missing parenthesis here? I'm trying to understand the order of operations once you get to T2.
Is it W1 | T1 & (W2 | T2 & (W3 | ...
assuming normal higher precedence of &
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I fixed the comment in bebfa21 to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ I've changed the new config in: 38a0996 and from now the default 0 value has the same effect as spark.sql.subexpressionElimination.conditionals.enabled in your PR.
|
The failure in |
|
I've added a few more test cases and this PR is now ready for review. |
9d88d8e to
c1576f1
Compare
| .filter(_ >= 0d), | ||
| allowLeafExpressions: Boolean = false) { | ||
|
|
||
| // The subexpressions are stored by height to speed up certain calculations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorted by height?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maps is an array buffer and each element stores a map that contains expressions with certain height. The ith element contains a map of expressions with height i+1.
| * | ||
| * Please note that `EquivalentExpressions` is mainly used in subexpression elimination where common | ||
| * non-leaf expression subtrees are calculated, but there there is one special use case in | ||
| * `PhysicalAggregation` where `EquivalentExpressions` is used as a mutable set of non-deterministic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, how can EquivalentExpressions handle non-deterministic expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is not right. I should have written mutable set of deterministic expressions
| * Adds each expression to this data structure and returns true if there was already a matching | ||
| * expression. | ||
| */ | ||
| def addExpr(expr: Expression): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like the only difference between this and addExprTree is, addExprTree allows non-deterministic expression. Shall we name these two methods better?
| * A wrapper in place of using Seq[Expression] to record a group of equivalent expressions. | ||
| * This class stores the expected evaluation count of expressions split into `evalCount` + | ||
| * `realEvalCount` that records sure evaluations and `condEvalCount` + `realCondEvalCount` that | ||
| * records conditional evaluations. The `real...` fields are filled up during `inflate()`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get the meaning of the real... fields by reading the comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like "direct eval count" and "transitive eval count". e.g. if we addExpr(a + 1), then the "direct eval count" of a + b is 1, and the "transive eval count" of a is 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, direct and transitive are better prefixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw transitive (currently called real) is actually a sum of the direct additions and the transitive additions from parent expressions.
If we addExprTree(a + b) and addExpr((a + b) + 1) then the direct of a + b is 1 and the transitive of a + b is 2.
We wouldn't need the transitive fields if we did recurse during addExprTree().
But you know, the previous version of EquivalentExpressions used useCount. And with useCount when same or overlapping expressions were added to the data structure the second addExprTree() didn't fully recurse, but it stopped when the first common subexpression was found. Now with the new structure, we just record the direct additions during addExprTree() and fill the transitives during inflate() to be par on with the old version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunaltey the intersect and union operation are not possible with the old "compressed" useCount and that's why we need evalCounts.
| case _: LambdaVariable => true | ||
| private def inflateExprState(exprStats: ExpressionStats): Unit = { | ||
| val expr = exprStats.expr | ||
| if (!expr.isInstanceOf[LeafExpression] || allowLeafExpressions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what will go wrong if we always allow leaf expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to store leaf expressions when we use EquivalentExpressions for CSE as leafs don't make sense to evaluate in advance. addExprTree() didn't recurse to leafs in the previous version but now that addExprTree() doesn't recurse, we need this flag.
| } else { | ||
| (otherValue.realEvalCount, value.realEvalCount) | ||
| } | ||
| value.realCondEvalCount += otherValue.realCondEvalCount + max - min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be value.realCondEvalCount = (value.realCondEvalCount + otherValue.realCondEvalCount) / 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value.realCondEvalCount = (value.realCondEvalCount + otherValue.realCondEvalCount + max - min) / 2 is the full calculation, but the value.realCondEvalCount /= 2 extracted a bit below.
The max - min / 2 is also need. E.g. if we have If(_, a + b, (a + b) + (a + b)) then during the intersect of the then and else branches we have a + b -> 1 + 0 in then and a + b -> 2 + 0 in else. The result should be a + b -> 1 + 0.5 (sure + conditional).
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
|
@cloud-fan, @Kimahriman please let me know if we should take this PR further, otherwise I let this PR closed by the automation. |
What changes were proposed in this pull request?
This PR proposes a new way to do subexpression elimination in
EquivalentExpressions. The main change of the PR is thatExpressionStatsstores the expected evaluation count of subexpressions split intoevalCountthat records sure evaluations andconditionalEvalCountthat records expected conditional evaluations.Please note that the expected conditional evaluation count is not the same as how many times a subexpression appears in
EquivalentExpressionsconditionally (conditional use count). The expected conditional evaluation count better describes how likely a conditional subexpression is evaluated and so we can use it easier to define when we should consider a subexpression for elimination (e.g. some kind of threshold is reached).The idea behind using expected conditional evaluation count is that if we consider 2 cases where a
cnon-leaf subexpression:IfexpressionCaseWhenexpression,then the conditional use count is 1 in both cases. But the expected conditional evaluation counts are different. Very likely
cwill be evaluated more in the first case if we consider random input data. Since we don't know the exact probabilities of the branches, for the sake of simplicity all branchings are modelled with 0.5 / 0.5 probabilities in this PR.Please find a related conversation here about the default 0 value of the config: #32987 (comment)
Here are a few example expressions and the
ExpressionStats(sure + expected conditional evaluation counts) of a non-leafcsubexpression from the equivalence maps built from the expressions:ExpressionStatsofccc -> (1 + 0.0)c + cc -> (2 + 0.0)If(_, c, _)c -> (0 + 0.5)If(_, c + c, _)c -> (0 + 1.0)If(_, c, c)c -> (1 + 0.0)If(c, c, _)c -> (1 + 0.5)If(c, c, c)c -> (2 + 0.0)This PR:
spark.sql.subexpressionElimination.minExpectedConditionalEvaluationCountconfig.CaseWhenandCoalesceexpressions. Branch groups were used for calculating common subexpressions in conditional branches based on the idea that subexpressions that appear in all elements of a group are surely evaluated once. If we take theCaseWhen(w1, t1, w2, t2, w3, t3, e)example then the previously defined (t1,t2,t3,e) group made sense, but for some reason the (w1,w2,w3) group was also defined, which didn't make sense becausew1was also considered always evaluated. Also, some other groups that would have made sense (t1,w2) and (t1,t2,w3) were not defined. This PR completely removes branch groups fromConditionalExpressionand uses a new way to calculate surely evaluated subexpressions.Why are the changes needed?
Improve subexpression elimination.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing and new UTs (including the ones from @Kimahriman's PR: #32987).