[SPARK-35940][SQL] Refactor EquivalentExpressions to make it more efficient #33142

cloud-fan · 2021-06-29T19:14:34Z

What changes were proposed in this pull request?

This PR uses 2 ideas to make EquivalentExpressions more efficient:

1. do not keep all the equivalent expressions, we only need a count
2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in addCommonExprs
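
A minimal sketch of the first idea, using a toy expression tree in place of Catalyst's `Expression`. The names `Stats`, `addExpr` and `addExprTree` are hypothetical and only illustrate the bookkeeping; case-class equality stands in for semantic equality, and leaves are registered here although the real class may skip them.

```scala
import scala.collection.mutable

object UseCountSketch {
  // Toy expression tree (assumption: a stand-in for Catalyst's Expression).
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  // One counter per distinct expression instead of a buffer of all equivalent ones.
  final class Stats(var useCount: Int = 1)

  // Returns true if `expr` was already registered (and bumps its count).
  def addExpr(expr: Expr, map: mutable.HashMap[Expr, Stats]): Boolean =
    map.get(expr) match {
      case Some(stats) =>
        stats.useCount += 1
        true
      case None =>
        map(expr) = new Stats()
        false
    }

  // Recurse into the children only the first time an expression is seen.
  def addExprTree(expr: Expr, map: mutable.HashMap[Expr, Stats]): Unit =
    if (!addExpr(expr, map)) expr.children.foreach(addExprTree(_, map))
}
```

Adding `a + b` twice leaves a count of 2 on `a + b` and a count of 1 on `a`, because the second occurrence does not recurse into its children.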

This PR also fixes several small bugs (exposed by the refactoring), please see PR comments.

Why are the changes needed?

code cleanup and small perf improvement

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

cloud-fan · 2021-06-29T19:16:38Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

This partially solves the perf issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455

By filtering with height first, we can reduce the data to iterate.

I opened #33281 to improve it further.

cloud-fan · 2021-06-29T19:21:03Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

not sure we can trigger this bug with some real queries, but it's an obvious bug to me.

good catch!

I think if we wrongly recurse into the children of CodegenFallback, it only produces unused subexpressions, i.e. some redundant generated code.

Is it better to backport this part into branch-3.1/3.0?

yea will do

cloud-fan · 2021-06-29T19:21:45Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

This fixes #30245 (comment)

Basically it takes all the conditions as the commonChildrenToRecurse, so that we only get the common expressions that appear in all the conditions.

@Kimahriman I think this fix works? The only drawback is that if there are common subexpressions among the conditions, they will always be counted as "appear twice" and get codegened into methods.

I think the perf overhead is really small, and if the first condition is false, we evaluate the next condition, which gets a perf improvement from common subexpression elimination.

For the value branches of CaseWhen, I don't touch them in this PR.

Yeah this definitely fixes a potential bug of creating subexpressions for things that are never evaluated, same with the coalesce update. I think the values are already handled fine, it's just the conditionals that had an issue with short circuiting

This fixed #30245 (comment).

> The only drawback is, if there are common subexpressions among the conditions, they will always be counted as "appear twice" and gets codegened into methods.

I just don't get this. You mean for CaseWhen(a + b > 1, 1, a + b + c > 1, 2, a + b + c > 2, 3), a + b + c will be counted twice and considered a common subexpression?

I think he means in CaseWhen(a + b > 1, 1, a + b + c > 1, 2), a + b will be a subexpression even though it might only be executed once.

But in CaseWhen(a + b > 1, 1, a + b + c > 1, 2, a + b + c > 0, 3), a + b + c won't even be considered for a subexpression if it's seen elsewhere, which was the bug if CaseWhen supports short circuiting.

Yes, because the first condition of CaseWhen is in both childrenToRecurse and commonChildrenToRecurse
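
The "common to all conditions" behaviour discussed in this thread can be sketched with plain set intersection. This is an illustration on a toy tree, not the code in `EquivalentExpressions`; `Gt`, `nonLeafSubtrees` and `commonToAll` are hypothetical names, and leaves are left out only to keep the result small.

```scala
object CommonConditionsSketch {
  // Toy expression tree (assumption: a stand-in for Catalyst's Expression).
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  case class Gt(left: Expr, literal: Int) extends Expr {
    def children: Seq[Expr] = Seq(left)
  }

  // All non-leaf subtrees of `e`, including `e` itself.
  def nonLeafSubtrees(e: Expr): Set[Expr] =
    if (e.children.isEmpty) Set.empty[Expr]
    else {
      val below: Set[Expr] = e.children.flatMap(nonLeafSubtrees).toSet
      below + e
    }

  // Only subexpressions that occur in every condition are kept.
  def commonToAll(conditions: Seq[Expr]): Set[Expr] =
    conditions.map(nonLeafSubtrees).reduce(_ intersect _)
}
```

For the conditions `a + b > 1`, `a + b + c > 1` and `a + b + c > 0`, only `a + b` is common to all three; `a + b + c` is not, because the first condition does not contain it.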

cloud-fan · 2021-06-29T19:31:39Z

cc @viirya @maropu

SparkQA · 2021-06-29T19:55:24Z

Test build #140399 has finished for PR 33142 at commit e1362da.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExpressionEquals(e: Expression)
case class ExpressionStats(expr: Expression)(

viirya · 2021-06-29T22:55:25Z

> track the "height" of common subexpressions, to quickly do child-parent sort.

About this, I think the sorting is not reliable as it is hard to do child-parent sort. I have another proposal to get rid of the sort as I mentioned before.

cloud-fan · 2021-06-30T03:39:40Z

Can you briefly introduce your idea? Sorting by height is stable and fast now.

And I need the height anyway in #33142 (comment)

viirya · 2021-06-30T05:20:53Z

> Can you briefly introduce your idea? Sorting by height is stable and fast now.

Basically, the steps are:

1. Populate the SubExprEliminationState map for all subexprs (they do not need to be sorted). Only create the value and isNull variables, don't do codegen yet.
2. Iterate over all subexprs to do codegen. Because expression codegen will look at the map to replace subexprs, any subexpr in the children will be replaced and chained. So we don't need to sort subexprs in advance.
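
A rough sketch of these two steps, generating strings instead of Java code. Everything here (`gen`, `definitions`, the `subExpr_N` variable names) is hypothetical and only shows why the iteration order stops mattering once every subexpression has a reserved variable; it does not model the order in which the generated definitions are evaluated.

```scala
object TwoPhaseSketch {
  // Toy expression tree (assumption: a stand-in for Catalyst's Expression).
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Render `expr`, replacing any registered subexpression by its variable.
  // `skipRoot` makes a subexpression render its own body instead of its own variable.
  def gen(expr: Expr, vars: Map[Expr, String], skipRoot: Boolean = false): String =
    vars.get(expr) match {
      case Some(v) if !skipRoot => v
      case _ =>
        expr match {
          case Attr(n) => n
          case Add(l, r) => s"(${gen(l, vars)} + ${gen(r, vars)})"
        }
    }

  // Returns variable name -> generated body for every subexpression.
  def definitions(subExprs: Seq[Expr]): Map[String, String] = {
    // Step 1: reserve a variable for every subexpression, no codegen yet.
    val vars = subExprs.zipWithIndex.map { case (e, i) => e -> s"subExpr_$i" }.toMap
    // Step 2: generate each body; nested subexpressions resolve through the map.
    subExprs.map(e => vars(e) -> gen(e, vars, skipRoot = true)).toMap
  }
}
```

Passing the parent `a + b + c` before the child `a + b` still yields a parent body that refers to the child's variable, so no child-parent sort is needed for the substitution itself.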

SparkQA · 2021-06-30T14:42:07Z

Test build #140441 has finished for PR 33142 at commit f925cc9.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExpressionEquals(e: Expression)
case class ExpressionStats(expr: Expression)(
case class SubExprEliminationState(eval: ExprCode, children: Seq[SubExprEliminationState])

SparkQA · 2021-06-30T15:21:16Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44955/

viirya · 2021-06-30T18:29:45Z

> Can you briefly introduce your idea? Sorting by height is stable and fast now.

I haven't looked into the details yet. Is sorting by height guaranteed to sort expressions by child-parent? I said the current sorting is not reliable because it might miss some cases: two expressions with no child-parent relation have no clear comparison order. So sorting is somewhat unreliable for expressions. Does sorting by height solve it?

maropu · 2021-07-01T01:33:37Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+ * Instead of appending to a mutable list/buffer of Expressions, just update the "flattened"
+ * useCount in this wrapper in-place.
+ *
+ * This also tracks the "height" of the expression, so that we can return expressions with smaller


the "height" of the expression -> track the "height" of common subexpressions?

maropu · 2021-07-01T01:47:03Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+    if (!skip && !addExprToMap(expr, map)) {
+      val height = childrenToRecurse(expr).map(addExprTree0(_, map))
+        .reduceOption(_ max _).map(_ + 1).getOrElse(0)
+      map(ExpressionEquals(expr)).height = height


(My comment is the same as @viirya's.) Can we always judge whether an expr is a parent of another expr from this height? It seems this height depends on the map state, so the true height value can change after the assignment? For this purpose, can't we simply use the height of an expression instead?

Given that @viirya will remove the sort, I think this issue doesn't matter now (not worse than before). In addCommonExprs, I only use this height to do filtering, which should be OK.

> For this purpose, can't we simply use the height of an expression instead?

makes sense. A true height (from root to the furthest leaf) is good enough to quickly check child-parent relationship.
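
A small sketch of the "true height" idea on a toy tree, using the same `reduceOption(_ max _).map(_ + 1).getOrElse(0)` shape as the diff above. `height`, `contains` and `isDescendant` are hypothetical helpers; the point is only that a descendant always has a strictly smaller height, so the height comparison can reject candidates before any tree walk.

```scala
object HeightSketch {
  // Toy expression tree (assumption: a stand-in for Catalyst's Expression).
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  // Structural height: 0 for a leaf, otherwise 1 + the tallest child.
  def height(e: Expr): Int =
    e.children.map(height).reduceOption(_ max _).map(_ + 1).getOrElse(0)

  // Full tree walk: does `candidate` occur strictly below `parent`?
  def contains(parent: Expr, candidate: Expr): Boolean =
    parent.children.exists(c => c == candidate || contains(c, candidate))

  // Cheap pre-filter by height first, then the expensive walk.
  def isDescendant(candidate: Expr, parent: Expr): Boolean =
    height(candidate) < height(parent) && contains(parent, candidate)
}
```

Note that a smaller height is necessary but not sufficient: `a + c` has a smaller height than `(a + b) + c` without being one of its subexpressions, which is why the walk is still needed after the filter.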

maropu · 2021-07-01T01:48:36Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

Is it better to backport this part into branch-3.1/3.0?

cloud-fan · 2021-07-01T02:54:04Z

> So sorting is somehow unreliable for expressions. Does sorting by height solve it?

As long as the sorting algorithm is stable (i.e. it retains the original input order for elements with the same height), the result is stable. It's similar to sorting strings by length.
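
To illustrate the point about stability: Scala's `sortBy` is documented as a stable sort, so entries with equal height keep their input order. The expression names and heights below are made up for the example.

```scala
object StableSortSketch {
  // (expression text, height) pairs in their original input order; values are illustrative.
  val exprs: Seq[(String, Int)] =
    Seq("a + b + c" -> 2, "x + y" -> 1, "a + b" -> 1, "p * q" -> 1)

  // Sort by height only; ties keep their relative input order.
  val sorted: Seq[String] = exprs.sortBy(_._2).map(_._1)
}
```

Here `sorted` is `Seq("x + y", "a + b", "p * q", "a + b + c")`: the three height-1 entries stay in input order and the height-2 parent comes last.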

cloud-fan · 2021-07-01T03:10:48Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

    assert(a.hashCode != b3.hashCode)
    assert(a.semanticEquals(b3))
  }



The changes in this file are to adapt to the new API in EquivalentExpressions, e.g.
getAllEquivalentExprs() -> getAllExprStates
getEquivalentExprs(oneA).size == 1 -> getExprState(oneA).get.useCount == 1
getEquivalentExprs(oneA).exists(_ eq oneA) -> getExprState(oneA).exists(_.expr eq oneA)

SparkQA · 2021-07-01T04:41:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44992/

SparkQA · 2021-07-01T05:19:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44992/

SparkQA · 2021-07-01T08:41:34Z

Test build #140480 has finished for PR 33142 at commit 92786a6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExpressionStats(expr: Expression)(var useCount: Int = 1)

cfmcgrady · 2021-07-02T10:25:46Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

-      exprSet.intersect(otherExprSet)
+      map: mutable.HashMap[ExpressionEquals, ExpressionStats]): Unit = {
+    assert(exprs.length > 1)
+    var localEquivalenceMap = mutable.HashMap.empty[ExpressionEquals, ExpressionStats]


This also fixes the following: previously, for Or(Coalesce(expr1, expr2, expr2), Coalesce(expr1, expr2, expr2)), expr2 would be extracted and considered a common subexpression. Now, no subexpression is extracted.

viirya · 2021-07-03T05:31:47Z

@maropu Any more comments? Otherwise I will merge this tomorrow. Thanks.

viirya · 2021-07-03T15:28:13Z

Thanks! Merging to master/branch-3.2.

…icient

### What changes were proposed in this pull request?

This PR uses 2 ideas to make `EquivalentExpressions` more efficient:
1. do not keep all the equivalent expressions, we only need a count
2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in `addCommonExprs`

This PR also fixes several small bugs (exposed by the refactoring), please see PR comments.

### Why are the changes needed?

code cleanup and small perf improvement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33142 from cloud-fan/codegen.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit e6ce220)
Signed-off-by: Liang-Chi Hsieh <[email protected]>

github-actions bot added the SQL label Jun 29, 2021

This was referenced Jun 30, 2021

[SPARK-33337][SQL] Support subexpression elimination in branches of conditional expressions #30245

Closed

[SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions #32987

Open

cloud-fan force-pushed the codegen branch from e1362da to 28371e8 Compare June 30, 2021 14:21

Refactor EquivalentExpressions to make it more efficient

f925cc9

cloud-fan force-pushed the codegen branch from 28371e8 to f925cc9 Compare June 30, 2021 14:30

maropu reviewed Jul 1, 2021

address comments

92786a6

cloud-fan force-pushed the codegen branch from 97f92e8 to 92786a6 Compare July 1, 2021 03:03

cfmcgrady reviewed Jul 2, 2021

viirya approved these changes Jul 3, 2021

viirya closed this in e6ce220 Jul 3, 2021

HyukjinKwon reviewed Jul 12, 2021
