[SPARK-36073][SQL] EquivalentExpressions fixes and improvements #33281

peter-toth · 2021-07-09T12:18:20Z

What changes were proposed in this pull request?

This PR:

Fixes the performance issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455 and partially fixed in https://github.com/apache/spark/pull/33142/files#r660897248 using a new option to remove expressions from equivalence maps.

Fixes a bug with identifying common expressions in conditional expressions (a side effect of the new approach above). After this PR, add will be a common subexpression in the following example:

val add = Add(Literal(1), Literal(2))
val ifExpr1 = If(Literal(true), add, Literal(3))
val ifExpr3 = If(GreaterThan(add, Literal(4)), Add(ifExpr1, add), Multiply(ifExpr1, add))
var equivalence = new EquivalentExpressions
equivalence.addExprTree(ifExpr3)

Fixes a bug where transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions.
After this PR, transparent will not be a common subexpression in the following example:
```
val add = Add(Literal(1), Literal(2))
val transparent = PromotePrecision(add)
var equivalence = new EquivalentExpressions
equivalence.addExprTree(transparent)
```

Why are the changes needed?

Bugfix + performance improvement.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing + new UTs.

SparkQA · 2021-07-09T13:08:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45363/

SparkQA · 2021-07-09T13:43:53Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45363/

SparkQA · 2021-07-09T16:19:29Z

Test build #140857 has started for PR 33281 at commit 7786c0c.

SparkQA · 2021-07-09T16:43:21Z

Test build #140852 has finished for PR 33281 at commit 102942c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExpressionStats(expr: Expression)(var useCount: Int)

SparkQA · 2021-07-09T16:59:43Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45368/

peter-toth · 2021-07-10T09:43:26Z

cc @cloud-fan, @Kimahriman, @maropu, @viirya

Kimahriman · 2021-07-10T13:58:56Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

What was the behavior of this?

This test actually proves that the new updateExprTree(stats.expr, localEquivalenceMap, -stats.useCount) approach fixes a bug as well.
Before this PR, equivalence.getAllExprStates(1) returned nothing because the notChild filter at https://github.com/apache/spark/pull/33281/files#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eL90-L92 filtered out every child expression (including add) of the common expression ifExpr1, whereas it should filter out only the children "defined by" childrenToRecurse() and commonChildrenToRecurse(). In this PR, removing ifExpr1 from localEquivalenceMap keeps add in localEquivalenceMap. The second iteration of the loop then adds add to map, so add ends up with useCount = 2.
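To make the signed-count idea concrete, here is a minimal, self-contained sketch. It is a toy model, not Spark's code: Expr, Lit, Add, Mul, Counts, updateInMap and updateTree are hypothetical names chosen for illustration, and the map is keyed by plain case-class equality rather than by canonicalized expressions. It only shows how applying a negative use count undoes an earlier add while leaving a child's independent uses in place.

```scala
import scala.collection.mutable

object UseCountSketch {
  // Hypothetical toy expression model; these are NOT Spark's Expression classes.
  sealed trait Expr { def children: Seq[Expr] }
  final case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  final case class Mul(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  type Counts = mutable.HashMap[Expr, Int]

  // Applies `delta` to the count of `expr`. Returns true only when `expr` was
  // already in the map and is still there afterwards, in which case the caller
  // leaves its children untouched.
  private def updateInMap(expr: Expr, counts: Counts, delta: Int): Boolean =
    counts.get(expr) match {
      case Some(old) =>
        val updated = old + delta
        if (updated > 0) { counts(expr) = updated; true }
        else if (updated == 0) { counts -= expr; false }
        else throw new IllegalStateException(s"Negative use count for $expr")
      case None =>
        if (delta > 0) { counts(expr) = delta; false }
        else throw new IllegalStateException(s"Cannot remove missing $expr")
    }

  // Adds (delta > 0) or removes (delta < 0) an expression tree. Children are
  // visited only when the parent was newly inserted or fully removed, and they
  // are adjusted by one step in the same direction.
  def updateTree(expr: Expr, counts: Counts, delta: Int = 1): Unit = {
    val skip = delta == 0 || expr.children.isEmpty // leaves are never tracked
    if (!skip && !updateInMap(expr, counts, delta)) {
      val step = math.signum(delta)
      expr.children.foreach(updateTree(_, counts, step))
    }
  }

  def demo(): Map[Expr, Int] = {
    val add = Add(Lit(1), Lit(2))
    val outer = Mul(add, Lit(3))
    val counts: Counts = mutable.HashMap.empty[Expr, Int]
    updateTree(outer, counts)     // outer -> 1, add -> 1
    updateTree(add, counts)       // add -> 2 (already present, no recursion)
    updateTree(outer, counts, -1) // outer removed, add back to 1
    counts.toMap
  }
}
```

In this toy run, removing outer takes back only the use of add that outer contributed, so the independent use of add survives in the map. That mirrors, in simplified form, why add stays in localEquivalenceMap once ifExpr1 has been removed.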

Kimahriman · 2021-07-10T13:59:51Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

Does transparently canonicalized mean PromotePrecision(add).canonicalized == add.canonicalized?

Yes, maybe I could rephrase this if it doesn't make sense.
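As an illustration of what "transparently canonicalized" means for use counting, here is a small self-contained sketch. It is a toy model under stated assumptions: Expr, Lit, Add, Transparent, canonical and countUses are hypothetical names, Transparent merely stands in for a wrapper like PromotePrecision, and the skipTransparent guard is one possible way to avoid the double count, not necessarily what this PR implements.

```scala
import scala.collection.mutable

object TransparentSketch {
  // Hypothetical toy expression model; these are NOT Spark's Expression classes.
  sealed trait Expr {
    def children: Seq[Expr]
    def canonical: Expr = this
  }
  final case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    override def canonical: Expr = Add(left.canonical, right.canonical)
  }
  // Stand-in for a wrapper whose canonical form is simply its child's.
  final case class Transparent(child: Expr) extends Expr {
    def children: Seq[Expr] = Seq(child)
    override def canonical: Expr = child.canonical
  }

  // Counts every non-leaf node of `expr`, keyed by canonical form.
  // With skipTransparent = true the wrapper itself is not counted.
  def countUses(expr: Expr, skipTransparent: Boolean): Map[Expr, Int] = {
    val counts = mutable.HashMap.empty[Expr, Int]
    def visit(e: Expr): Unit = e match {
      case Transparent(child) if skipTransparent => visit(child)
      case _ if e.children.isEmpty => () // leaves are never tracked
      case _ =>
        val key = e.canonical
        counts(key) = counts.getOrElse(key, 0) + 1
        e.children.foreach(visit)
    }
    visit(expr)
    counts.toMap
  }
}
```

In this toy model, Transparent(add) and add share one map key, so a naive walk records two uses for an expression that occurs only once and would report it as common. Skipping the wrapper leaves a single use.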

viirya · 2021-07-10T16:00:48Z

Fixes the performance issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455 and partially fixed in https://github.com/apache/spark/pull/33142/files#r660897248 using a new option to remove expressions from equivalence maps.
Fixes an issue with identifying common expressions in conditional expressions (a side effect of the above).
Fixes the issue of transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions.

Could you explain more about the three points listed in the description?

viirya · 2021-07-10T16:22:18Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

Oh, I see your idea here. Is there a significant performance difference? The current approach is simpler. The filtering in localEquivalenceMap is based on height, so it should be very fast. Is this still a performance bottleneck?

My concern is that this change complicates the computation of useCount and will make it harder to debug in the future. Until we are certain that this is a real performance bottleneck, this looks like a premature optimization.

Honestly, I'm not sure how significant the difference is. If we have deep expressions in localEquivalenceMap, then filtering by height (as done before this PR) might not help much.
The new code in this PR might look a bit complicated at first, but it is actually very simple: we just remove expressions from localEquivalenceMap using the reverse of addExprTree().

This new approach also fixes a bug tested here: https://github.com/apache/spark/pull/33281/files#r667343014

I see. I'd rather consider it an improvement, as it doesn't cause a query failure or a codegen failure; it only fails to identify a common subexpression.

That said, we don't need to hurry on this for 3.2.

All right, I've changed ticket type to improvement.

peter-toth · 2021-07-10T17:46:29Z

Fixes the performance issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455 and partially fixed in https://github.com/apache/spark/pull/33142/files#r660897248 using a new option to remove expressions from equivalence maps.
Fixes an issue with identifying common expressions in conditional expressions (a side effect of the above).
Fixes the issue of transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions.

Could you explain more about the three points listed in the description?

Sure. I've updated the description and added some examples.
If we consider this PR too complex, then maybe I should split the 2 commits into 2 separate PRs. The first commit covers the first 2 points; the second commit covers the 3rd.

peter-toth · 2021-07-28T08:40:12Z

Gentle ping @viirya, @cloud-fan.

peter-toth · 2021-10-05T11:40:44Z

@viirya, @cloud-fan I wonder if we can move forward with this?

SparkQA · 2021-10-29T04:11:40Z

Test build #144726 has finished for PR 33281 at commit 7786c0c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExpressionStats(expr: Expression)(var useCount: Int)

allisonwang-db · 2021-11-05T21:35:40Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

Let's follow SPARK-33539 and add a new method in QueryExecutionErrors.

Thanks @allisonwang-db. Added it in 227cad1

SparkQA · 2021-11-09T16:59:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49506/

SparkQA · 2021-11-09T17:47:34Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49506/

SparkQA · 2021-11-09T20:48:32Z

Test build #145034 has finished for PR 33281 at commit 227cad1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-11-10T14:18:00Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+            map -= wrapper
+            false
+          } else {
+            // Should not happen


If this should not happen, I think throwing IllegalStateException is better as it's a bug. QueryExecutionErrors is for user-facing errors.

Throwing IllegalStateException looks reasonable to me. Fixed in c7c7016

cloud-fan · 2021-11-10T14:18:30Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+            map.put(wrapper, ExpressionStats(expr)(useCount))
+          } else {
+            // Should not happen
+            throw QueryExecutionErrors.updateEquivalentExpressionsError(expr, map, useCount)


Fixed in c7c7016

SparkQA · 2021-11-10T16:53:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49538/

SparkQA · 2021-11-10T17:52:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49538/

SparkQA · 2021-11-10T20:50:28Z

Test build #145069 has finished for PR 33281 at commit c7c7016.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-11-11T04:49:08Z

thanks, merging to master!

peter-toth · 2021-11-11T08:25:02Z

Thanks for the review @cloud-fan!

github-actions bot added the SQL label Jul 9, 2021

peter-toth changed the title ~~[SPARK-36073][SQL] SubExpr elimination should include common child exprs of conditional expressions~~ [WIP][SPARK-36073][SQL] SubExpr elimination should include common child exprs of conditional expressions Jul 9, 2021

peter-toth changed the title ~~[WIP][SPARK-36073][SQL] SubExpr elimination should include common child exprs of conditional expressions~~ [SPARK-36073][SQL] EquivalentExpressions fixes and improvement Jul 10, 2021

peter-toth changed the title ~~[SPARK-36073][SQL] EquivalentExpressions fixes and improvement~~ [SPARK-36073][SQL] EquivalentExpressions fixes and improvements Jul 10, 2021

peter-toth mentioned this pull request Jul 10, 2021

[SPARK-35940][SQL] Refactor EquivalentExpressions to make it more efficient #33142

Closed

Kimahriman reviewed Jul 10, 2021

viirya reviewed Jul 10, 2021

allisonwang-db reviewed Nov 5, 2021

peter-toth added 3 commits November 9, 2021 13:46

[SPARK-36073][SQL] SubExpr elimination should include common child exprs of conditional expressions

d454b06

fix transparently canonicalized expressions

dc9dc31

fix comments and review findings

227cad1

peter-toth force-pushed the SPARK-36073-sub-expr-common-children branch from 7786c0c to 227cad1 Compare November 9, 2021 15:38

cloud-fan approved these changes Nov 10, 2021

fix exception

c7c7016

peter-toth force-pushed the SPARK-36073-sub-expr-common-children branch from 7c4d506 to c7c7016 Compare November 10, 2021 15:57

cloud-fan closed this in f153029 Nov 11, 2021
