[SPARK-35439][SQL] Children subexpr should come first than parent subexpr #32586

viirya · 2021-05-19T03:46:08Z

What changes were proposed in this pull request?

This patch sorts equivalent expressions based on their child-parent relation.

Why are the changes needed?

EquivalentExpressions maintains a map of equivalent expressions. It is currently a HashMap, so the order in which entries are retrieved is not guaranteed to match the insertion order. Subexpression elimination relies on retrieving subexpressions from the map. If there are child-parent relationships among the subexpressions, we want the child expressions to come before their parent expressions, so that we can replace child expressions inside parent expressions with the subexpression evaluation.

For example, we have two different expressions: Add(Literal(1), Literal(2)), referred to as `add` below, and Add(Literal(3), add).

Case 1: child subexpr comes first.

addExprTree(add)
addExprTree(Add(Literal(3), add))
addExprTree(Add(Literal(3), add))

Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions.

addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the map first, then add `add` into the map
addExprTree(add)
addExprTree(Add(Literal(3), add))

Since we are going to sort the equivalent expressions anyway, we don't need LinkedHashMap; sorting alone is enough.
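
To illustrate why the retrieval order matters, here is a minimal, self-contained sketch. It does not use Spark's classes: `Expr`, `Lit`, `Add`, `render` and `emit` are hypothetical stand-ins, and structural equality plays the role of `semanticEquals`. Each common subexpression is assigned a variable, and a later subexpression can reuse any variable that was already emitted.

```scala
object ChildFirstSketch {
  // Toy stand-ins for Catalyst expressions (assumption: not Spark's real classes).
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Render `e`, reusing the variable of any subexpression that was already evaluated.
  def render(e: Expr, known: Map[Expr, String]): String =
    known.getOrElse(e, e match {
      case Lit(v)    => v.toString
      case Add(l, r) => s"(${render(l, known)} + ${render(r, known)})"
    })

  // Emit one assignment per common subexpression, in the order given.
  def emit(subexprs: Seq[Expr]): Seq[String] = {
    val start = (Map.empty[Expr, String], Vector.empty[String])
    val (_, lines) = subexprs.zipWithIndex.foldLeft(start) {
      case ((known, out), (e, i)) =>
        val name = s"subExpr$i"
        (known + (e -> name), out :+ s"$name = ${render(e, known)}")
    }
    lines
  }

  val add: Expr = Add(Lit(1), Lit(2))
  val parent: Expr = Add(Lit(3), add)

  val childFirst: Seq[String] = emit(Seq(add, parent))   // parent reuses subExpr0
  val parentFirst: Seq[String] = emit(Seq(parent, add))  // parent recomputes (1 + 2)
}
```

In this toy model, `childFirst` lets the parent refer to the child's variable, while `parentFirst` evaluates `(1 + 2)` twice. That duplicated work is what the child-first order is meant to avoid.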

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

viirya · 2021-05-19T03:46:22Z

cc @cloud-fan @maropu @dongjoon-hyun

dongjoon-hyun

+1, LGTM.

maropu · 2021-05-19T04:07:32Z

The fix looks fine. Is it difficult to add some tests for that case?

SparkQA · 2021-05-19T04:39:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43221/

viirya · 2021-05-19T04:43:31Z

I figured out this change makes sense. But the description is not correct. I will update it later.

viirya · 2021-05-19T04:49:10Z

The fix looks fine. Is it difficult to add some tests for that case?

I couldn't come up with a test that fails before this change but succeeds after it. The retrieval order was fine during my tests, but it is not guaranteed.

SparkQA · 2021-05-19T05:12:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43221/

viirya · 2021-05-19T05:46:56Z

Hmm, I found a corner case where LinkedHashMap doesn't work here. I'm going to update the patch and add a test case.

viirya · 2021-05-19T05:57:52Z

Please take another look. I found a corner case and added a test case. Thanks. cc @cloud-fan @maropu @dongjoon-hyun

SparkQA · 2021-05-19T06:34:50Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43224/

maropu · 2021-05-19T07:24:50Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+
+  /**
+   * Orders [Expression] by parent/child relations. The child expression is smaller
+   * than parent expression.


If there are child-parent relationships among the subexpressions, we want the child expressions to come before their parent expressions, so that we can replace child expressions inside parent expressions with the subexpression evaluation.

How about leaving this comment here? I think this explanation looks clearer.

SparkQA · 2021-05-19T08:07:31Z

Test build #138700 has finished for PR 32586 at commit 7b6b589.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-19T08:48:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43225/

SparkQA · 2021-05-19T10:19:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43225/

SparkQA · 2021-05-19T10:43:47Z

Test build #138703 has finished for PR 32586 at commit f777855.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ExpressionOrdering extends Ordering[Expression]

SparkQA · 2021-05-19T12:27:56Z

Test build #138704 has finished for PR 32586 at commit 71b67cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-05-19T17:55:08Z

If no more comments, I will merge this later today. Thanks.

cloud-fan · 2021-05-19T18:42:05Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala


  // For each expression, the set of equivalent expressions.
-  private val equivalenceMap = mutable.HashMap.empty[Expr, mutable.ArrayBuffer[Expression]]
+  private val equivalenceMap = mutable.LinkedHashMap.empty[Expr, mutable.ArrayBuffer[Expression]]


after the new change, does it still need to be LinkedHashMap?

Yea, it can be reverted back to HashMap, since we are going to sort it anyway.

SparkQA · 2021-05-19T23:08:03Z

Test build #138719 has finished for PR 32586 at commit c143ce2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-19T23:15:20Z

Test build #138718 has finished for PR 32586 at commit 55bc9ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ExpressionContainmentOrdering extends Ordering[Expression]

cloud-fan · 2021-05-20T06:59:50Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+    override def compare(x: Expression, y: Expression): Int = {
+      if (x.semanticEquals(y)) {
+        0
+      } else if (x.find(_.semanticEquals(y)).isDefined) {


can we run TPCDSQuerySuite and see the time of the query compilation phase? This looks like a very expensive sort.

Ok. Let me compare before/after this PR.

BTW, I think a better approach is to sort after filtering (e.g. size > 1 in most use cases), because the number of sub-exprs to sort should then be smaller.

I changed how getAllEquivalentExprs is called, so we now filter first and then sort.

Ran TPCDSQuerySuite.

Before (master):

23.233160578 seconds
22.501728011 seconds
23.547332524 seconds

After:

23.995751468 seconds
22.262832936 seconds
21.503776059 seconds

I don't see a significant difference there.
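
The filter-then-sort idea from this thread can be sketched as follows. This is an illustration only: `commonSubexprs`, `nodeCount` and the toy `Expr` types are hypothetical, and the sort key used here (tree size) is a simple stand-in for the ordering class in the patch.

```scala
object FilterThenSortSketch {
  // Toy stand-ins for Catalyst expressions (assumption: not Spark's real classes).
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Number of nodes in the expression tree.
  def nodeCount(e: Expr): Int = e match {
    case Lit(_)    => 1
    case Add(l, r) => 1 + nodeCount(l) + nodeCount(r)
  }

  // Hypothetical helper: drop groups seen only once, then sort the smaller remainder.
  // `g.head` is safe because every surviving group has at least two elements.
  def commonSubexprs(groups: Seq[Seq[Expr]]): Seq[Seq[Expr]] =
    groups.filter(_.size > 1).sortBy(g => nodeCount(g.head))

  val add: Expr = Add(Lit(1), Lit(2))
  val parent: Expr = Add(Lit(3), add)

  // Three equivalence groups; Lit(7) occurs once and is filtered out before sorting.
  val groups: Seq[Seq[Expr]] = Seq(Seq(parent, parent), Seq(Lit(7)), Seq(add, add))
  val result: Seq[Seq[Expr]] = commonSubexprs(groups)
}
```

Filtering first means the comparator only runs over groups that are real elimination candidates, which is usually a much shorter list than the full map.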

HyukjinKwon · 2021-05-20T10:14:14Z

nit but maybe we should also update the PR description:

Replacing HashMap with LinkedHashMap can deal with it.

dongjoon-hyun · 2021-05-20T20:33:09Z

Ya, +1 for @HyukjinKwon 's comment, too.

viirya · 2021-05-20T20:55:07Z

Thanks @HyukjinKwon @dongjoon-hyun. Yea, updated the PR description.

SparkQA · 2021-05-20T23:08:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43290/

SparkQA · 2021-05-20T23:48:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43290/

SparkQA · 2021-05-20T23:50:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43293/

SparkQA · 2021-05-21T00:26:40Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43293/

SparkQA · 2021-05-21T02:42:44Z

Test build #138767 has finished for PR 32586 at commit b517df8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-21T03:46:37Z

Test build #138770 has finished for PR 32586 at commit 3819bf3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-05-21T07:49:36Z

Any more concerns or comments around this improvement? Thanks for review.

viirya · 2021-05-21T17:47:05Z

Thanks all! Merging to master.

Kimahriman · 2021-06-08T13:31:26Z

Tested this out yesterday and ran into issues occasionally with

java.lang.IllegalArgumentException: Comparison method violates its general contract!

I guess even though we don't care about the order of non-overlapping subexpressions, the comparator still needs to satisfy certain properties for the sort to work: https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html#compare-T-T-

Would it make sense to just sort based on the number of nodes in the tree? I would think a subexpression can only contain another subexpression if it has more nodes in its tree, though I'm not sure whether there are any weird cases where that's not true. Alternatively, would returning 1 or -1 only when x.find(_.semanticEquals(y)).isDefined or y.find(_.semanticEquals(x)).isDefined holds, and 0 otherwise, fix it? I'm not sure whether that is still properly transitive or satisfies the other necessary comparator properties.

viirya · 2021-06-08T16:46:44Z

Yea, seems so. For expressions that are not in a parent-child relation, the order doesn't matter, but the comparator still needs to be consistent. I will submit a follow-up to fix it.
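
The contract problem reported above can be reproduced on a toy model. The sketch below is not Spark code: `Expr`, `contains`, `nodeCount` and both orderings are hypothetical, with structural equality standing in for `semanticEquals`. `containmentOrdering` mirrors the shape of the comparator under review (unrelated expressions fall through to -1), and `sizeOrdering` is the node-count alternative suggested in the comment.

```scala
object ComparatorContractSketch {
  // Toy stand-ins for Catalyst expressions (assumption: not Spark's real classes).
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // True if `sub` occurs anywhere inside `e`, including `e` itself.
  def contains(e: Expr, sub: Expr): Boolean = e == sub || (e match {
    case Add(l, r) => contains(l, sub) || contains(r, sub)
    case Lit(_)    => false
  })

  // Number of nodes in the expression tree.
  def nodeCount(e: Expr): Int = e match {
    case Lit(_)    => 1
    case Add(l, r) => 1 + nodeCount(l) + nodeCount(r)
  }

  // Same shape as the comparator discussed above: unrelated expressions yield -1.
  val containmentOrdering: Ordering[Expr] = new Ordering[Expr] {
    def compare(x: Expr, y: Expr): Int =
      if (x == y) 0 else if (contains(x, y)) 1 else -1
  }

  // Suggested alternative: a parent always has more nodes than any of its children.
  val sizeOrdering: Ordering[Expr] = Ordering.by((e: Expr) => nodeCount(e))

  // One property the Comparator contract requires:
  // sgn(compare(x, y)) == -sgn(compare(y, x)).
  def signSymmetric(ord: Ordering[Expr], x: Expr, y: Expr): Boolean =
    math.signum(ord.compare(x, y)) == -math.signum(ord.compare(y, x))

  val a: Expr = Add(Lit(1), Lit(2))
  val b: Expr = Add(Lit(3), Lit(4)) // unrelated to `a`
  val parent: Expr = Add(Lit(3), a)
  val sortedBySize: Seq[Expr] = Seq(parent, b, a).sorted(sizeOrdering)
}
```

For the unrelated pair `a` and `b`, `containmentOrdering` claims each is smaller than the other, so the sign-symmetry check fails. `sizeOrdering` passes it and still places `a` before `parent`. Sign symmetry is only one of the required properties; this sketch does not check transitivity.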

… sort unrelated expressions

### What changes were proposed in this pull request?

This is a followup of #32586. We introduced `ExpressionContainmentOrdering` to sort common expressions according to their parent-child relations. For unrelated expressions, previously the ordering returns -1 which is not correct and can possibly lead to transitivity issue.

### Why are the changes needed?

To fix the possible transitivity issue of `ExpressionContainmentOrdering`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #32870 from viirya/SPARK-35439-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>

…expr

This patch sorts equivalent expressions based on their child-parent relation.

`EquivalentExpressions` maintains a map of equivalent expressions. It is `HashMap` now so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation.

For example, we have two different expressions `Add(Literal(1), Literal(2))` and `Add(Literal(3), add)`.

Case 1: child subexpr comes first.

```scala
addExprTree(add)
addExprTree(Add(Literal(3), add))
addExprTree(Add(Literal(3), add))
```

Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions.

```
addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the map first, then add `add` into the map
addExprTree(add)
addExprTree(Add(Literal(3), add))
```

As we are going to sort equivalent expressions at all, we don't need `LinkedHashMap` but just do sorting.

No

Added tests.

Closes apache#32586 from viirya/use-listhashmap.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>

Use LinkedHashMap to guarantee traversing with the order they were inserted.

7b6b589

github-actions bot added the SQL label May 19, 2021

dongjoon-hyun approved these changes May 19, 2021

viirya marked this pull request as draft May 19, 2021 04:17

viirya marked this pull request as ready for review May 19, 2021 04:49

maropu approved these changes May 19, 2021

cloud-fan approved these changes May 19, 2021

fix

f777855

viirya changed the title ~~[SPARK-35439][SQL] Use LinkedHashMap to guarantee traversing with the order they were inserted~~ [SPARK-35439][SQL] Children subexpr should come first than parent subexpr May 19, 2021

maropu reviewed May 19, 2021

maropu approved these changes May 19, 2021

Add more comment.

71b67cb

dongjoon-hyun reviewed May 19, 2021

Rename ordering class and add more comment.

55bc9ec

cloud-fan reviewed May 19, 2021

cloud-fan reviewed May 20, 2021

HyukjinKwon reviewed May 20, 2021

dongjoon-hyun mentioned this pull request May 20, 2021

[SPARK-35410][SQL] SubExpr elimination should not include redundant children exprs in conditional expression #32559

Closed

viirya and others added 3 commits May 20, 2021 14:57

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala Co-authored-by: Hyukjin Kwon <[email protected]>

b517df8

sort after filtering.

b13bc65

Merge branch 'use-listhashmap' of github.com:viirya/spark-1 into use-listhashmap

3819bf3

cloud-fan approved these changes May 21, 2021

viirya closed this in 066944c May 21, 2021

viirya mentioned this pull request Jun 10, 2021

[SPARK-35439][SQL][FOLLOWUP] ExpressionContainmentOrdering should not sort unrelated expressions #32870

Closed

viirya deleted the use-listhashmap branch December 27, 2023 18:25
