@viirya
Member

@viirya viirya commented May 19, 2021

What changes were proposed in this pull request?

This patch sorts equivalent expressions based on their child-parent relation.

Why are the changes needed?

EquivalentExpressions maintains a map of equivalent expressions. It is currently a HashMap, so the insertion order is not guaranteed to be preserved on retrieval. Subexpression elimination relies on retrieving subexpressions from the map. If there are child-parent relationships among the subexpressions, we want the child expressions to come before their parent expressions, so that we can replace the child expressions inside the parent expressions with the subexpression evaluation.

For example, suppose we have two different expressions, add = Add(Literal(1), Literal(2)) and Add(Literal(3), add).

Case 1: child subexpr comes first.

addExprTree(add)
addExprTree(Add(Literal(3), add))
addExprTree(Add(Literal(3), add))

Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions.

addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the map first, then add `add` into the map
addExprTree(add)
addExprTree(Add(Literal(3), add))

Since we sort the equivalent expressions anyway, we don't need LinkedHashMap; sorting alone is enough.
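To make the ordering requirement concrete, here is a minimal toy sketch. It is not Spark code: `Expr`, `Lit`, `Add`, `Ref`, `Binding`, `substitute`, and `bindInOrder` are hypothetical stand-ins for Catalyst expressions and the code-generation step. It only illustrates that a parent can reuse a child's result when the child is processed first.

```scala
object ChildFirstDemo {
  // Toy stand-ins for Catalyst expressions; these are not Spark's classes.
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  // Placeholder for "read the already-computed result of subexpression `name`".
  final case class Ref(name: String) extends Expr

  // One eliminated subexpression: its variable name, original tree, and rewritten body.
  final case class Binding(name: String, original: Expr, body: Expr)

  // Replace every occurrence of `target` inside `e` with `replacement`.
  def substitute(e: Expr, target: Expr, replacement: Expr): Expr =
    if (e == target) replacement
    else e match {
      case Add(l, r) => Add(substitute(l, target, replacement), substitute(r, target, replacement))
      case other     => other
    }

  // Process subexpressions in the given order; each body may only reuse
  // the results of subexpressions that were bound before it.
  def bindInOrder(commonExprs: Seq[Expr]): Vector[Binding] =
    commonExprs.zipWithIndex.foldLeft(Vector.empty[Binding]) {
      case (bound, (expr, i)) =>
        val body = bound.foldLeft(expr)((e, b) => substitute(e, b.original, Ref(b.name)))
        bound :+ Binding(s"subExpr$i", expr, body)
    }

  val add: Expr = Add(Lit(1), Lit(2))
  val parent: Expr = Add(Lit(3), add)

  // Child first: the parent's body refers to the child's result.
  val childFirst: Vector[Binding] = bindInOrder(Seq(add, parent))
  // Parent first: the parent's body still contains the child inline.
  val parentFirst: Vector[Binding] = bindInOrder(Seq(parent, add))
}
```

In this sketch, `childFirst(1).body` is `Add(Lit(3), Ref("subExpr0"))`, while `parentFirst(0).body` is the full parent tree, so the child would be evaluated a second time inside it.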

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

@viirya
Member Author

viirya commented May 19, 2021

cc @cloud-fan @maropu @dongjoon-hyun

@github-actions github-actions bot added the SQL label May 19, 2021
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@maropu
Member

maropu commented May 19, 2021

The fix looks fine. Is it difficult to add some tests for that case?

@viirya viirya marked this pull request as draft May 19, 2021 04:17
@SparkQA

SparkQA commented May 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43221/

@viirya
Member Author

viirya commented May 19, 2021

I figured out this change makes sense. But the description is not correct. I will update it later.

@viirya
Member Author

viirya commented May 19, 2021

> The fix looks fine. Is it difficult to add some tests for that case?

I couldn't come up with a test that fails before this change and succeeds after it. The retrieval order was fine in my tests, but it is not guaranteed.

@viirya viirya marked this pull request as ready for review May 19, 2021 04:49
@SparkQA

SparkQA commented May 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43221/

@viirya
Member Author

viirya commented May 19, 2021

Hmm, I found a corner case where LinkedHashMap doesn't work. I'm going to update the PR and add a test case.

@viirya viirya changed the title from "[SPARK-35439][SQL] Use LinkedHashMap to guarantee traversing with the order they were inserted" to "[SPARK-35439][SQL] Children subexpr should come first than parent subexpr" May 19, 2021
@viirya
Member Author

viirya commented May 19, 2021

Please take another look. I found a corner case and added a test case. Thanks. cc @cloud-fan @maropu @dongjoon-hyun

@SparkQA

SparkQA commented May 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43224/


/**
* Orders [Expression] by parent/child relations. The child expression is smaller
* than parent expression.
Member

> If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation.

How about leaving this comment here? I think this explanation looks clearer.

Member Author

Sure.

@SparkQA

SparkQA commented May 19, 2021

Test build #138700 has finished for PR 32586 at commit 7b6b589.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43225/

@SparkQA

SparkQA commented May 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43225/

@SparkQA

SparkQA commented May 19, 2021

Test build #138703 has finished for PR 32586 at commit f777855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExpressionOrdering extends Ordering[Expression]

@SparkQA

SparkQA commented May 19, 2021

Test build #138704 has finished for PR 32586 at commit 71b67cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 19, 2021

If there are no more comments, I will merge this later today. Thanks.


// For each expression, the set of equivalent expressions.
- private val equivalenceMap = mutable.HashMap.empty[Expr, mutable.ArrayBuffer[Expression]]
+ private val equivalenceMap = mutable.LinkedHashMap.empty[Expr, mutable.ArrayBuffer[Expression]]
Contributor

after the new change, does it still need to be LinkedHashMap?

Member Author

@viirya viirya May 19, 2021

Yea, it can be reverted back to HashMap, since we are going to sort anyway.

@SparkQA

SparkQA commented May 19, 2021

Test build #138719 has finished for PR 32586 at commit c143ce2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 19, 2021

Test build #138718 has finished for PR 32586 at commit 55bc9ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExpressionContainmentOrdering extends Ordering[Expression]

override def compare(x: Expression, y: Expression): Int = {
  if (x.semanticEquals(y)) {
    0
  } else if (x.find(_.semanticEquals(y)).isDefined) {
Contributor

can we run TPCDSQuerySuite and see the time of the query compilation phase? This looks like a very expensive sort.

Member Author

Ok. Let me compare before/after this PR.

Member Author

BTW, I think a better approach is to sort after filtering (e.g. size > 1 in most use cases), because the number of sub-exprs to sort should then be smaller.

Member Author

I changed how getAllEquivalentExprs is called, so we filter first and then sort.
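As a rough illustration of why filtering first helps, the toy sketch below counts comparator calls for both orders of operation. Everything in it is hypothetical (`CountingOrdering`, the `groups` data, and the use of integers in place of expressions); it does not call `getAllEquivalentExprs` or any other Spark API.

```scala
object FilterThenSortDemo {
  // Wraps an Ordering and counts how many times compare is invoked.
  final class CountingOrdering[A](underlying: Ordering[A]) extends Ordering[A] {
    var calls: Int = 0
    override def compare(x: A, y: A): Int = {
      calls += 1
      underlying.compare(x, y)
    }
  }

  // Each inner Seq stands in for one group of equivalent expressions,
  // represented here by made-up integers instead of expression trees.
  val groups: Seq[Seq[Int]] =
    Seq(Seq(7), Seq(3, 3), Seq(9), Seq(5), Seq(1, 1, 1), Seq(4))

  // Sort every group, then keep only those with more than one member.
  def sortThenFilter(): (Seq[Seq[Int]], Int) = {
    val ord = new CountingOrdering[Int](Ordering.Int)
    val result = groups.sortBy(_.head)(ord).filter(_.size > 1)
    (result, ord.calls)
  }

  // Keep only groups with more than one member, then sort the survivors.
  def filterThenSort(): (Seq[Seq[Int]], Int) = {
    val ord = new CountingOrdering[Int](Ordering.Int)
    val result = groups.filter(_.size > 1).sortBy(_.head)(ord)
    (result, ord.calls)
  }
}
```

Both functions return the same groups, but the second one only ever compares the groups that survive the filter, which matters when each comparison walks an expression tree.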

Member Author

Ran TPCDSQuerySuite.

Before (master):

23.233160578 seconds 
22.501728011 seconds
23.547332524 seconds

After:

23.995751468 seconds 
22.262832936 seconds
21.503776059 seconds  

I don't see a significant difference there.

@HyukjinKwon
Member

nit but maybe we should also update the PR description:

> Replacing HashMap with LinkedHashMap can deal with it.

@dongjoon-hyun
Member

Ya, +1 for @HyukjinKwon 's comment, too.

@viirya
Member Author

viirya commented May 20, 2021

Thanks @HyukjinKwon @dongjoon-hyun. Yea, updated the PR description.

@SparkQA

SparkQA commented May 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43290/

@SparkQA

SparkQA commented May 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43290/

@SparkQA

SparkQA commented May 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43293/

@SparkQA

SparkQA commented May 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43293/

@SparkQA

SparkQA commented May 21, 2021

Test build #138767 has finished for PR 32586 at commit b517df8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2021

Test build #138770 has finished for PR 32586 at commit 3819bf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 21, 2021

Any more concerns or comments around this improvement? Thanks for review.

@viirya
Member Author

viirya commented May 21, 2021

Thanks all! Merging to master.

@viirya viirya closed this in 066944c May 21, 2021
@Kimahriman
Contributor

Tested this out yesterday and ran into issues occasionally with

java.lang.IllegalArgumentException: Comparison method violates its general contract!

I guess even though we don't care about the order of non-overlapping subexpressions, the comparator still needs to satisfy certain properties for the sort to work: https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html#compare-T-T-

Would it make sense to just sort based on the number of nodes in the tree? I would think a subexpression can only contain another subexpression if it has more nodes in its tree, though I'm not sure whether there are any weird cases where that's not true. Alternatively, would it fix things to return 1 or -1 only when x.find(_.semanticEquals(y)).isDefined or y.find(_.semanticEquals(x)).isDefined, and 0 otherwise? I'm not sure whether that would still be properly transitive and satisfy the other required comparator properties.
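The node-count idea can be sketched as follows. This is a toy model under stated assumptions, not Spark's implementation: `Expr`, `Lit`, `Add`, `nodeCount`, `contains`, and `byNodeCount` are hypothetical names, and plain structural equality is used where Spark would use `semanticEquals`, so any canonicalization edge cases are not covered here.

```scala
object NodeCountOrderingDemo {
  // Toy stand-ins for Catalyst expressions; these are not Spark's classes.
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Number of nodes in the expression tree.
  def nodeCount(e: Expr): Int = e match {
    case Lit(_)    => 1
    case Add(l, r) => 1 + nodeCount(l) + nodeCount(r)
  }

  // True if `sub` is `e` itself or occurs anywhere inside it.
  def contains(e: Expr, sub: Expr): Boolean = e == sub || (e match {
    case Add(l, r) => contains(l, sub) || contains(r, sub)
    case Lit(_)    => false
  })

  // Integers are totally ordered, so this ordering is consistent by construction,
  // and a strict child always has fewer nodes than any tree that contains it.
  val byNodeCount: Ordering[Expr] = Ordering.by[Expr, Int](nodeCount)

  val add: Expr = Add(Lit(1), Lit(2))
  val parent: Expr = Add(Lit(3), add)
  val unrelated: Expr = Add(Lit(4), Lit(5))

  val sorted: Seq[Expr] = Seq(parent, unrelated, add).sorted(byNodeCount)
}
```

Here `add` lands before `parent` in `sorted`, and unrelated expressions of equal size simply compare as equal, which keeps the comparison symmetric.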

@viirya
Member Author

viirya commented Jun 8, 2021

Yea, it seems so. For expressions without a parent-child relation, the relative order doesn't matter, but the comparison still needs to be consistent. I will submit a follow-up to fix it.

maropu pushed a commit that referenced this pull request Jun 11, 2021
… sort unrelated expressions

### What changes were proposed in this pull request?

This is a followup of #32586. We introduced `ExpressionContainmentOrdering` to sort common expressions according to their parent-child relations. For unrelated expressions, the ordering previously returned -1, which is not correct and can lead to transitivity issues.
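A minimal toy sketch of the kind of inconsistency described above, assuming a comparator whose fall-through branch returns -1 for unrelated expressions. The types and names (`Expr`, `Lit`, `Add`, `contains`, `unrelatedIsMinusOne`) are hypothetical and only approximate the shape of the original ordering; structural equality stands in for `semanticEquals`.

```scala
object ContractViolationDemo {
  // Toy stand-ins for Catalyst expressions; these are not Spark's classes.
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // True if `sub` is `e` itself or occurs anywhere inside it.
  def contains(e: Expr, sub: Expr): Boolean = e == sub || (e match {
    case Add(l, r) => contains(l, sub) || contains(r, sub)
    case Lit(_)    => false
  })

  // Hypothetical reconstruction: anything that is not "x contains y" falls through to -1,
  // including pairs that are completely unrelated.
  val unrelatedIsMinusOne: Ordering[Expr] = new Ordering[Expr] {
    override def compare(x: Expr, y: Expr): Int =
      if (x == y) 0
      else if (contains(x, y)) 1
      else -1
  }

  val a: Expr = Add(Lit(1), Lit(2))
  val b: Expr = Add(Lit(4), Lit(5))

  // Both directions claim "less than", so sgn(compare(a, b)) != -sgn(compare(b, a)).
  val ab: Int = unrelatedIsMinusOne.compare(a, b)
  val ba: Int = unrelatedIsMinusOne.compare(b, a)
}
```

With two unrelated expressions, both `ab` and `ba` are -1, which breaks the sign-symmetry requirement of `java.util.Comparator` and is the sort of inconsistency that can surface as "Comparison method violates its general contract!".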

### Why are the changes needed?

To fix the possible transitivity issue of `ExpressionContainmentOrdering`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #32870 from viirya/SPARK-35439-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Feb 22, 2022
…expr

Closes apache#32586 from viirya/use-listhashmap.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
@viirya viirya deleted the use-listhashmap branch December 27, 2023 18:25