-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-35439][SQL] Children subexpr should come first than parent subexpr #32586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
The fix looks fine. Is it difficult to add some tests for that case? |
|
Kubernetes integration test starting |
|
I figured out this change makes sense. But the description is not correct. I will update it later. |
I don't come out a test that fails before but succeeds after this. I think the retrieving order is okay during my test. But it is not guaranteed. |
|
Kubernetes integration test status success |
|
Hmm, I found corner case that LinkedHashMap doesn't work here. Going to update and adding test case. |
|
Please take another look. I found corner case and added a test case. Thanks. cc @cloud-fan @maropu @dongjoon-hyun |
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
|
||
| /** | ||
| * Orders [Expression] by parent/child relations. The child expression is smaller | ||
| * than parent expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation.
How about leaving this comment here? I think this explanation looks clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
|
Test build #138700 has finished for PR 32586 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #138703 has finished for PR 32586 at commit
|
|
Test build #138704 has finished for PR 32586 at commit
|
|
If no more comments, I will merge this later today. Thanks. |
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
Show resolved
Hide resolved
|
|
||
| // For each expression, the set of equivalent expressions. | ||
| private val equivalenceMap = mutable.HashMap.empty[Expr, mutable.ArrayBuffer[Expression]] | ||
| private val equivalenceMap = mutable.LinkedHashMap.empty[Expr, mutable.ArrayBuffer[Expression]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after the new change, does it still need to be LinkedHashMap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, can be reverted back to HashMap, if we are going to sort it at all.
|
Test build #138719 has finished for PR 32586 at commit
|
|
Test build #138718 has finished for PR 32586 at commit
|
| override def compare(x: Expression, y: Expression): Int = { | ||
| if (x.semanticEquals(y)) { | ||
| 0 | ||
| } else if (x.find(_.semanticEquals(y)).isDefined) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we run TPCDSQuerySuite and see the time of the query compilation phase? This looks like a very expensive sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Let me compare before/after this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I think better approach is to sort after filter (e.g. size > 1 in most use-case), because the number of sub-exprs should be smaller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the call usage of getAllEquivalentExprs. So we filter it first and then do sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran TPCDSQuerySuite.
Before (master):
23.233160578 seconds
22.501728011 seconds
23.547332524 seconds
After:
23.995751468 seconds
22.262832936 seconds
21.503776059 seconds
I don't see significant difference there.
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
Outdated
Show resolved
Hide resolved
|
nit but maybe we should also update the PR description:
|
|
Ya, +1 for @HyukjinKwon 's comment, too. |
|
Thanks @HyukjinKwon @dongjoon-hyun. Yea, updated the PR description. |
…essions/EquivalentExpressions.scala Co-authored-by: Hyukjin Kwon <[email protected]>
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #138767 has finished for PR 32586 at commit
|
|
Test build #138770 has finished for PR 32586 at commit
|
|
Any more concerns or comments around this improvement? Thanks for review. |
|
Thanks all! Merging to master. |
|
Tested this out yesterday and ran into issues occasionally with I guess even though we don't care about the order of non-overlapping subexpressions, the comparator still needs to satisfy certain properties for the sort to work: https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html#compare-T-T- Would it just make sense to sort based on number of nodes in the tree? I would think a subexpression could only contain another subexpression if it has more expressions in the tree, not sure if there are any weird cases that's not true. Alternatively would only returning 1 or -1 for |
|
Yea, seems like. For non parent-child expressions, the order seems not related, but just need to make it consistent. Will submit a follow up to fix it. |
… sort unrelated expressions ### What changes were proposed in this pull request? This is a followup of #32586. We introduced `ExpressionContainmentOrdering` to sort common expressions according to their parent-child relations. For unrelated expressions, previously the ordering returns -1 which is not correct and can possibly lead to transitivity issue. ### Why are the changes needed? To fix the possible transitivity issue of `ExpressionContainmentOrdering`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #32870 from viirya/SPARK-35439-followup. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
…expr This patch sorts equivalent expressions based on their child-parent relation. `EquivalentExpressions` maintains a map of equivalent expressions. It is `HashMap` now so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation. For example, we have two different expressions `Add(Literal(1), Literal(2))` and `Add(Literal(3), add)`. Case 1: child subexpr comes first. ```scala addExprTree(add) addExprTree(Add(Literal(3), add)) addExprTree(Add(Literal(3), add)) ``` Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions. ``` addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the map first, then add `add` into the map addExprTree(add) addExprTree(Add(Literal(3), add)) ``` As we are going to sort equivalent expressions at all, we don't need `LinkedHashMap` but just do sorting. No Added tests. Closes apache#32586 from viirya/use-listhashmap. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Liang-Chi Hsieh <[email protected]>
What changes were proposed in this pull request?
This patch sorts equivalent expressions based on their child-parent relation.
Why are the changes needed?
EquivalentExpressionsmaintains a map of equivalent expressions. It isHashMapnow so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation.For example, we have two different expressions
Add(Literal(1), Literal(2))andAdd(Literal(3), add).Case 1: child subexpr comes first.
Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions.
As we are going to sort equivalent expressions at all, we don't need
LinkedHashMapbut just do sorting.Does this PR introduce any user-facing change?
No
How was this patch tested?
Added tests.