-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19712][SQL] Move subquery rewrite to beginning of optimizer #25258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #108195 has finished for PR 25258 at commit
|
|
Also cc @maryannxue @hvanhovell |
| * | ||
| * p2 is usually inserted by this rule and useless, p1 could prune the columns anyway. | ||
| */ | ||
| object FinalColumnPruning extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to separate the column pruning rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a subquery related bug fix: #25204
Is it related to your change as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I took a very quick look. It does not seem related to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to separate the column pruning rule?
Perhaps there is a better way to do this. But here is the problem. Please take a look at RewriteSubquerySuite: Column pruning after rewriting predicate subquery. This test case is expecting that we perform column pruning to filter out un-needed columns before the join. Here is the input plan :
Project [a#0]
+- Join LeftSemi, (a#0 = x#2)
+-LocalRelation [a#0, b#1]
+- LocalRelation [x#2]
Due to the presence Project on top of LeftSemi, the regular ColumnPruning rule is not able to add the Project on top of the left child of LeftSemiJoin. This is done to avoid the cycle between ColumnPruning and PushPredicateThroughProject. Thats why i created this FinalColumnPruning rule that does not have the logic to remove the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done to avoid the cycle between ColumnPruning and PushPredicateThroughProject
I don't see a filter in the input plan, how is it related to PushPredicateThroughProject?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan so the LeftSemi/Anti pattern is treated like a Filter in modified ColumnPruning rule. Since we convert the subqueries (which was in Filter form) to join early now, we are basically treating it like a Filter in related rules.
how is it related to PushPredicateThroughProject
Sorry.. just to clarify, In the prior PR, we have added PushDownLeftSemiAntiJoin where a LeftSemi/Anti join is pushed down below Project. So the Cycle would be between ColumnPruning and PushDownLeftSemiAntiJoin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, thanks for explaination!
|
Test build #108229 has finished for PR 25258 at commit
|
022b680 to
afa9357
Compare
|
Test build #108296 has finished for PR 25258 at commit
|
|
Test build #108297 has finished for PR 25258 at commit
|
|
retest this please |
|
Test build #108309 has finished for PR 25258 at commit
|
|
Test build #108347 has finished for PR 25258 at commit
|
| OptimizeSubqueries) :: | ||
| OptimizeSubqueries, | ||
| PullupCorrelatedPredicates, | ||
| RewritePredicateSubquery) :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it affect CBO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Could you please elaborate ? My hope is that we can get better optimized plans as we will expose the full plan to entire set of optimizations ? As an example, lets say we have a rule that can convert leftsemi join to inner join, with this, we can hope to have this conversion and if we do, then CBO will be able to impact positively to re-order joins as required. Please let me know if i am missing something..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our CBO is based on the pattern matching. We only support InnerLike joins.
Is that possible the pattern matched before but it does not matched after we convert subquery to join? RewritePredicateSubquery adds LeftSemi or LeftAnti joins into the original plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile At the moment, i am unable to think of a case when by moving the rewrite up would impact the pattern matching in CBO in a negative way since RewritePredicateSubquery needs to happen any way (it just happens later) . The leftsemi/anti join should end up being in the same position in the plan tree as we have the same pushdown rules for leftsemi and leftanti as there are for filters. If you have a case in mind, i can give it a quick try ?
|
Test build #115505 has finished for PR 25258 at commit
|
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Currently predicate subqueries (IN/EXISTS) are converted to Joins at the end of optimizer in RewritePredicateSubquery. This change moves the rewrite close to beginning of optimizer. The original idea was to keep the subquery expressions in Filter form so that we can push them down as deep as possible. One disadvantage is that, after the subqueries are rewritten in join form, they are not subjected to further optimizations. In this change, we convert the subqueries to join form early in the rewrite phase.
I will combine the pullupCorrelatedPredicates and RewritePredicateSubquery in a follow-up PR.
How was this patch tested?
A new test suite
LeftSemiAntiJoinAndSubqueryEquivalencySuiteis added to verify that the correlated subqueries and queries that explicitly use leftsemi/anti joins result in same plan after optmization.