Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

Currently predicate subqueries (IN/EXISTS) are converted to Joins at the end of optimizer in RewritePredicateSubquery. This change moves the rewrite close to beginning of optimizer. The original idea was to keep the subquery expressions in Filter form so that we can push them down as deep as possible. One disadvantage is that, after the subqueries are rewritten in join form, they are not subjected to further optimizations. In this change, we convert the subqueries to join form early in the rewrite phase.

I will combine the pullupCorrelatedPredicates and RewritePredicateSubquery in a follow-up PR.

How was this patch tested?

A new test suite LeftSemiAntiJoinAndSubqueryEquivalencySuite is added to verify that the correlated subqueries and queries that explicitly use leftsemi/anti joins result in same plan after optmization.

@dilipbiswal
Copy link
Contributor Author

cc @cloud-fan @gatorsmile

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-19712] Move subquery rewrite to beginning of optimizer [SPARK-19712][SQL] Move subquery rewrite to beginning of optimizer Jul 26, 2019
@SparkQA
Copy link

SparkQA commented Jul 26, 2019

Test build #108195 has finished for PR 25258 at commit 393f5b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

gatorsmile commented Jul 26, 2019

Also cc @maryannxue @hvanhovell

*
* p2 is usually inserted by this rule and useless, p1 could prune the columns anyway.
*/
object FinalColumnPruning extends Rule[LogicalPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to separate the column pruning rule?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a subquery related bug fix: #25204

Is it related to your change as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan I took a very quick look. It does not seem related to this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan

why do we need to separate the column pruning rule?

Perhaps there is a better way to do this. But here is the problem. Please take a look at RewriteSubquerySuite: Column pruning after rewriting predicate subquery. This test case is expecting that we perform column pruning to filter out un-needed columns before the join. Here is the input plan :

Project [a#0]                    
 +- Join LeftSemi, (a#0 = x#2)    
   +-LocalRelation [a#0, b#1]       
   +- LocalRelation [x#2]

Due to the presence Project on top of LeftSemi, the regular ColumnPruning rule is not able to add the Project on top of the left child of LeftSemiJoin. This is done to avoid the cycle between ColumnPruning and PushPredicateThroughProject. Thats why i created this FinalColumnPruning rule that does not have the logic to remove the project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done to avoid the cycle between ColumnPruning and PushPredicateThroughProject

I don't see a filter in the input plan, how is it related to PushPredicateThroughProject?

Copy link
Contributor Author

@dilipbiswal dilipbiswal Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan so the LeftSemi/Anti pattern is treated like a Filter in modified ColumnPruning rule. Since we convert the subqueries (which was in Filter form) to join early now, we are basically treating it like a Filter in related rules.

how is it related to PushPredicateThroughProject

Sorry.. just to clarify, In the prior PR, we have added PushDownLeftSemiAntiJoin where a LeftSemi/Anti join is pushed down below Project. So the Cycle would be between ColumnPruning and PushDownLeftSemiAntiJoin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see, thanks for explaination!

@SparkQA
Copy link

SparkQA commented Jul 26, 2019

Test build #108229 has finished for PR 25258 at commit 022b680.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 29, 2019

Test build #108296 has finished for PR 25258 at commit afa9357.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 29, 2019

Test build #108297 has finished for PR 25258 at commit 49e41c9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jul 29, 2019

Test build #108309 has finished for PR 25258 at commit 49e41c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 30, 2019

Test build #108347 has finished for PR 25258 at commit 24be057.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

OptimizeSubqueries) ::
OptimizeSubqueries,
PullupCorrelatedPredicates,
RewritePredicateSubquery) ::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it affect CBO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile Could you please elaborate ? My hope is that we can get better optimized plans as we will expose the full plan to entire set of optimizations ? As an example, lets say we have a rule that can convert leftsemi join to inner join, with this, we can hope to have this conversion and if we do, then CBO will be able to impact positively to re-order joins as required. Please let me know if i am missing something..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our CBO is based on the pattern matching. We only support InnerLike joins.

Is that possible the pattern matched before but it does not matched after we convert subquery to join? RewritePredicateSubquery adds LeftSemi or LeftAnti joins into the original plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile At the moment, i am unable to think of a case when by moving the rewrite up would impact the pattern matching in CBO in a negative way since RewritePredicateSubquery needs to happen any way (it just happens later) . The leftsemi/anti join should end up being in the same position in the plan tree as we have the same pushdown rules for leftsemi and leftanti as there are for filters. If you have a case in mind, i can give it a quick try ?

@SparkQA
Copy link

SparkQA commented Dec 18, 2019

Test build #115505 has finished for PR 25258 at commit 24be057.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 28, 2020
@github-actions github-actions bot closed this Mar 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants