[WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates and RewritePredicateSubquery after OptimizeSubqueries #17520

nsyca · 2017-04-03T16:49:04Z

What changes were proposed in this pull request?

This commit moves two rules right next to the rule OptimizeSubqueries.

PullupCorrelatedPredicates: the rewrite of [Not] Exists and [Not] In (ListQuery) to PredicateSubquery
RewritePredicateSubquery: the rewrite of PredicateSubquery to LeftSemi/LeftAnti

With this change, a [Not] Exists/In subquery is now rewritten to LeftSemi/LeftAnti at the beginning of the Optimizer.
One Todo is to merge the two-stage rewrite in rule PullupCorrelatedPredicates and rule RewritePredicateSubquery into a single-stage rewrite rule.
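
To make the target of the rewrite concrete, here is a minimal sketch in plain Python (not Spark code) that models tables as lists of dicts and checks that a correlated EXISTS filter selects the same rows as a left semi join, and NOT EXISTS the same rows as a left anti join. All function names and sample tables are hypothetical and chosen only for illustration. The sketch covers EXISTS/NOT EXISTS only and ignores NULL handling, which matters for NOT IN.

```python
# Toy relational model: a table is a list of dicts. This is not Catalyst code.

def exists_filter(outer, inner, correlated):
    """Keep outer rows for which the correlated subquery returns at least one row."""
    return [o for o in outer if any(correlated(o, i) for i in inner)]

def not_exists_filter(outer, inner, correlated):
    """Keep outer rows for which the correlated subquery returns no row."""
    return [o for o in outer if not any(correlated(o, i) for i in inner)]

def left_semi_join(left, right, cond):
    """Emit each left row at most once if some right row matches."""
    out = []
    for l in left:
        for r in right:
            if cond(l, r):
                out.append(l)
                break  # one match is enough; never duplicate the left row
    return out

def left_anti_join(left, right, cond):
    """Emit each left row that has no matching right row."""
    out = []
    for l in left:
        matched = False
        for r in right:
            if cond(l, r):
                matched = True
                break
        if not matched:
            out.append(l)
    return out

# Hypothetical sample data: t2 contains a duplicate match for a = 2.
t1 = [{"a": 1}, {"a": 2}, {"a": 3}]
t2 = [{"b": 2}, {"b": 2}, {"b": 3}]
cond = lambda o, i: o["a"] == i["b"]

semi = left_semi_join(t1, t2, cond)
anti = left_anti_join(t1, t2, cond)
```

Note that the duplicate match in `t2` does not duplicate the row with `a = 2`, which is why a semi join rather than an inner join is the natural target for EXISTS.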

How was this patch tested?

Unit tests with test cases in SQLQueryTestSuite under the directory ./sql/core/src/test/resources/sql-tests/inputs/subquery.

…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?

./dev/run-tests, plus a new unit test on the problematic pattern.

…timizeSubqueries

This commit moves two rules right next to the rule OptimizeSubqueries.

1. PullupCorrelatedPredicates: the rewrite of [Not] Exists and [Not] In (ListQuery) to PredicateSubquery
2. RewritePredicateSubquery: the rewrite of PredicateSubquery to LeftSemi/LeftAnti

With this change, a [Not] Exists/In subquery is now rewritten to LeftSemi/LeftAnti at the beginning of the Optimizer.

By moving rule PullupCorrelatedPredicates after rule OptimizeSubqueries, all the rules from the nested call to the entire Optimizer on the plans in subqueries will need to deal with (1) the correlated columns wrapped with OuterReference, and (2) the SubqueryExpression.

We will block any push down of both types of expressions for the following reasons:

1. We do not want to push any correlated expressions further down the plan tree. Deep correlation is not yet supported in Spark, and, even when supported, deep correlation is harder to unnest into a join.
2. We do not want to push any correlated subquery down because the correlated columns' ExprIds in the subquery may need to be remapped to different ExprIds from the plan below the current Filter that hosts the subquery.

Another side effect is that we used to push down an Exists/In subquery as if it were a predicate in rule PushDownPredicate and rule PushPredicateThroughJoin. Now that an Exists/In subquery is rewritten to LeftSemi/LeftAnti, we need to handle the push down of LeftSemi/LeftAnti instead. This will be done in a followup commit.

Another Todo is to merge the two-stage rewrite in rule PullupCorrelatedPredicates and rule RewritePredicateSubquery into a single-stage rewrite.

SparkQA · 2017-04-03T19:04:37Z

Test build #75483 has finished for PR 17520 at commit 380d5d7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

1. Make PushDownPredicate aware of LeftSemi/LeftAnti
2. Add new rule PushLeftSemiAntiThroughJoin
3. Extend EliminateOuterJoin to deal with LeftSemi/LeftAnti

…TC 1.3

nsyca · 2017-04-03T20:36:10Z

Commit bc4fe93 is initial work to demonstrate the idea of merging the 2-stage transformation of [NOT] Exists/IN subquery into LeftSemi/LeftAnti. It has the skeleton of the work, but more details still need to be filled in.

More explanation...

By moving rule PullupCorrelatedPredicates after rule OptimizeSubqueries, all the rules from the nested call to the entire Optimizer on the plans in subqueries will need to deal with (1) the correlated columns wrapped with OuterReference, and (2) the SubqueryExpression.

We will block any push down of both types of expressions for the following reasons:

1. We do not want to push any correlated expressions further down the plan tree. Deep correlation is not yet supported in Spark, and, even when supported, deep correlation is harder to unnest into a join.
2. We do not want to push any correlated subquery down because the correlated columns' ExprIds in the subquery may need to be remapped to different ExprIds from the plan below the current Filter that hosts the subquery.
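
The blocking described above can be sketched as a guard that splits a Filter's conjuncts into those that may be pushed down and those that must stay. This is a toy model in plain Python, not the Catalyst implementation: the `Expr` class, the string labels in `BLOCKED`, and `split_pushable` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Expr:
    """Hypothetical expression node: a kind label plus child expressions."""
    kind: str
    children: Tuple["Expr", ...] = ()

# Hypothetical labels for the two expression types that must not be pushed down:
# correlated columns (OuterReference) and subquery expressions.
BLOCKED = {"OuterReference", "Exists", "ListQuery"}

def contains_blocked(e):
    """True if the expression tree contains an outer reference or a subquery."""
    return e.kind in BLOCKED or any(contains_blocked(c) for c in e.children)

def split_pushable(conjuncts):
    """Split conjuncts into (safe to push down, must stay in the current Filter)."""
    pushable = [c for c in conjuncts if not contains_blocked(c)]
    stay = [c for c in conjuncts if contains_blocked(c)]
    return pushable, stay

# Three sample conjuncts: a plain predicate, a correlated predicate, a subquery.
plain = Expr("GreaterThan", (Expr("Attr"), Expr("Literal")))
correlated = Expr("EqualTo", (Expr("Attr"), Expr("OuterReference", (Expr("Attr"),))))
subquery = Expr("Exists")

pushable, stay = split_pushable([plain, correlated, subquery])
```

Only `plain` ends up in `pushable`; the correlated predicate and the subquery remain in the hosting Filter.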

One side effect is that we used to push down an Exists/In subquery as if it were a predicate in rule PushDownPredicate and rule PushPredicateThroughJoin. Now that an Exists/In subquery is rewritten to LeftSemi/LeftAnti, we need to handle the push down of LeftSemi/LeftAnti instead. This will be done in a followup commit.
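
As an illustration of what pushing a LeftSemi down can mean, the following sketch (plain Python, not Spark code) checks one case where it is valid: when the semi-join condition references only columns from one side of an inner join, applying the semi join below that join gives the same rows as applying it on top. The tables, column names, and helper functions are hypothetical, and the sketch says nothing about which cases the followup commit actually handles.

```python
# Toy relational model: a table is a list of dicts. This is not Catalyst code.

def inner_join(left, right, cond):
    """Emit the merged row for every matching (left, right) pair."""
    return [{**l, **r} for l in left for r in right if cond(l, r)]

def left_semi_join(left, right, cond):
    """Emit each left row once if some right row matches."""
    return [l for l in left if any(cond(l, r) for r in right)]

# Hypothetical sample tables.
a = [{"a_id": 1, "k": 10}, {"a_id": 2, "k": 20}, {"a_id": 3, "k": 30}]
b = [{"b_k": 10}, {"b_k": 20}, {"b_k": 20}]
s = [{"s_id": 2}, {"s_id": 3}]

join_cond = lambda l, r: l["k"] == r["b_k"]
# The semi-join condition references only columns that come from `a`.
semi_cond = lambda l, r: l["a_id"] == r["s_id"]

# LeftSemi on top of the inner join.
semi_on_top = left_semi_join(inner_join(a, b, join_cond), s, semi_cond)

# LeftSemi pushed below the inner join, onto the `a` side.
semi_pushed = inner_join(left_semi_join(a, s, semi_cond), b, join_cond)
```

Both plans produce the same rows here; the pushed-down form filters `a` before the join, so the inner join sees fewer input rows.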

nsyca · 2017-04-03T20:36:55Z

Commit 4aaab02 has the complete functionality and new test cases.

nsyca · 2017-04-03T20:43:49Z