[SPARK-31535][SQL] Fix nested CTE substitution #28318

peter-toth · 2020-04-23T18:47:56Z

What changes were proposed in this pull request?

This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:

WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)

Before this PR the result was 1.

Why are the changes needed?

To fix a correctness issue.

Does this PR introduce any user-facing change?

Yes, fixes a correctness issue.

How was this patch tested?

Added new test case.

peter-toth · 2020-04-23T18:56:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

+  private def traverseAndSubstituteCTE(plan: LogicalPlan): LogicalPlan = {
    plan.resolveOperatorsUp {
      case With(child: LogicalPlan, relations) =>
-        // child might contain an inner CTE that has priority so traverse and substitute inner CTEs


This was an incorrect assumption that only direct child can contain a nested, conflicting CTE as subquery expression. As the added new test case shows such a conflicting CTE can be in any node under With.

peter-toth · 2020-04-23T18:58:40Z

cc @cloud-fan, @dongjoon-hyun

dongjoon-hyun · 2020-04-23T19:43:50Z

Thank you for reporting and thank you for quick fixing, @peter-toth . I'll take a look today~

sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out

dongjoon-hyun · 2020-04-23T19:50:41Z

sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out

+-- !query schema
+struct<c:int>
+-- !query output
+1


Thank you for adding a legacy result.

SparkQA · 2020-04-23T23:42:49Z

Test build #121697 has finished for PR 28318 at commit 686a827.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-04-24T01:50:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

-      cteNames: Set[String] = Set.empty): Unit = {
-    plan.foreach {
+      namesInChildren: Set[String] = Set.empty,
+      namesInExpressions: Set[String] = Set.empty): Unit = {


Why do we separate children and expression? Can we just have one parameter outerNames?

IMHO this trick is necessary to find ambiguous situations in one traversal pass. Ambiguous here doesn't only mean the the same CTE name is used again in a nested CTE, but it is also important that we don't want to throw an exception when the legacy and the corrected substitution returns the same result. E.g.

WITH t(c) AS (SELECT 1) SELECT max(c) FROM ( WITH t(c) AS (SELECT 2) SELECT * FROM t )

returns 2 in both modes so we don't throw the exception.

Do we really support this? I don't think it's possible at analysis time as the data can be from a table.

And from the current code, namesInExpressions is not used, just carry over with recursive calls.

Yes, we do. What we want to catch here is that legacy and corrected modes MAY produce different results due to the different substitution order on the same data. But there are cases like the above when both modes does the substitution in the same order and we can be sure that the result is the same.

We usually say that in legacy mode the outer CTE takes precedence while in corrected mode the inner one does.
While this statement is entirely true for corrected mode, the legacy mode is a bit weird in terms that in case of we have a nested conflicting CTE in a subquery the inner takes precedence (similar to the corrected mode). But if we have a nested CTE in a subquery expression or in an inner children the outer does (in contrast to the corrected mode).

namesInExpressions is used, when we traverse subquery expressions or inner children (CTE definitions) we pass namesInExpressions into namesInChildren.
I admit that the naming of these parameters might be confusing namesInChildren is what we actually check for conflicts.
namesInChildren is what we pass along when we traverse children (in this case the 2 different modes do the substitution in the same order and inner always takes precedence so we don't expand this set with new names when we encounter a With).
namesInExpressions is what we pass along when we traverse subquery expressions and inner children (in this case the 2 different modes do the substitution different order). But once we encounter a subquery expressions or an inner children we need to switch the possible conflicting set to the namesInExpressions.

WITH t(c) AS (SELECT 1) SELECT max(c) FROM ( WITH t(c) AS (SELECT 2) SELECT * FROM t )

This query can't be parsed by Spark 2.4. Seems we support CTE in subquery since 3.0. So legacy behavior is the same as the correct behavior.

However, it's a subquery, not subquery expression. I think namesInExpressions is a confusing name. How about naming the two parameters as outerCTERelationNames and namesInSubqueries?

Ok, renamed.

cloud-fan · 2020-04-24T12:34:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

-      case _ =>
+      case other =>
+        other.subqueries.foreach(
+          assertNoNameConflictsInCTE(_, namesInExpressions, namesInExpressions))


Do we need to pass namesInExpressions twice?

I think we do. What would you pass in here?

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

cloud-fan · 2020-04-24T13:35:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

-      case _ =>
+      case other =>
+        other.subqueries.foreach(
+          assertNoNameConflictsInCTE(_, namesInSubqueries, namesInSubqueries))


shouldn't this be assertNoNameConflictsInCTE(_, namesInSubqueries, Nil)?

In that case we wouldn't report

WITH t AS (SELECT 1) SELECT ( SELECT ( WITH t AS (SELECT 2) SELECT * FROM t ) )

as ambiguous, but it returns 1 in legacy and 2 in corrected modes.

cloud-fan · 2020-04-24T14:18:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala

-            e
-        }
+        }.toSet ++ namesInSubqueries
+        assertNoNameConflictsInCTE(child, outerCTERelationNames, newNames)


There are 3 kinds of conflicts:

conflicts in the children, e.g. with t as (...) select * from (with t as (...) select * from t)

conflicts in other CTE relations, e.g. with t as (...), t2 as (with t as (...) select * from t) ...

conflicts in the subqueries of the children, e.g. with t as (...) select ... where ... in (with t as (...) select * from t)

The third one can be handled correctly by 2.4 as well, so we shouldn't fail.

Can you briefly explain how you check the first 2 conflicts and skip the third one?

I thought 3. is not handled correctly in 2.4. (I haven't tested it on Spark 2.4.) But it certainly produces different results in Spark 3 in the 2 different modes.

In legacy it returns 1 (https://github.com/apache/spark/pull/28318/files#diff-46573f0caf504536cf46d2b267bb5a37R171-R181). This PR hasn't touched this mode, but if you are right then it should.

In corrected it returns empty result after this PR (before the PR it returned 1)(https://github.com/apache/spark/pull/28318/files#diff-0c15df52c8221453db6102ff4196cb90R171-R181).

I checked 3. in Spark 2.4 with the query in the description and it returns 1 so is not handled correctly and we should fail.

In this case the legacy and corrected mode returns the same result due to the same substitution order so we shouldn't throw an exception. If we call the possible conflicting set of names coming from outer CTEs as outerCTERelationNames then I handle this case by not extending the set with new names when I traverse down the child of a With (https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R64) or when I traverse down the children of any other node (https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R70-R71).

and 3. In these cases, the two modes have different substitution order and so possibly different results. Obviously we need an extended set of names (newNames) when we encounter a With and switch to that set (pass in as a new outerCTERelationNames) when we traverse down inner children (2.)( https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R65) or traverse down subquery expressions (3.)(https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R68-R69)

So what is the contents of outerCTERelationNames? The set of CTE names coming from outer CTEs. It doesn't contain the CTEs coming from a parent With from the same LogicalPlan (1.). But it contains outer CTEs coming from Withs from outer LogicalPlans (2., 3.)
And the content of namesInSubqueries is simply all CTE names from all outer Withs, no matter if the With is the same plan or in an outer plan.

So I think what we need to do here is basically skip the 1. and check 2. and 3.

I agree with @peter-toth .

SparkQA · 2020-04-24T19:01:25Z

Test build #121764 has finished for PR 28318 at commit 4a3ebf7.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-04-25T21:17:06Z

Retest this please.

SparkQA · 2020-04-26T02:34:10Z

Test build #121817 has finished for PR 28318 at commit 4a3ebf7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @peter-toth and @cloud-fan .
I'll merge this PR since this fixes the reported case.

BTW, @cloud-fan and @peter-toth . If we want to switch the default to legacy, shall we file a new JIRA issue under SPARK-31085 and continue to discuss?

cc @gatorsmile

### What changes were proposed in this pull request? This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result: ``` WITH t(c) AS (SELECT 1) SELECT * FROM t WHERE c IN ( WITH t(c) AS (SELECT 2) SELECT * FROM t ) ``` Before this PR the result was `1`. ### Why are the changes needed? To fix a correctness issue. ### Does this PR introduce any user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new test case. Closes #28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution. Authored-by: Peter Toth <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 4f53bfb) Signed-off-by: Dongjoon Hyun <[email protected]>

peter-toth · 2020-04-27T07:07:18Z

Thanks @cloud-fan, @dongjoon-hyun for the review.

…blems when check name conflicts of CTE relations ### What changes were proposed in this pull request? This is a followup of #28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets. This PR also fixes various problems: 1. the name check should consider case sensitivity 2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` is not a real conflict and we shouldn't fail. ### Why are the changes needed? correct the behavior ### Does this PR introduce any user-facing change? yes, fix the fore-mentioned behaviors. ### How was this patch tested? new tests Closes #28371 from cloud-fan/followup. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

…blems when check name conflicts of CTE relations ### What changes were proposed in this pull request? This is a followup of #28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets. This PR also fixes various problems: 1. the name check should consider case sensitivity 2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` is not a real conflict and we shouldn't fail. ### Why are the changes needed? correct the behavior ### Does this PR introduce any user-facing change? yes, fix the fore-mentioned behaviors. ### How was this patch tested? new tests Closes #28371 from cloud-fan/followup. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 2f4f38b) Signed-off-by: Dongjoon Hyun <[email protected]>

[SPARK-31535][SQL] Fix nested CTE substitution

686a827

probot-autolabeler bot added the SQL label Apr 23, 2020

peter-toth commented Apr 23, 2020

View reviewed changes

peter-toth mentioned this pull request Apr 23, 2020

[SPARK-28228][SQL] Fix substitution order of nested WITH clauses #25029

Closed

dongjoon-hyun reviewed Apr 23, 2020

View reviewed changes

sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out Show resolved Hide resolved

dongjoon-hyun reviewed Apr 23, 2020

View reviewed changes

cloud-fan reviewed Apr 24, 2020

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala Show resolved Hide resolved

rename parameters

4a3ebf7

cloud-fan reviewed Apr 24, 2020

View reviewed changes

dongjoon-hyun approved these changes Apr 26, 2020

View reviewed changes

dongjoon-hyun closed this in 4f53bfb Apr 26, 2020

cloud-fan mentioned this pull request Apr 27, 2020

[SPARK-31577][SQL] Fix case-sensitivity and forward name conflict problems when check name conflicts of CTE relations #28371

Closed

[SPARK-31535][SQL] Fix nested CTE substitution #28318

[SPARK-31535][SQL] Fix nested CTE substitution #28318

Uh oh!

Conversation

peter-toth commented Apr 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peter-toth commented Apr 23, 2020

Uh oh!

dongjoon-hyun commented Apr 23, 2020

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Apr 23, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peter-toth Apr 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan Apr 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peter-toth Apr 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peter-toth Apr 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Apr 24, 2020

Uh oh!

dongjoon-hyun commented Apr 25, 2020

Uh oh!

SparkQA commented Apr 26, 2020

Uh oh!

dongjoon-hyun left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

peter-toth commented Apr 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

peter-toth commented Apr 23, 2020 •

edited

Loading

peter-toth Apr 24, 2020 •

edited

Loading

cloud-fan Apr 24, 2020 •

edited

Loading

peter-toth Apr 24, 2020 •

edited

Loading

peter-toth Apr 24, 2020 •

edited

Loading

dongjoon-hyun left a comment •

edited

Loading