Skip to content

Conversation

@peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Apr 23, 2020

What changes were proposed in this pull request?

This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:

WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)

Before this PR the result was 1.

Why are the changes needed?

To fix a correctness issue.

Does this PR introduce any user-facing change?

Yes, fixes a correctness issue.

How was this patch tested?

Added new test case.

private def traverseAndSubstituteCTE(plan: LogicalPlan): LogicalPlan = {
plan.resolveOperatorsUp {
case With(child: LogicalPlan, relations) =>
// child might contain an inner CTE that has priority so traverse and substitute inner CTEs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an incorrect assumption that only direct child can contain a nested, conflicting CTE as subquery expression. As the added new test case shows such a conflicting CTE can be in any node under With.

@peter-toth
Copy link
Contributor Author

cc @cloud-fan, @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

Thank you for reporting and thank you for quick fixing, @peter-toth . I'll take a look today~

-- !query schema
struct<c:int>
-- !query output
1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding a legacy result.

@SparkQA
Copy link

SparkQA commented Apr 23, 2020

Test build #121697 has finished for PR 28318 at commit 686a827.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cteNames: Set[String] = Set.empty): Unit = {
plan.foreach {
namesInChildren: Set[String] = Set.empty,
namesInExpressions: Set[String] = Set.empty): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we separate children and expression? Can we just have one parameter outerNames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this trick is necessary to find ambiguous situations in one traversal pass. Ambiguous here doesn't only mean the the same CTE name is used again in a nested CTE, but it is also important that we don't want to throw an exception when the legacy and the corrected substitution returns the same result. E.g.

WITH t(c) AS (SELECT 1)
SELECT max(c) FROM (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)

returns 2 in both modes so we don't throw the exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really support this? I don't think it's possible at analysis time as the data can be from a table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And from the current code, namesInExpressions is not used, just carry over with recursive calls.

Copy link
Contributor Author

@peter-toth peter-toth Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do. What we want to catch here is that legacy and corrected modes MAY produce different results due to the different substitution order on the same data. But there are cases like the above when both modes does the substitution in the same order and we can be sure that the result is the same.

We usually say that in legacy mode the outer CTE takes precedence while in corrected mode the inner one does.
While this statement is entirely true for corrected mode, the legacy mode is a bit weird in terms that in case of we have a nested conflicting CTE in a subquery the inner takes precedence (similar to the corrected mode). But if we have a nested CTE in a subquery expression or in an inner children the outer does (in contrast to the corrected mode).

namesInExpressions is used, when we traverse subquery expressions or inner children (CTE definitions) we pass namesInExpressions into namesInChildren.
I admit that the naming of these parameters might be confusing namesInChildren is what we actually check for conflicts.
namesInChildren is what we pass along when we traverse children (in this case the 2 different modes do the substitution in the same order and inner always takes precedence so we don't expand this set with new names when we encounter a With).
namesInExpressions is what we pass along when we traverse subquery expressions and inner children (in this case the 2 different modes do the substitution different order). But once we encounter a subquery expressions or an inner children we need to switch the possible conflicting set to the namesInExpressions.

Copy link
Contributor

@cloud-fan cloud-fan Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WITH t(c) AS (SELECT 1)
SELECT max(c) FROM (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)

This query can't be parsed by Spark 2.4. Seems we support CTE in subquery since 3.0. So legacy behavior is the same as the correct behavior.

However, it's a subquery, not subquery expression. I think namesInExpressions is a confusing name. How about naming the two parameters as outerCTERelationNames and namesInSubqueries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, renamed.

case _ =>
case other =>
other.subqueries.foreach(
assertNoNameConflictsInCTE(_, namesInExpressions, namesInExpressions))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass namesInExpressions twice?

Copy link
Contributor Author

@peter-toth peter-toth Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do. What would you pass in here?

case _ =>
case other =>
other.subqueries.foreach(
assertNoNameConflictsInCTE(_, namesInSubqueries, namesInSubqueries))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be assertNoNameConflictsInCTE(_, namesInSubqueries, Nil)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we wouldn't report

WITH t AS (SELECT 1)
SELECT (
  SELECT (
    WITH t AS (SELECT 2)
    SELECT * FROM t
  )
)

as ambiguous, but it returns 1 in legacy and 2 in corrected modes.

e
}
}.toSet ++ namesInSubqueries
assertNoNameConflictsInCTE(child, outerCTERelationNames, newNames)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 kinds of conflicts:

  1. conflicts in the children, e.g. with t as (...) select * from (with t as (...) select * from t)
  2. conflicts in other CTE relations, e.g. with t as (...), t2 as (with t as (...) select * from t) ...
  3. conflicts in the subqueries of the children, e.g. with t as (...) select ... where ... in (with t as (...) select * from t)

The third one can be handled correctly by 2.4 as well, so we shouldn't fail.

Can you briefly explain how you check the first 2 conflicts and skip the third one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought 3. is not handled correctly in 2.4. (I haven't tested it on Spark 2.4.) But it certainly produces different results in Spark 3 in the 2 different modes.

Copy link
Contributor Author

@peter-toth peter-toth Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked 3. in Spark 2.4 with the query in the description and it returns 1 so is not handled correctly and we should fail.

  1. In this case the legacy and corrected mode returns the same result due to the same substitution order so we shouldn't throw an exception. If we call the possible conflicting set of names coming from outer CTEs as outerCTERelationNames then I handle this case by not extending the set with new names when I traverse down the child of a With (https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R64) or when I traverse down the children of any other node (https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R70-R71).

  2. and 3. In these cases, the two modes have different substitution order and so possibly different results. Obviously we need an extended set of names (newNames) when we encounter a With and switch to that set (pass in as a new outerCTERelationNames) when we traverse down inner children (2.)( https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R65) or traverse down subquery expressions (3.)(https://github.com/apache/spark/pull/28318/files#diff-d0bfa3367c63988ad7cf33397e643e75R68-R69)

So what is the contents of outerCTERelationNames? The set of CTE names coming from outer CTEs. It doesn't contain the CTEs coming from a parent With from the same LogicalPlan (1.). But it contains outer CTEs coming from Withs from outer LogicalPlans (2., 3.)
And the content of namesInSubqueries is simply all CTE names from all outer Withs, no matter if the With is the same plan or in an outer plan.

So I think what we need to do here is basically skip the 1. and check 2. and 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @peter-toth .

@SparkQA
Copy link

SparkQA commented Apr 24, 2020

Test build #121764 has finished for PR 28318 at commit 4a3ebf7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Apr 26, 2020

Test build #121817 has finished for PR 28318 at commit 4a3ebf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @peter-toth and @cloud-fan .
I'll merge this PR since this fixes the reported case.

BTW, @cloud-fan and @peter-toth . If we want to switch the default to legacy, shall we file a new JIRA issue under SPARK-31085 and continue to discuss?

cc @gatorsmile

dongjoon-hyun pushed a commit that referenced this pull request Apr 26, 2020
### What changes were proposed in this pull request?

This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)
```
Before this PR the result was `1`.

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new test case.

Closes #28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 4f53bfb)
Signed-off-by: Dongjoon Hyun <[email protected]>
@peter-toth
Copy link
Contributor Author

Thanks @cloud-fan, @dongjoon-hyun for the review.

dongjoon-hyun pushed a commit that referenced this pull request Apr 27, 2020
…blems when check name conflicts of CTE relations

### What changes were proposed in this pull request?

This is a followup of #28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets.

This PR also fixes various problems:
1. the name check should consider case sensitivity
2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` is not a real conflict and we shouldn't fail.

### Why are the changes needed?

correct the behavior

### Does this PR introduce any user-facing change?

yes, fix the fore-mentioned behaviors.

### How was this patch tested?

new tests

Closes #28371 from cloud-fan/followup.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 27, 2020
…blems when check name conflicts of CTE relations

### What changes were proposed in this pull request?

This is a followup of #28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets.

This PR also fixes various problems:
1. the name check should consider case sensitivity
2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` is not a real conflict and we shouldn't fail.

### Why are the changes needed?

correct the behavior

### Does this PR introduce any user-facing change?

yes, fix the fore-mentioned behaviors.

### How was this patch tested?

new tests

Closes #28371 from cloud-fan/followup.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 2f4f38b)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants