[SPARK-19799][SQL] Support WITH clause in subqueries #24831
Conversation
ok to test
Test build #106360 has finished for PR 24831 at commit
cc @gatorsmile since this is part of PostgreSQL feature parity.
741c727 to 0b516fc
Test build #106461 has finished for PR 24831 at commit
0b516fc to ca27852
Test build #106541 has finished for PR 24831 at commit
ca27852 to d76a265
Test build #106793 has finished for PR 24831 at commit
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
why do we need this?
Thanks @mgaido91! It is not needed indeed.
I remember now why I used lazy here. A CTE definition can be used multiple times in WITH, but the by-name parameter (ctePlan = traverseAndSubstituteCTE(...)) should be executed only once.
But now I believe it is better to use lazy outside of substituteCTE than inside; please review my commit 7d69105.
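The by-name versus lazy distinction discussed above can be sketched in plain Scala (a minimal illustration, not the actual Analyzer code — names like `useTwiceByName` are invented): a by-name argument is re-evaluated on every access, while capturing it in a `lazy val` forces it at most once.

```scala
// Minimal sketch (not Spark code): a by-name parameter is re-evaluated
// on each access, while a `lazy val` caches the first evaluation.
object ByNameVsLazy extends App {
  var evaluations = 0
  def plan(): String = { evaluations += 1; "ctePlan" }

  // Accessing the by-name argument twice evaluates it twice.
  def useTwiceByName(ctePlan: => String): Unit = { ctePlan; ctePlan }
  useTwiceByName(plan())
  println(evaluations) // 2

  evaluations = 0
  // Capturing the by-name argument in a lazy val forces it at most once,
  // even if the CTE definition is referenced multiple times.
  def useTwiceLazily(ctePlan: => String): Unit = {
    lazy val cached = ctePlan
    cached; cached
  }
  useTwiceLazily(plan())
  println(evaluations) // 1
}
```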
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
Test build #107069 has finished for PR 24831 at commit
Test build #107073 has finished for PR 24831 at commit
Test build #107079 has finished for PR 24831 at commit
Test build #107084 has finished for PR 24831 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
Why do we need this flag? Would some tests fail without it?
No, they would not, but I wanted to do CTE substitution in the current plan only (not in the subqueries) when it is safe. (CTE substitution will run for the subqueries later anyway.)
@peter-toth Can we split this PR into two parts cleanly: 1. the subquery support and 2. the behaviour change? Also, I think we need to update the migration guide for part 2. cc: @gatorsmile
@maropu, all right, I dropped the changes that relate to the order of substitution and will do that in another ticket. What remains here is just the
Yea, thanks!
Very welcome.
Test build #107101 has finished for PR 24831 at commit
Test build #107106 has finished for PR 24831 at commit
/**
 * Analyze WITH nodes and substitute child plan with CTE definitions.
 */
object CTESubstitution extends Rule[LogicalPlan] {
Can we avoid moving the class, to keep the diff smaller?
The idea of moving the rule to a separate file came from here: #24831 (comment), but I think you are right, @mgaido91, because we have cut the scope and split the PR since then. Maybe the other part (#25029) could extract the rule to a separate file, as that one makes the rule a bit more complicated. Does that work for you, @maropu?
Yes, that is what I meant: we can move the rule in the other PR, which refactors it more thoroughly.
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
plan resolveOperatorsUp {
  case UnresolvedRelation(Seq(table)) if resolver(cteName, table) =>
    ctePlan
  case u: UnresolvedRelation =>
why did you remove this?
I don't think this line does anything, nor can an UnresolvedRelation contain an expression, so I thought it was safe and a good idea to remove the line. Please correct me if I'm wrong.
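As a side note, the reason an identity case like this is redundant can be illustrated with plain Scala partial-function semantics (a sketch, not Catalyst code — `transform` here is a hypothetical stand-in for how tree-transform rules such as resolveOperatorsUp behave): nodes the partial function is not defined at simply pass through unchanged.

```scala
object PartialRuleSketch extends App {
  // A rewrite rule as a partial function, analogous to the pattern-match
  // block passed to a Catalyst tree transform.
  val rule: PartialFunction[String, String] = {
    case "cteRef" => "ctePlan"
  }

  // Apply the rule where defined; otherwise keep the node as-is.
  def transform(node: String): String =
    rule.applyOrElse(node, identity[String])

  println(transform("cteRef"))     // ctePlan
  println(transform("otherTable")) // otherTable: unmatched nodes pass through
}
```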
Yes, I think you're right; I was just curious about the reason for this change.
mgaido91 left a comment
only a style comment, otherwise LGTM, thanks!
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
-- !query 20 schema
struct<>
-- !query 20 output
org.apache.spark.sql.AnalysisException
Just a question: is this going to be addressed in the PR that allows recursive queries, or is this an invalid query?
I have a WIP PR open (#23531) that would add support for recursive queries (and recursive subqueries and subquery expressions too). But these queries lack the RECURSIVE keyword, and using an outer recursive reference in a subquery (as in the next query) is not allowed according to the SQL standard, so these will never become valid.
But this PR should be accepted first; then #25029 and #23531 could come.
Actually, I think I'm removing the test WITH r AS (SELECT * FROM r) SELECT * FROM r; because there is already a similar one in cte.sql, and moving WITH r AS (SELECT (SELECT * FROM r)) SELECT * FROM r; next to the existing one.
Test build #107168 has finished for PR 24831 at commit
Test build #107175 has finished for PR 24831 at commit
Test build #107178 has finished for PR 24831 at commit
struct<1:int>
-- !query 12 output
1
This result is different from the PostgreSQL one:
postgres=# WITH
postgres-# t AS (SELECT 1),
postgres-# t2 AS (
postgres(# WITH t AS (SELECT 2)
postgres(# SELECT * FROM t
postgres(# )
postgres-# SELECT * FROM t2;
?column?
----------
2
(1 row)
Will this be addressed in the follow-up #25029?
I also agree that this is unavoidable in this PR. (cc @gatorsmile)
Yes, after #25029 it will return 2 (https://github.com/apache/spark/pull/25029/files#diff-fc515a5db268d29b08b80f5eb8202026R145)
dongjoon-hyun left a comment
+1, LGTM. Thank you, @peter-toth , @maropu , @mgaido91 .
The original PR is split into two (this and #25029) according to the review comment.
This is a new feature at Spark 3.0.0 and will be consistent with PostgreSQL soon.
Merged to master to move forward.
Thanks @dongjoon-hyun, @maropu, @mgaido91 for the review! I will prepare #25029 for review soon.
What changes were proposed in this pull request?
This PR adds support for a WITH clause within a subquery, so such queries become valid.
How was this patch tested?
Added new UTs.
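The example query in the PR description did not survive extraction. The following is a hypothetical illustration of the shape this change makes valid — a WITH clause nested inside a subquery — with invented names (t, col, s) for illustration only:

```sql
-- Hypothetical illustration (names invented): a CTE defined inside a
-- subquery, previously rejected by Spark's parser and now supported.
SELECT *
FROM (
  WITH t AS (SELECT 1 AS col)
  SELECT * FROM t
) s;
```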