Conversation

Contributor

@peter-toth commented Jun 10, 2019

What changes were proposed in this pull request?

This PR adds support of WITH clause within a subquery so this query becomes valid:

SELECT max(c) FROM (
  WITH t AS (SELECT 1 AS c)
  SELECT * FROM t
)
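
For reference, here is a minimal usage sketch in Scala (the local SparkSession setup and object name below are illustrative assumptions, not part of this change): with the parser accepting WITH inside a subquery, the query above can be submitted through spark.sql and is expected to return a single row containing 1.

import org.apache.spark.sql.SparkSession

object WithInSubqueryExample {
  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("with-in-subquery").getOrCreate()

    // The query from the description: a CTE defined inside a FROM-clause subquery.
    val df = spark.sql(
      """SELECT max(c) FROM (
        |  WITH t AS (SELECT 1 AS c)
        |  SELECT * FROM t
        |)""".stripMargin)

    df.show()  // expected output: a single row with value 1
    spark.stop()
  }
}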

How was this patch tested?

Added new UTs.

@dongjoon-hyun
Member

ok to test


SparkQA commented Jun 10, 2019

Test build #106360 has finished for PR 24831 at commit 741c727.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @gatorsmile since this is a part of PostgreSQL feature parity.


SparkQA commented Jun 13, 2019

Test build #106461 has finished for PR 24831 at commit 0b516fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 15, 2019

Test build #106541 has finished for PR 24831 at commit ca27852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 22, 2019

Test build #106793 has finished for PR 24831 at commit d76a265.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

why do we need this?

Contributor Author

Thanks @mgaido91! Indeed, it is not needed.

Contributor Author

I remember now why I used lazy here. A CTE definition can be referenced multiple times in a WITH clause, but the call-by-name parameter (ctePlan = traverseAndSubstituteCTE(...)) should be evaluated only once.
But now I believe it is better to use lazy outside of substituteCTE than inside it; please review my commit 7d69105.
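
To illustrate (a self-contained sketch with invented names, not the actual Catalyst code): a by-name parameter is re-evaluated on every reference, so caching it in a lazy val makes the CTE plan resolution happen at most once even when the definition is referenced several times.

object LazyByNameSketch {
  // `ctePlan` is passed by name; `resolved` defers and caches its evaluation.
  def substituteCTE(plan: String, cteName: String, ctePlan: => String): String = {
    lazy val resolved = ctePlan  // evaluated on first use only
    plan.split(" ").map(t => if (t == cteName) resolved else t).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    def expensivePlan: String = {
      println("building CTE plan...")  // printed once, despite two references to t
      "(SELECT 1 AS c)"
    }
    println(substituteCTE("SELECT * FROM t JOIN t", "t", expensivePlan))
  }
}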


SparkQA commented Jul 1, 2019

Test build #107069 has finished for PR 24831 at commit 9096c4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 1, 2019

Test build #107073 has finished for PR 24831 at commit fd9ca42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 1, 2019

Test build #107079 has finished for PR 24831 at commit 39b9596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 1, 2019

Test build #107084 has finished for PR 24831 at commit 7d69105.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Why do we need this flag? Would some tests fail without it?

Contributor Author

No, they would not, but I wanted to do CTE substitution in the current plan only (not in the subqueries) when it is safe. (CTE substitution will run on subqueries later anyway.)
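
A rough, self-contained sketch of that idea (the node types and the flag name here are invented for illustration, not taken from the PR): subquery plans are left untouched unless the caller forces substitution into them, since the analyzer applies the same rule to those subqueries in a later pass.

sealed trait Node
case class CteRef(name: String) extends Node
case class SubqueryExpr(plan: Node) extends Node
case class Select(children: Seq[Node]) extends Node

def substitute(node: Node, cteName: String, ctePlan: Node,
    forceSubstituteInSubqueries: Boolean): Node = node match {
  case CteRef(n) if n.equalsIgnoreCase(cteName) => ctePlan
  case SubqueryExpr(inner) if forceSubstituteInSubqueries =>
    SubqueryExpr(substitute(inner, cteName, ctePlan, forceSubstituteInSubqueries))
  case Select(children) =>
    Select(children.map(substitute(_, cteName, ctePlan, forceSubstituteInSubqueries)))
  case other => other  // subqueries are skipped here when substitution into them is not forced
}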

Member

maropu commented Jul 2, 2019

@peter-toth Can we split this PR into two parts cleanly: 1. the subquery support and 2. the behaviour change? Also, I think we need to update the migration guide for part 2. cc: @gatorsmile

@peter-toth changed the title from [SPARK-19799][SQL] Better support for WITH clause to [SPARK-19799][SQL] Support WITH clause in subqueries on Jul 2, 2019
Contributor Author

peter-toth commented Jul 2, 2019

@peter-toth Can we split this PR into two parts cleanly: 1. the subquery support and 2. the behaviour change? Also, I think we need to update the migration guide for part 2. cc: @gatorsmile

@maropu, all right, I dropped the changes that relate to the order of substitution and will do that in another ticket. What remains here is just the WITH support in subqueries.

Member

maropu commented Jul 2, 2019

Yea, thanks!

@peter-toth
Contributor Author

Yea, thanks!

Very welcome.
I moved substitution order related changes to #25029, but it is still WIP and this one should be reviewed first.


SparkQA commented Jul 2, 2019

Test build #107101 has finished for PR 24831 at commit cff4bec.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 2, 2019

Test build #107106 has finished for PR 24831 at commit 85e39d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Analyze WITH nodes and substitute child plan with CTE definitions.
 */
object CTESubstitution extends Rule[LogicalPlan] {
Contributor

can we avoid moving the class, in order to keep the diff smaller?

Contributor Author
@peter-toth Jul 3, 2019

The idea of moving the rule to a separate file came from here: #24831 (comment), but I think you are right @mgaido91, because we have cut the scope and split the PR since then. Maybe the other part (#25029) could extract the rule to a separate file, as that one makes the rule a bit more complicated. Does that work for you @maropu?

Contributor

Yes, that was what I meant; we can move the rule in the other PR, which refactors it more thoroughly.

plan resolveOperatorsUp {
  case UnresolvedRelation(Seq(table)) if resolver(cteName, table) =>
    ctePlan
  case u: UnresolvedRelation =>
Contributor

why did you remove this?

Contributor Author
@peter-toth Jul 3, 2019

I don't think this line does anything, and an UnresolvedRelation can't have an expression, so I thought it was safe and a good idea to remove the line. Please correct me if I'm wrong.

Contributor

Yes, I think you're right; I was just curious about the reason for this change.

Contributor
@mgaido91 left a comment

only a style comment, otherwise LGTM, thanks!


-- !query 20 schema
struct<>
-- !query 20 output
org.apache.spark.sql.AnalysisException
Contributor

Just a question, is this going to be addressed in the PR which allows recursive subqueries or is this an invalid query?

Contributor Author
@peter-toth Jul 3, 2019

I have a WIP PR open (#23531) that would add support for recursive queries (and recursive subqueries and subquery expressions too). But these queries lack the RECURSIVE keyword, and using an outer recursive reference in a subquery (the next query) is not allowed according to the SQL standard, so these will never become valid.

But this PR should be accepted first, and then #25029 and #23531 can follow.

Actually, I think I'll remove the test WITH r AS (SELECT * FROM r) SELECT * FROM r; because there is already a similar one in cte.sql, and move WITH r AS (SELECT (SELECT * FROM r)) SELECT * FROM r; next to the existing one.


SparkQA commented Jul 3, 2019

Test build #107168 has finished for PR 24831 at commit 837c776.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 3, 2019

Test build #107175 has finished for PR 24831 at commit d0c57c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 3, 2019

Test build #107178 has finished for PR 24831 at commit 9ec6eaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

struct<1:int>
-- !query 12 output
1

Member

This result is different from the pg one:


postgres=# WITH
postgres-#   t AS (SELECT 1),
postgres-#   t2 AS (
postgres(#     WITH t AS (SELECT 2)
postgres(#     SELECT * FROM t
postgres(#   )
postgres-# SELECT * FROM t2;
 ?column? 
----------
        2
(1 row)

Will this be addressed in the follow-up #25029?

Member

I also agree that this is unavoidable in this PR (cc @gatorsmile).

Member
@dongjoon-hyun left a comment

+1, LGTM. Thank you, @peter-toth , @maropu , @mgaido91 .
The original PR is split into two (this and #25029) according to the review comment.
This is a new feature in Spark 3.0.0 and will be consistent with PostgreSQL soon.
Merged to master to move forward.

@peter-toth
Contributor Author

Thanks @dongjoon-hyun, @maropu, @mgaido91 for the review! I will prepare #25029 for review soon.
