-
Notifications
You must be signed in to change notification settings - Fork 29k
SPARK-13827[SQL] Can't add subquery to an operator with same-name outputs while generate SQL string #11658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #52926 has finished for PR 11658 at commit
|
|
You are so fast! Will do the review tonight or tomorrow. I have another test case for this issue. Maybe you can take it. This is Project -- Subquery -- Filter -- Aggregate --... SELECT Count(a.value),
b.KEY,
a.KEY
FROM parquet_t1 a,
parquet_t1 b
GROUP BY a.KEY,
b.KEY
HAVING Max(a.KEY) > 0 |
|
Test build #52985 has finished for PR 11658 at commit
|
|
Test build #52987 has finished for PR 11658 at commit
|
| ) | ||
| case SQLTable(database, table, _, sample) => | ||
| val qualifiedName = s"${quoteIdentifier(database)}.${quoteIdentifier(table)}" | ||
| sample.map { case (lowerBound, upperBound) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change, you can remove case Sample
SELECT x.key, x.value, y.a, y.b, y.c, y.d FROM parquet_t1 x JOIN parquet_t2 y ON x.key = y.aThe generated SQL is SELECT `gen_attr_30` AS `key`, `gen_attr_31` AS `value`, `gen_attr_32` AS `a`, `gen_attr_33` AS `b`, `gen_attr_34` AS `c`, `gen_attr_35` AS `d` FROM (SELECT `gen_attr_30`, `gen_attr_31`, `gen_attr_32`, `gen_attr_33`, `gen_attr_34`, `gen_attr_35` FROM (SELECT `key` AS `gen_attr_30`, `value` AS `gen_attr_31` FROM `default`.`parquet_t1`) AS gen_subquery_0 INNER JOIN (SELECT `a` AS `gen_attr_32`, `b` AS `gen_attr_33`, `c` AS `gen_attr_34`, `d` AS `gen_attr_35` FROM `default`.`parquet_t2`) AS gen_subquery_1 ON (`gen_attr_30` = `gen_attr_32`)) AS gen_subquery_2I compared the Optimized Logical Plan of these two queries: Here, we always add extra |
|
I think it's because our optimizer is not smart enough, these alias-only |
|
True. : ) |
|
Test build #52993 has finished for PR 11658 at commit
|
|
retest this please |
|
Test build #53058 has finished for PR 11658 at commit
|
|
retest this please. |
|
Test build #53172 has finished for PR 11658 at commit
|
|
Test build #53182 has finished for PR 11658 at commit
|
|
Test build #53203 has finished for PR 11658 at commit
|
| case _: LocalLimit => plan | ||
| case _: GlobalLimit => plan | ||
| case _: SQLTable => plan | ||
| case OneRowRelation => plan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we do not need to add a subquery for these kinds of nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the comments says, we don't need to add sub-query if this operator can be put after FROM. So obviously, SubqueryAlias, Join, SQLTable, OneRowRelation don't need extra sub-query. Currently we only support convert logical plan that is parsed from SQL string to SQL string, this implies, Filter, Limit will always appear after table relation and they will generate SQL string like tbl WHERE ... LIMIT ... which can be put after FROM.
Anyway this logical is just copied from original code.
| override def sql: String = { | ||
| val qualifiersString = | ||
| if (qualifiers.isEmpty) "" else qualifiers.map(quoteIdentifier).mkString("", ".", ".") | ||
| val aliasName = if (isGenerated) s"$name#${exprId.id}" else s"$name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is isGenerated still needed? (if not, we do not need to remove it in this PR). Also, what is the reason of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isGenerated is still needed to avoid resolving column names on these generated internal attributes. However, it should not affect the sql anymore, as this PR need to control the format of attribute names and alias names.
|
@cloud-fan It will be super helpful if we can have an example in the description as well as in the code. I feel this kind of changes is hard to fully understand without good examples. Thanks! |
|
Thank you for the example. That's super helpful. We should also put that in the code. Changes look good. @liancheng It will be good if you can take a look at it later. I am merging this to master. We can put this example in the comment while working on another PR related to view support. |
## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on #11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over #11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <[email protected]> Closes #11696 from cloud-fan/generate.
…utputs while generate SQL string
## What changes were proposed in this pull request?
This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs.
To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
```
SELECT sq_1.key, sq_1.max
FROM (
SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
FROM (
SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
) AS sq_0
) AS sq_1
```
You can see, the `key` columns become ambiguous after `sq_0`.
After this PR, it will generate something like:
```
SELECT attr_30 AS key, attr_37 AS max
FROM (
SELECT attr_30, attr_37
FROM (
SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
FROM (
SELECT attr_30, attr_35 FROM
(SELECT key AS attr_30 FROM t1) AS sq_0
INNER JOIN
(SELECT key AS attr_35 FROM t1) AS sq_1
) AS sq_2
) AS sq_3
) AS sq_4
```
The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.
## How was this patch tested?
existing tests and new tests in LogicalPlanToSQLSuite.
Author: Wenchen Fan <[email protected]>
Closes apache#11658 from cloud-fan/gensql.
## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on apache#11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over apache#11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <[email protected]> Closes apache#11696 from cloud-fan/generate.
What changes were proposed in this pull request?
This PR tries to solve a fundamental issue in the
SQLBuilder. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. theJoinoperator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs.To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
For example,
SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:You can see, the
keycolumns become ambiguous aftersq_0.After this PR, it will generate something like:
The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.
How was this patch tested?
existing tests and new tests in LogicalPlanToSQLSuite.