@cloud-fan
Contributor

What changes were proposed in this pull request?

This is a follow-up to #33352 that simplifies the JDBC aggregate pushdown:

  1. We should get the schema of the aggregate query by asking the JDBC server instead of calculating it ourselves. This simplifies the code considerably and is also more robust: the data type of SUM may vary across databases, so it is fragile to assume it always matches Spark's.
  2. Because of 1, we can now remove the dataType property from the public Sum expression.
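
To illustrate point 1, here is a minimal sketch in plain JDBC of asking the database for the result schema of an aggregate query. It is not Spark's implementation: `ResolvedColumn`, `AggregateSchemaProbe`, and the `agg_probe` alias are hypothetical names, and wrapping the query in a `WHERE 1=0` subquery is an assumption that will not suit every SQL dialect.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class: one output column as reported by the JDBC driver.
final class ResolvedColumn {
  final String label;     // column label of the result set
  final int jdbcType;     // a constant from java.sql.Types
  final String typeName;  // database-specific type name

  ResolvedColumn(String label, int jdbcType, String typeName) {
    this.label = label;
    this.jdbcType = jdbcType;
    this.typeName = typeName;
  }

  @Override
  public String toString() {
    return label + ": " + typeName;
  }
}

// Hypothetical helper for illustration; this is not Spark code.
final class AggregateSchemaProbe {
  // Wraps the aggregate query so that it returns no rows, runs it, and reads
  // the column metadata the database reports for the result.
  static List<ResolvedColumn> resolve(Connection conn, String aggQuery) throws SQLException {
    // Assumption: the dialect accepts a subquery alias without AS.
    String probe = "SELECT * FROM (" + aggQuery + ") agg_probe WHERE 1=0";
    try (PreparedStatement stmt = conn.prepareStatement(probe);
         ResultSet rs = stmt.executeQuery()) {
      return fromMetaData(rs.getMetaData());
    }
  }

  // Converts driver metadata into a list of columns, in result order.
  static List<ResolvedColumn> fromMetaData(ResultSetMetaData meta) throws SQLException {
    List<ResolvedColumn> columns = new ArrayList<>();
    for (int i = 1; i <= meta.getColumnCount(); i++) { // JDBC columns are 1-based
      columns.add(new ResolvedColumn(
          meta.getColumnLabel(i), meta.getColumnType(i), meta.getColumnTypeName(i)));
    }
    return columns;
  }
}
```

Calling `AggregateSchemaProbe.resolve(conn, "SELECT SUM(salary) FROM employees")` against a hypothetical `employees` table would return whatever type that particular database assigns to the sum, which is exactly the information the caller no longer has to guess.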

This PR also contains some small improvements:

  1. Spark should deduplicate the aggregate expressions before pushing them down.
  2. Improve the toString of public aggregate expressions to make them look more like SQL.
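
The two small improvements above can be sketched with a toy expression model. Everything here (`Expr`, `Lit`, `Agg`, `Add`, `PushDownSketch`) is hypothetical and deliberately much simpler than Spark's classes; it only shows the idea of collecting each distinct aggregate once and printing it in a SQL-like form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Toy expression model for illustration only; none of these are Spark classes.
interface Expr {
  List<Expr> children();
}

// An integer literal such as 1 or 2.
final class Lit implements Expr {
  final int value;
  Lit(int value) { this.value = value; }
  @Override public List<Expr> children() { return Collections.emptyList(); }
  @Override public String toString() { return Integer.toString(value); }
}

// An aggregate call over a single column, e.g. MAX(a) or SUM(DISTINCT a).
final class Agg implements Expr {
  final String func;
  final String column;
  final boolean distinct;

  Agg(String func, String column, boolean distinct) {
    this.func = func;
    this.column = column;
    this.distinct = distinct;
  }

  @Override public List<Expr> children() { return Collections.emptyList(); }

  // Value equality lets a set recognise two occurrences of the same aggregate.
  @Override public boolean equals(Object o) {
    if (!(o instanceof Agg)) return false;
    Agg other = (Agg) o;
    return func.equals(other.func) && column.equals(other.column) && distinct == other.distinct;
  }

  @Override public int hashCode() { return Objects.hash(func, column, distinct); }

  // SQL-like rendering instead of a default object dump.
  @Override public String toString() {
    return func + "(" + (distinct ? "DISTINCT " : "") + column + ")";
  }
}

// Binary addition, e.g. MAX(a) + 1.
final class Add implements Expr {
  final Expr left;
  final Expr right;
  Add(Expr left, Expr right) { this.left = left; this.right = right; }
  @Override public List<Expr> children() { return Arrays.asList(left, right); }
  @Override public String toString() { return left + " + " + right; }
}

final class PushDownSketch {
  // Walks every output expression and returns each aggregate once, in first-seen order.
  static List<Agg> distinctAggregates(List<Expr> outputs) {
    Set<Agg> seen = new LinkedHashSet<>();
    for (Expr output : outputs) {
      collect(output, seen);
    }
    return new ArrayList<>(seen);
  }

  private static void collect(Expr expr, Set<Agg> seen) {
    if (expr instanceof Agg) {
      seen.add((Agg) expr); // adding a duplicate leaves the set unchanged
      return;
    }
    for (Expr child : expr.children()) {
      collect(child, seen);
    }
  }
}
```

For a query shaped like `SELECT max(a) + 1, max(a) + 2 FROM ...`, the sketch collects a single `MAX(a)` to push down, and both output expressions can then refer to that one pushed-down column.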

Why are the changes needed?

Code and API simplification.

Does this PR introduce any user-facing change?

No, this API is not released yet.

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label Jul 29, 2021
@cloud-fan
Contributor Author

cc @huaxingao @viirya

@SparkQA

SparkQA commented Jul 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46361/

@SparkQA

SparkQA commented Jul 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46361/

case agg: AggregateExpression => agg
// Do not push down duplicated aggregate expressions. For example,
// `SELECT max(a) + 1, max(a) + 2 FROM ...`, we should only push down one
// `sum(a)` to the data source.
Contributor

nit: max(a)?

@huaxingao
Contributor

The Java linter failed because of an unused import of org.apache.spark.sql.types.DataType in Sum.

Thanks a lot for improving the code! Really appreciate your help!

- JDBCOptions.JDBC_UPPER_BOUND +
(JDBCOptions.JDBC_QUERY_STRING -> aggQuery))
try {
finalSchema = JDBCRDD.resolveTable(jdbcOptionsWithAggQuery)
Member

Oh, this is a good change.

@SparkQA

SparkQA commented Jul 29, 2021

Test build #141848 has finished for PR 33579 at commit 5523402.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jul 30, 2021

CI pending.

@SparkQA

SparkQA commented Jul 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46380/

@SparkQA

SparkQA commented Jul 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46380/

@viirya
Member

viirya commented Jul 30, 2021

Thanks. Merging to master/3.2.

@viirya viirya closed this in 387a251 Jul 30, 2021
viirya pushed a commit that referenced this pull request Jul 30, 2021
### What changes were proposed in this pull request?

This is a follow-up to #33352 that simplifies the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC server instead of calculating it ourselves. This simplifies the code considerably and is also more robust: the data type of SUM may vary across databases, so it is fragile to assume it always matches Spark's.
2. Because of 1, we can now remove the `dataType` property from the public `Sum` expression.

This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of public aggregate expressions to make them look more like SQL.

### Why are the changes needed?

code and API simplification

### Does this PR introduce _any_ user-facing change?

this API is not released yet.

### How was this patch tested?

existing tests

Closes #33579 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 387a251)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
@SparkQA

SparkQA commented Jul 30, 2021

Test build #141871 has finished for PR 33579 at commit 44a5583.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
