[SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown #33579

cloud-fan · 2021-07-29T16:18:49Z

What changes were proposed in this pull request?

This is a follow-up to #33352 that simplifies the JDBC aggregate pushdown:

1. We should get the schema of the aggregate query by asking the JDBC server instead of calculating it ourselves. This simplifies the code considerably and is also more robust: the data type of SUM may vary across databases, so it is fragile to assume it always matches Spark's.
2. Because of 1, we can now remove the `dataType` property from the public `Sum` expression.
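
For illustration, the idea in item 1 can be sketched with plain JDBC: wrap the aggregate query so that it returns no rows, run it, and read the column types the server reports. This is only a sketch and not the code in this PR. `AggSchemaProbe`, `probeQuery`, `describe`, and the alias `probe_subq` are hypothetical names, and the `WHERE 1=0` wrapper is an assumption about how a zero-row probe might be formed.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark.
class AggSchemaProbe {
    // Wrap the aggregate query so the server resolves its output columns
    // but returns no rows (assumed probe shape).
    static String probeQuery(String aggQuery) {
        return "SELECT * FROM (" + aggQuery + ") probe_subq WHERE 1=0";
    }

    // Ask the server which column labels and types the aggregate query yields,
    // instead of deriving them on the client side.
    static List<String> describe(Connection conn, String aggQuery) throws SQLException {
        List<String> columns = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(probeQuery(aggQuery));
             ResultSet rs = stmt.executeQuery()) {
            ResultSetMetaData md = rs.getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                columns.add(md.getColumnLabel(i) + ": " + md.getColumnTypeName(i));
            }
        }
        return columns;
    }
}
```

With a live connection, `describe(conn, "SELECT SUM(x) FROM t")` would report whatever type that particular database assigns to the sum, which is the reason a client-side `dataType` on `Sum` becomes unnecessary.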

This PR also contains some small improvements:

1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of the public aggregate expressions to make them more SQL-like.
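
As a rough illustration of both points, the sketch below deduplicates a list of aggregates while preserving order and renders each one in a SQL-like form. It is not Spark's implementation: `AggPushdownSketch`, `Agg`, and `distinctAggs` are hypothetical names, and treating two aggregates as equal when the function name and column match is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Objects;

// Hypothetical stand-in for the pushdown logic, not Spark's classes.
class AggPushdownSketch {
    // Minimal aggregate descriptor: a function name applied to one column.
    static final class Agg {
        final String func;
        final String column;

        Agg(String func, String column) {
            this.func = func;
            this.column = column;
        }

        // Assumption: same function (case-insensitive) on the same column is a duplicate.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Agg)) return false;
            Agg other = (Agg) o;
            return func.equalsIgnoreCase(other.func) && column.equals(other.column);
        }

        @Override
        public int hashCode() {
            return Objects.hash(func.toLowerCase(Locale.ROOT), column);
        }

        // SQL-like rendering, e.g. MAX(a).
        @Override
        public String toString() {
            return func.toUpperCase(Locale.ROOT) + "(" + column + ")";
        }
    }

    // Drop duplicates while keeping the first occurrence of each aggregate in order.
    static List<Agg> distinctAggs(List<Agg> aggs) {
        return new ArrayList<>(new LinkedHashSet<>(aggs));
    }
}
```

For a query such as `SELECT max(a) + 1, max(a) + 2 FROM ...`, the two `max(a)` entries collapse into one, so only a single `MAX(a)` would be sent to the data source.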

Why are the changes needed?

Code and API simplification.

Does this PR introduce any user-facing change?

No, this API is not released yet.

How was this patch tested?

Existing tests.

cloud-fan · 2021-07-29T16:20:21Z

cc @huaxingao @viirya

SparkQA · 2021-07-29T17:13:38Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46361/

SparkQA · 2021-07-29T17:47:52Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46361/

huaxingao · 2021-07-29T18:21:23Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

-                  case agg: AggregateExpression => agg
+                  // Do not push down duplicated aggregate expressions. For example,
+                  // `SELECT max(a) + 1, max(a) + 2 FROM ...`, we should only push down one
+                  // `sum(a)` to the data source.


nit: max(a)?

huaxingao · 2021-07-29T19:08:17Z

The Java linter failed because of an unused import, `org.apache.spark.sql.types.DataType`, in `Sum`.

Thanks a lot for improving the code! Really appreciate your help!

viirya · 2021-07-29T19:17:16Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala

+        - JDBCOptions.JDBC_UPPER_BOUND +
+        (JDBCOptions.JDBC_QUERY_STRING -> aggQuery))
+    try {
+      finalSchema = JDBCRDD.resolveTable(jdbcOptionsWithAggQuery)


Oh, this is a good change.

SparkQA · 2021-07-29T21:27:34Z

Test build #141848 has finished for PR 33579 at commit 5523402.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-07-30T03:15:28Z

CI pending.

SparkQA · 2021-07-30T04:18:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46380/

SparkQA · 2021-07-30T05:09:05Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46380/

viirya · 2021-07-30T07:26:06Z

Thanks. Merging master/3.2.

### What changes were proposed in this pull request?

This is a followup of #33352, to simplify the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC server, instead of calculating it by ourselves. This can simplify the code a lot, and is also more robust: the data type of SUM may vary in different databases, it's fragile to assume they are always the same as Spark.
2. because of 1, now we can remove the `dataType` property from the public `Sum` expression.

This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of public aggregate expressions to make them more SQL.

### Why are the changes needed?

code and API simplification

### Does this PR introduce _any_ user-facing change?

this API is not released yet.

### How was this patch tested?

existing tests

Closes #33579 from cloud-fan/dsv2.

Authored-by: Wenchen Fan
Signed-off-by: Liang-Chi Hsieh
(cherry picked from commit 387a251)
Signed-off-by: Liang-Chi Hsieh

SparkQA · 2021-07-30T08:46:05Z

Test build #141871 has finished for PR 33579 at commit 44a5583.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

simplify JDBC aggregate pushdown

5523402

github-actions bot added the SQL label Jul 29, 2021

huaxingao reviewed Jul 29, 2021

huaxingao approved these changes Jul 29, 2021

viirya reviewed Jul 29, 2021

viirya approved these changes Jul 29, 2021

simplification

44a5583

viirya approved these changes Jul 30, 2021

viirya closed this in 387a251 Jul 30, 2021

4 participants