[SPARK-34952][SQL] Aggregate (Min/Max/Count) push down for Parquet #32049
Conversation
Test build #136892 has finished for PR 32049 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Just curious - would anyone ever not want to push it down?
I'm surprised, I thought we already did this!
CC @cloud-fan
@srowen Hello Sean :)
Actually we only have filter push down for Parquet, not aggregate push down yet. I will probably change the default to true after this PR gets reviewed and fully tested.
Force-pushed e8c90af to 82b4592
Kubernetes integration test starting
Kubernetes integration test status failure
One suggestion: can we reuse PushableColumn, which is used by predicate pushdown, to capture the pushed columns?
Typo? SELECT (*)? Do you mean SELECT count(*) FROM table -> SELECT count(1) FROM table?
For the key methods added here, can you add some descriptive comments?
nit: aggResultToSparkInternalRows => createInternalRowFromAggResult?
values(i).asInstanceOf[Integer]? Or values(i).asInstanceOf[Long]? It is PrimitiveTypeName.INT64.
It should be values(i).asInstanceOf[Long]. Fixed.
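For illustration, a tiny hedged sketch of the distinction (the values array is hypothetical, standing in for the PR's aggregate-result buffer):

// Parquet INT64 statistics surface as boxed java.lang.Long values,
// so the unboxing cast must be to Long, not Integer.
val values: Array[Any] = Array(3542L)
val count = values(0).asInstanceOf[Long]   // correct for PrimitiveTypeName.INT64
// values(0).asInstanceOf[Integer] would throw a ClassCastException here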
So if we use aggregate pushdown for Parquet, we cannot use the vectorized Parquet reader, right? Can you describe this in the config doc too?
It seems that reading the aggregation result into a ColumnarBatch is supported below in buildColumnarReader. So we can still do aggregate push down with the vectorized reader enabled, right?
I think it doesn't matter whether the vectorized reader is enabled or not. Since we are reading the statistics information from the Parquet footer, we don't actually create a VectorizedReader. But if the columnar reader is enabled, we return a ColumnarBatch instead of an InternalRow.
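To make this concrete, here is a minimal hedged sketch (not the PR's code) of answering min/max/count for one INT64 column purely from footer metadata, using parquet-hadoop's ParquetFileReader; the column name c1 and single-file layout are assumptions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

def footerMinMaxCount(path: String): (Long, Long, Long) = {
  val reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(path), new Configuration()))
  try {
    var (min, max, rows) = (Long.MaxValue, Long.MinValue, 0L)
    reader.getFooter.getBlocks.asScala.foreach { block =>
      rows += block.getRowCount // per-row-group row count covers count(*)
      val col = block.getColumns.asScala
        .find(_.getPath.toDotString == "c1")
        .getOrElse(sys.error("column c1 not found"))
      val stats = col.getStatistics // footer statistics; no data pages read
      min = math.min(min, stats.genericGetMin.asInstanceOf[Long])
      max = math.max(max, stats.genericGetMax.asInstanceOf[Long])
    }
    (min, max, rows)
  } finally reader.close()
}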
Correct me if I misunderstand it.
It seems this method reads each block and then gets the aggregated result for each aggregate function, putting the results into an array.
Consider two aggregate functions, max(col1) and min(col2); the array content looks like [max(col1), min(col2)].
How does this handle the multi-block case? It seems this method appends the aggregated results sequentially, like [max(col1) for block1, min(col2) for block1, max(col1) for block2, min(col2) for block2, ...]?
Sorry, I didn't do this right. Will fix this.
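For what it's worth, a hedged sketch of the shape the fix might take: fold the per-block results into one entry per aggregate instead of appending one entry per block (the Agg types here are illustrative, not the PR's classes):

sealed trait Agg { def merge(other: Agg): Agg }
case class MaxAgg(v: Long) extends Agg {
  def merge(o: Agg): Agg = MaxAgg(math.max(v, o.asInstanceOf[MaxAgg].v))
}
case class MinAgg(v: Long) extends Agg {
  def merge(o: Agg): Agg = MinAgg(math.min(v, o.asInstanceOf[MinAgg].v))
}
case class CountAgg(v: Long) extends Agg {
  def merge(o: Agg): Agg = CountAgg(v + o.asInstanceOf[CountAgg].v)
}

// perBlock(i)(j) holds aggregate j computed from block i's footer stats;
// the result has exactly one entry per aggregate, regardless of block count.
def mergeBlocks(perBlock: Seq[Seq[Agg]]): Seq[Agg] =
  perBlock.reduce((a, b) => a.zip(b).map { case (x, y) => x.merge(y) })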
Test build #136897 has finished for PR 32049 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Shall it be an internal config or not? Do we expect this one to be user-facing and tuned frequently?
Thanks for reviewing!
I think this should be similar to PARQUET_FILTER_PUSHDOWN_ENABLED and be a user-facing config. I guess we can default it to true in the future after we have more testing.
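For reference, a hedged sketch of what the entry might look like inside SQLConf, mirroring PARQUET_FILTER_PUSHDOWN_ENABLED; the key string follows the PR description, while the doc text and default are assumptions:

// As it would appear inside the SQLConf object; buildConf is SQLConf's
// existing builder. Doc text and default here are illustrative, not final.
val PARQUET_AGGREGATE_PUSHDOWN_ENABLED =
  buildConf("spark.sql.parquet.aggregatePushdown")
    .doc("If true, MIN/MAX/COUNT aggregates are pushed down to Parquet and " +
      "answered from footer statistics when possible.")
    .booleanConf
    .createWithDefault(false)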
Just want to double-check: will Parquet always make sure the min/max statistics are present in the footer?
Good question. I actually need to check whether Parquet returns the min/max statistics. If not, I will either throw an exception or fall back to the non-pushdown path. I think falling back is the better solution.
I can't find a good way to fall back. We aren't able to read the footer until FilePartitionReaderFactory.createReader, which is when we get a partition of the file to read. It seems too late to fall back at that point. I looked at Presto's Parquet partial-aggregation implementation, and it throws an exception. I will throw an exception for now. If anybody has a better idea, please let me know.
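For illustration, a hedged sketch of the guard described above; ColumnChunkMetaData and Statistics.hasNonNullValue are real parquet-mr APIs, while the surrounding helper and message are made up:

import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData

// Illustrative guard: refuse to answer a pushed-down MIN/MAX from a column
// chunk whose footer carries no usable statistics.
def requireMinMaxStats(chunk: ColumnChunkMetaData): Unit = {
  val stats = chunk.getStatistics
  if (stats == null || !stats.hasNonNullValue) {
    throw new UnsupportedOperationException(
      s"No min/max statistics for column ${chunk.getPath.toDotString}; " +
        "cannot use aggregate push down for this file")
  }
}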
Test build #136902 has finished for PR 32049 at commit
Test build #136904 has finished for PR 32049 at commit
@huaxingao Can you briefly introduce the new aggregate pushdown framework? How do we push down aggregates through different operators and eventually hit the scan node? Do we support both partial+final and global aggregates?
@cloud-fan I will have a SPIP for this.
Force-pushed 4ecabfb to d9dc0ba
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137188 has finished for PR 32049 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137183 has finished for PR 32049 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137192 has finished for PR 32049 at commit
Is pushdown used?
If the children are all pushed-down Counts, I think they are all non-nullable, because Count is not nullable?
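A quick hedged check of that claim in a spark-shell (Count and Literal are Catalyst's real classes):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.expressions.aggregate.Count

// COUNT never returns NULL, and Catalyst encodes that: Count.nullable is
// false, so an aggregate whose children are all pushed-down counts can be
// marked non-nullable as well.
assert(!Count(Seq(Literal(1))).nullable)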
This is different from the other pushed-down aggregate functions. Can you add a comment here explaining why Count needs to override its updateExpressions?
Thanks for your comments. Based on our offline discussion, I will rewrite pushed-down Count as Sum to minimize the code change.
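A tiny illustration of why that rewrite is needed (the numbers are made up): once each split answers COUNT locally, the final aggregate must add the partials up, which is a Sum, not a Count:

// Hypothetical per-row-group counts returned by the data source.
val partialCounts = Seq(1000L, 2500L, 42L)
// Summing yields the true row count; counting the partials would give 3
// (the number of splits), which is why pushed-down Count becomes Sum.
val totalRows = partialCounts.sum
assert(totalRows == 3542L)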
// +- RelationV2[min(c1)#21, max(c1)#22] parquet file ...
var index = 0
val output = resultExpressions.map {
  case Alias(_, name) =>
Is this correct? If the query is SELECT max(c) + min(c) AS res FROM t, what we push down is max(c) and min(c), and the expected output of the scan relation should be max(c)#id and min(c)#id, not res#id.
One idea to construct the output:
val newOutput = scan.readSchema().toAttributes
val groupAttrs = groupingExpressions.zip(newOutput).map {
  case (a: Attribute, b) => b.withExprId(a.exprId)
  case (_, b) => b
}
val output = groupAttrs ++ newOutput.drop(groupAttrs.length)
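Reusing the grouping attributes' exprIds this way matters: operators above the scan still reference the original grouping attributes, so the new scan output has to carry the same IDs to stay resolvable.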
val aggregates = resultExpressions.flatMap { expr =>
  expr.collect {
    case agg: AggregateExpression =>
      replaceAlias(agg, getAliasMap(project)).asInstanceOf[AggregateExpression]
Since project.forall(_.isInstanceOf[AttributeReference]) holds, I don't think we need to de-alias anymore.
    translatedFilters: Seq[sources.Filter],
-   handledFilters: Seq[sources.Filter]) extends Scan {
+   handledFilters: Seq[sources.Filter],
+   pushedAggregates: Aggregation) extends Scan {
Why do we put it here if we are not able to support it?
val translatedAggregates = aggregates.map(DataSourceStrategy.translateAggregate)
val translatedGroupBys = groupBy.map(columnAsString)

val agg = Aggregation(translatedAggregates.flatten, translatedGroupBys.flatten)
I think we can only apply the pushdown if all the GROUP BY columns are supported. E.g. for GROUP BY a, substring(b), c, it's wrong to push down GROUP BY a, c.
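A hedged sketch of that all-or-nothing check, reusing the names from the diff above (columnAsString returning Option[String] is an assumption about its signature):

val translatedGroupBys: Seq[Option[String]] = groupBy.map(columnAsString)
val agg: Option[Aggregation] =
  if (translatedGroupBys.forall(_.isDefined)) {
    Some(Aggregation(translatedAggregates.flatten, translatedGroupBys.flatten))
  } else {
    None // e.g. GROUP BY a, substring(b), c: pushing down only a, c is wrong
  }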
Test build #140982 has finished for PR 32049 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140984 has finished for PR 32049 at commit
Test build #141033 has finished for PR 32049 at commit
retest this please
Kubernetes integration test starting
Test build #141034 has finished for PR 32049 at commit
retest this please
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #141035 has finished for PR 32049 at commit
Per offline discussion with @cloud-fan, we will split this PR into two PRs: the first will add the interfaces and APIs, and the second will add the Parquet implementation. I will close this PR for now. Thanks everyone for reviewing! Here is the first PR: #33352
What changes were proposed in this pull request?
Add interfaces and APIs to push down Aggregates to V2 Data Source.

Why are the changes needed?
Improve performance.

Does this PR introduce any user-facing change?
SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED was added. If this is set to true, Aggregates are pushed down to Data Source.

How was this patch tested?
New tests were added to test aggregates push down in #32049. The original PR is split into two PRs. This PR doesn't contain new tests.

Closes #33352 from huaxingao/aggPushDownInterface.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c561ee6)
What changes were proposed in this pull request?
Push down Min/Max/Count to Parquet.
Why are the changes needed?
Since Parquet has the statistics information for min, max, and count in its footer, we want to take advantage of this info and push Min/Max/Count down to the Parquet layer for better performance.
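As a hedged usage sketch (the key string corresponds to the SQLConf entry named below; the path and column are made up), a query like this could then be answered from footer statistics without scanning data pages:

// Assumes an existing SparkSession named spark.
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")
spark.read.parquet("/tmp/t").createOrReplaceTempView("t")
spark.sql("SELECT min(c1), max(c1), count(*) FROM t").show()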
Does this PR introduce any user-facing change?
Yes, SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED was added. If this is set to true, we will push down Min/Max/Count to Parquet.
How was this patch tested?
New tests were added.