Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Push down Min/Max/Count to Parquet

Why are the changes needed?

Since Parquet stores min, max, and row-count statistics in its file footer, we want to take advantage of this information and push Min/Max/Count down to the Parquet layer for better performance.
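
All of the needed values are already in the row-group metadata, so these aggregates can be answered without scanning any data pages. Below is a minimal sketch (not the code in this PR) of reading those footer statistics with the parquet-hadoop API; the helper name and the per-row-group printing are illustrative only.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Hypothetical helper: answer COUNT(*) and inspect per-row-group MIN/MAX for one
// column purely from footer metadata, without reading any data pages.
def footerStats(path: String, columnIndex: Int): Unit = {
  val inputFile = HadoopInputFile.fromPath(new Path(path), new Configuration())
  val reader = ParquetFileReader.open(inputFile)
  try {
    val blocks = reader.getFooter.getBlocks
    var rowCount = 0L
    for (i <- 0 until blocks.size()) {
      val block = blocks.get(i)
      rowCount += block.getRowCount // feeds COUNT(*)
      val stats = block.getColumns.get(columnIndex).getStatistics
      if (stats != null && stats.hasNonNullValue) { // statistics may be absent
        println(s"row group $i: min=${stats.genericGetMin}, max=${stats.genericGetMax}")
      }
    }
    println(s"total rows = $rowCount")
  } finally {
    reader.close()
  }
}
```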

Does this PR introduce any user-facing change?

Yes, SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED was added. If this is set to true, we will push down Min/Max/Count to Parquet.
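
A hypothetical usage sketch, assuming the key string backing SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED is "spark.sql.parquet.aggregatePushdown" and using an illustrative table path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")
// With the flag on, MIN/MAX/COUNT over a plain Parquet table can be answered
// from footer statistics instead of scanning the data.
spark.sql("SELECT MIN(c1), MAX(c1), COUNT(*) FROM parquet.`/tmp/t`").show()
```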

How was this patch tested?

New tests were added.

@github-actions github-actions bot added the SQL label Apr 4, 2021

SparkQA commented Apr 4, 2021

Test build #136892 has finished for PR 32049 at commit e8c90af.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 4, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41469/

Member

Just curious - would anyone ever not want to push it down?
I'm surprised, I thought we already did this!
CC @cloud-fan

Contributor Author

@srowen Hello Sean :)
Actually, we only have filter pushdown for Parquet, not aggregate pushdown yet. I will probably change the default to true after this PR has been reviewed and fully tested.

@huaxingao huaxingao force-pushed the parquet-agg-pushdown branch from e8c90af to 82b4592 Compare April 4, 2021 16:13

SparkQA commented Apr 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41474/


SparkQA commented Apr 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41474/

Member

One suggestion: can we reuse PushableColumn, which is used by predicate pushdown, to capture the pushed column?

Member

typo? SELECT (*)? You mean SELECT count(*) FROM table -> SELECT count(1) FROM table?

Member

For the key methods added here, can you add some descriptive comments?

Member

nit: aggResultToSparkInternalRows => createInternalRowFromAggResult?

Member

values(i).asInstanceOf[Integer]? Or values(i).asInstanceOf[Long]? It is PrimitiveTypeName.INT64.

Contributor Author

should be values(i).asInstanceOf[Long]. Fixed.
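
For illustration, a hedged sketch of the conversion under discussion (setAggValue and the row type are assumptions, not the PR's actual aggResultToSparkInternalRows): Parquet physical INT64 values surface as boxed Longs, so they must be unboxed as Long, not Integer.

```scala
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow

// Hypothetical helper: write an aggregated footer value into an InternalRow slot
// according to the Parquet physical type of the column.
def setAggValue(row: SpecificInternalRow, i: Int, value: Any, t: PrimitiveTypeName): Unit =
  t match {
    case PrimitiveTypeName.INT32 => row.setInt(i, value.asInstanceOf[Integer].intValue)
    case PrimitiveTypeName.INT64 => row.setLong(i, value.asInstanceOf[Long]) // the case discussed above
    case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
  }
```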

Member

So if we use aggregate pushdown for Parquet, we cannot use the vectorized Parquet reader, right? Can you describe that in the config doc too?

Contributor

It seems that reading the aggregation result into a ColumnarBatch is supported below in buildColumnarReader. So we can still do aggregation pushdown with the vectorized reader enabled, right?

Contributor Author

I think it doesn't matter whether the vectorized reader is enabled or not. Since we are reading the statistics from the Parquet footer, we don't actually create a VectorizedReader. But if the columnar reader is enabled, we return a ColumnarBatch instead of an InternalRow.
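
To illustrate the point (a hedged sketch, not the PR's buildColumnarReader; the names and the two-column shape are assumptions): the aggregated values come from the footer either way, and only the container differs. Here a single-row ColumnarBatch is built for the columnar path.

```scala
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Hypothetical sketch: wrap already-computed aggregate values (e.g. min and max
// read from the footer) into a one-row ColumnarBatch for the columnar code path.
def toColumnarBatch(minValue: Long, maxValue: Long): ColumnarBatch = {
  val schema = StructType(Seq(StructField("min", LongType), StructField("max", LongType)))
  val vectors = OnHeapColumnVector.allocateColumns(1, schema)
  vectors(0).putLong(0, minValue)
  vectors(1).putLong(0, maxValue)
  val batch = new ColumnarBatch(vectors.map(v => v: ColumnVector))
  batch.setNumRows(1)
  batch
}
```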

Comment on lines 393 to 395
Member

Correct me if I misunderstand it.

It seems that this method reads each block and then gets the aggregated result for each aggregate function. The aggregated results are put into an array.

Consider two aggregate functions, max(col1) and min(col2); the array content then looks like [max(col1), min(col2)].

How does this handle the case of more than one block? It seems this method appends the aggregated results sequentially, like [max(col1) for block1, min(col2) for block1, max(col1) for block2, min(col2) for block2, ...]?

Contributor Author

Sorry, I didn't do this right. Will fix this.
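
For reference, a minimal sketch of the merge behavior being requested (BlockStats and its Long-only fields are assumptions for illustration): per-row-group statistics should be folded into one value per aggregate, rather than appending one entry per (block, aggregate) pair.

```scala
// Hypothetical per-row-group statistics for a single Long column.
case class BlockStats(min: Long, max: Long, rowCount: Long)

// Fold all row groups into one (min, max, count) triple.
def mergeBlocks(blocks: Seq[BlockStats]): (Long, Long, Long) =
  blocks.foldLeft((Long.MaxValue, Long.MinValue, 0L)) { case ((mn, mx, cnt), b) =>
    (math.min(mn, b.min), math.max(mx, b.max), cnt + b.rowCount)
  }
```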


SparkQA commented Apr 4, 2021

Test build #136897 has finished for PR 32049 at commit 82b4592.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41479/


SparkQA commented Apr 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41479/


SparkQA commented Apr 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41481/


SparkQA commented Apr 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41481/

Contributor

Should this be an internal config or not? Do we expect it to be user-facing and tuned frequently?

Contributor Author

Thanks for reviewing!
I think this should be similar to PARQUET_FILTER_PUSHDOWN_ENABLED and be a user-facing config. I guess we can default it to true in the future after we have more testing.

Contributor

Just want to double check: Parquet will always make sure the min/max statistics are present in the footer, right?

Contributor Author

Good question. I actually need to check whether Parquet returns the min/max statistics. If not, I will either throw an exception or fall back to the non-pushdown path. I think falling back is the better solution.

Contributor Author
@huaxingao Apr 12, 2021

I can't find a good way to fall back. We won't be able to read the footer until FilePartitionReaderFactory.createReader, which is when we get a partition of the file to read, and that seems too late to fall back. I looked at Presto's Parquet partial-aggregation implementation, and it throws an exception. I will throw an exception for now. If anybody has a better idea, please let me know.
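
A minimal sketch of the behavior described above (the helper name and message are assumptions): if a row group has no usable statistics for the pushed column, fail loudly rather than silently returning a wrong answer, since falling back is no longer possible at this point.

```scala
import org.apache.parquet.column.statistics.Statistics

// Hypothetical guard: require footer statistics before using them for pushdown.
def requireStats(stats: Statistics[_], column: String): Statistics[_] = {
  if (stats == null || stats.isEmpty) {
    throw new UnsupportedOperationException(
      s"No min/max statistics available for column $column; aggregate pushdown cannot be applied")
  }
  stats
}
```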


SparkQA commented Apr 5, 2021

Test build #136902 has finished for PR 32049 at commit 246103c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 5, 2021

Test build #136904 has finished for PR 32049 at commit bf99c52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

@huaxingao Can you briefly introduce the new aggregate pushdown framework? How do we push down aggregates through different operators and eventually reach the scan node? Do we support both partial+final and global aggregates?

@huaxingao
Contributor Author

@cloud-fan I will have a SPIP for this.

@huaxingao huaxingao force-pushed the parquet-agg-pushdown branch from 4ecabfb to d9dc0ba Compare April 12, 2021 00:53

SparkQA commented Apr 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41761/


SparkQA commented Apr 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41761/


SparkQA commented Apr 12, 2021

Test build #137188 has finished for PR 32049 at commit d9dc0ba.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41767/


SparkQA commented Apr 12, 2021

Test build #137183 has finished for PR 32049 at commit 4ecabfb.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41771/


SparkQA commented Apr 12, 2021

Test build #137192 has finished for PR 32049 at commit a0ad0d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Is pushdown used?

Member

If the children are all pushed-down Counts, I think they are all non-nullable, because Count is not nullable?

Member

This is different from the other pushed-down aggregation functions. Can you add a comment here explaining why Count needs to override its updateExpressions?

Contributor Author

Thanks for your comments. Based on our offline discussion, I will rewrite the pushed-down Count as Sum to minimize code changes.
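
A small self-contained illustration of why that rewrite works (the DataFrame and its values are made up, not the PR's planner code): once the scan emits one pre-aggregated count per split, the aggregate above it must SUM those counts rather than COUNT the rows again.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object CountAsSumDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("count-as-sum").getOrCreate()
  import spark.implicits._

  // What the scan emits after pushdown: one pre-aggregated count per split (illustrative values).
  val perSplitCounts = Seq(100L, 250L, 42L).toDF("count_c1")

  // The aggregate above the scan sums the partial counts instead of counting rows again.
  val total = perSplitCounts.agg(sum($"count_c1")).first().getLong(0)
  println(total) // 392, the same answer COUNT(c1) would give over the original rows

  spark.stop()
}
```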

// +- RelationV2[min(c1)#21, max(c1)#22] parquet file ...
var index = 0
val output = resultExpressions.map {
case Alias(_, name) =>
Contributor
@cloud-fan Jul 12, 2021

Is this correct? If the query is SELECT max(c) + min(c) AS res FROM t, what we push down is max(c) and min(c), and the expected output of the scan relation should be max(c)#id and min(c)#id, instead of res#id.

Contributor

One idea to construct the output:

val newOutput = scan.readSchema().toAttributes
val groupAttrs = groupingExpressions.zip(newOutput).map {
  case (a: Attribute, b: Attribute) => b.withExprId(a.exprId)
  case (_, b) => b
}
val output = groupAttrs ++ newOutput.drop(groupAttrs.length)

val aggregates = resultExpressions.flatMap { expr =>
  expr.collect {
    case agg: AggregateExpression =>
      replaceAlias(agg, getAliasMap(project)).asInstanceOf[AggregateExpression]
Contributor

since project.forall(_.isInstanceOf[AttributeReference]), I don't think we need to de-alias any more.

    translatedFilters: Seq[sources.Filter],
-   handledFilters: Seq[sources.Filter]) extends Scan {
+   handledFilters: Seq[sources.Filter],
+   pushedAggregates: Aggregation) extends Scan {
Contributor

why do we put it here if we are not able to support it?

val translatedAggregates = aggregates.map(DataSourceStrategy.translateAggregate)
val translatedGroupBys = groupBy.map(columnAsString)

val agg = Aggregation(translatedAggregates.flatten, translatedGroupBys.flatten)
Contributor

I think we can only apply pushdown if all the group-by columns are supported. For example, with GROUP BY a, substring(b), c, it is wrong to push down GROUP BY a, c.
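
A minimal sketch of the check suggested above (the function name is an assumption): push the GROUP BY down only when every grouping expression is a plain column, so a single unsupported expression such as substring(b) disables pushdown for the whole GROUP BY.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}

// Hypothetical guard: push the GROUP BY down only if all grouping expressions
// are plain column references.
def canPushDownGroupBy(groupingExpressions: Seq[Expression]): Boolean =
  groupingExpressions.forall(_.isInstanceOf[AttributeReference])
```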


SparkQA commented Jul 13, 2021

Test build #140982 has finished for PR 32049 at commit 7540b59.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45497/


SparkQA commented Jul 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45498/


SparkQA commented Jul 13, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45498/


SparkQA commented Jul 14, 2021

Test build #140984 has finished for PR 32049 at commit 2c889c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 14, 2021

Test build #141033 has finished for PR 32049 at commit 5c2b630.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please


SparkQA commented Jul 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45547/


SparkQA commented Jul 14, 2021

Test build #141034 has finished for PR 32049 at commit 5c2b630.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please


SparkQA commented Jul 14, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45547/


SparkQA commented Jul 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45549/


SparkQA commented Jul 14, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45549/


SparkQA commented Jul 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45550/


SparkQA commented Jul 14, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45550/


SparkQA commented Jul 15, 2021

Test build #141035 has finished for PR 32049 at commit 5c2b630.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

Per offline discussion with @cloud-fan, we will split this PR into two PRs: the first one will add the interfaces and APIs, and the second one will add the Parquet implementation. I will close this PR for now. Thanks everyone for reviewing!

Here is the first PR #33352

@huaxingao huaxingao closed this Jul 15, 2021
cloud-fan pushed a commit that referenced this pull request Jul 26, 2021
### What changes were proposed in this pull request?
Add interfaces and APIs to push down Aggregates to V2 Data Source

### Why are the changes needed?
improve performance

### Does this PR introduce _any_ user-facing change?
SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED was added. If this is set to true, Aggregates are pushed down to Data Source.

### How was this patch tested?
New tests were added to test aggregates push down in #32049.  The original PR is split into two PRs. This PR doesn't contain new tests.

Closes #33352 from huaxingao/aggPushDownInterface.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 26, 2021 (cherry picked from commit c561ee6)