[SPARK-36194][SQL] Add a logical plan visitor to propagate the distinct attributes #35779

Closed

wangyum wants to merge 16 commits into apache:master from wangyum:SPARK-36194

Member

wangyum commented Mar 9, 2022

What changes were proposed in this pull request?

This PR adds a new logical plan visitor named DistinctKeyVisitor that finds all the distinct attributes in the current logical plan. For example:

spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
spark.sql("SELECT a, b, a % 10, max(c), sum(b) FROM t GROUP BY a, b").queryExecution.analyzed.distinctKeys

The output is: {a#1, b#2}.
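The propagation idea can be sketched without Spark. The following is a toy model under stated assumptions, not the actual DistinctKeyVisitor: attributes are plain strings, and `ToyPlan`, `ToyRelation`, `ToyAggregate`, `ToyFilter`, and `ToyDistinctKeys` are hypothetical names invented for illustration.

```scala
// Toy model only: none of these types exist in Spark.
sealed trait ToyPlan
final case class ToyRelation(output: Seq[String]) extends ToyPlan
final case class ToyAggregate(groupingKeys: Seq[String], child: ToyPlan) extends ToyPlan
final case class ToyFilter(condition: String, child: ToyPlan) extends ToyPlan

object ToyDistinctKeys {
  // Each element of the result is one set of attributes whose combined
  // value is unique across the rows produced by the plan.
  def distinctKeys(plan: ToyPlan): Set[Set[String]] = plan match {
    // Nothing is known about a raw table in this toy model.
    case ToyRelation(_) => Set.empty
    // GROUP BY emits one row per combination of grouping keys.
    case ToyAggregate(keys, _) => Set(keys.toSet)
    // Dropping rows cannot break uniqueness, so the child's keys survive.
    case ToyFilter(_, child) => distinctKeys(child)
  }
}
```

With `ToyAggregate(Seq("a", "b"), ToyRelation(Seq("a", "b", "c")))` as input, `distinctKeys` reports the single key set containing `a` and `b`, which mirrors the `{a#1, b#2}` output above.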

It also enhances RemoveRedundantAggregates to remove an aggregation when it is groupOnly and its child already guarantees that the grouping keys are distinct. For example:

set spark.sql.autoBroadcastJoinThreshold=-1; -- avoid PushDownLeftSemiAntiJoin
create table t1 using parquet as select id a, id as b from range(10);
create table t2 using parquet as select id as a, id as b from range(8);
select t11.a, t11.b from (select distinct a, b from t1) t11 left semi join t2 on (t11.a = t2.a) group by t11.a, t11.b;

Before this PR:

== Optimized Logical Plan ==
Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
+- Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
   :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
   :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
   :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
   +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
      +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
         +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)

After this PR:

== Optimized Logical Plan ==
Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
:- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
:  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
:     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
+- Project [a#8L], Statistics(sizeInBytes=984.0 B)
   +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
      +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
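The plan change above can be imitated with a small Spark-free model. This is only a sketch: `MiniPlan`, `MiniScan`, `MiniGroupOnly`, `MiniLeftSemiJoin`, and `MiniRemoveRedundantAggregates` are hypothetical names, attributes are strings, and the toy ignores the projection that a real rewrite would need to keep the output columns unchanged.

```scala
// Toy model only: these types mimic the shape of the rule, not Spark's classes.
sealed trait MiniPlan { def distinctKeys: Set[Set[String]] }

final case class MiniScan(table: String) extends MiniPlan {
  // Nothing is known about a raw table.
  def distinctKeys: Set[Set[String]] = Set.empty
}
final case class MiniGroupOnly(keys: Set[String], child: MiniPlan) extends MiniPlan {
  // A group-only aggregate outputs each key combination exactly once.
  def distinctKeys: Set[Set[String]] = Set(keys)
}
final case class MiniLeftSemiJoin(left: MiniPlan, right: MiniPlan) extends MiniPlan {
  // A left semi join only drops left rows, so the left side's keys stay distinct.
  def distinctKeys: Set[Set[String]] = left.distinctKeys
}

object MiniRemoveRedundantAggregates {
  def apply(plan: MiniPlan): MiniPlan = plan match {
    // The child is already distinct on a subset of the grouping keys:
    // grouping again cannot merge any rows, so the aggregate is dropped.
    case MiniGroupOnly(keys, child) if child.distinctKeys.exists(_.subsetOf(keys)) =>
      apply(child)
    case MiniGroupOnly(keys, child) => MiniGroupOnly(keys, apply(child))
    case MiniLeftSemiJoin(left, right) => MiniLeftSemiJoin(apply(left), apply(right))
    case scan: MiniScan => scan
  }
}
```

Applied to `MiniGroupOnly(Set("a", "b"), MiniLeftSemiJoin(MiniGroupOnly(Set("a", "b"), MiniScan("t1")), MiniScan("t2")))`, the outer aggregate disappears while the inner one over `t1` stays, matching the before/after plans.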

Why are the changes needed?

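Improve query performance by removing aggregations that are redundant because the child plan's output is already distinct on the grouping keys.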

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and a TPC-DS benchmark:

| SQL | Before this PR (seconds) | After this PR (seconds) |
| --- | --- | --- |
| q14a | 206 | 193 |
| q38 | 59 | 41 |
| q87 | 127 | 113 |

github-actions bot added the SQL label

wangyum mentioned this pull request

[SPARK-36194][SQL] Add a logical plan visitor to propagate the distinct attributes #35651

Closed

wangyum force-pushed the SPARK-36194 branch from 06bf75b to a900b20 Compare

March 9, 2022 03:29

wangyum closed this

wangyum reopened this

wangyum closed this

wangyum reopened this

cloud-fan reviewed

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregates.scala Outdated

Contributor

cloud-fan Mar 9, 2022

I still don't get why child.deterministic is required. If the child output is completely random, then it should not report any distinct keys.

Member Author

wangyum Mar 9, 2022

On second thought, it is not required.

cloud-fan reviewed

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregates.scala Outdated

Contributor

cloud-fan Mar 9, 2022

isn't it just a.foldable?

Member Author

wangyum Mar 9, 2022

No. Alias isn't foldable, but its child is foldable.

cloud-fan reviewed

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/DistinctKeyVisitor.scala Outdated

cloud-fan reviewed

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/DistinctKeyVisitor.scala Outdated

Contributor

cloud-fan Mar 9, 2022

I think we should only do this filter in if (aliases.isEmpty). The distinct key can be a + b and the project list may have a + b AS c, then the result distinct key should be c.

Member Author

wangyum Mar 9, 2022

Line 49 also needs it: https://github.com/apache/spark/blob/a900b20c03488a53dd67594b0fd5509281f1aa72/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/DistinctKeyVisitor.scala#L49

Contributor

cloud-fan Mar 9, 2022

Then do the filter again in L49. It's incorrect to do this filter this early.

Member Author

wangyum Mar 10, 2022

OK
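The ordering issue in this thread (rewrite keys through aliases first, then drop keys that are not fully part of the output) can be illustrated with a Spark-free sketch. Expressions are plain strings here, `ToyProjectKeys` and `projectDistinctKeys` are hypothetical names, and the sketch assumes each child expression appears at most once in the project list.

```scala
object ToyProjectKeys {
  // projectList maps each output name to the child expression it computes;
  // a column that is passed through unchanged maps to itself.
  def projectDistinctKeys(
      childKeys: Set[Set[String]],
      projectList: Map[String, String]): Set[Set[String]] = {
    // Invert the project list: child expression -> output name.
    val exprToOutput = projectList.map { case (out, expr) => expr -> out }
    childKeys
      // Step 1: rewrite every member of a key to its output name, if it has one.
      .map(key => key.map(expr => exprToOutput.get(expr)))
      // Step 2: only now drop keys with a member that is missing from the output.
      .collect { case key if key.forall(_.isDefined) => key.flatten }
  }
}
```

For a child key `a + b` and a project list containing `a + b AS c`, the sketch yields the key `c`. Filtering on output attributes before the rewrite would have discarded that key, because `a + b` is not itself an output attribute.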

cloud-fan reviewed

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/DistinctKeyVisitor.scala Outdated

Contributor

cloud-fan Mar 9, 2022

Why is p.deterministic required?

Member Author

wangyum Mar 9, 2022

Removed it.

cloud-fan reviewed

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/DistinctKeyVisitor.scala Outdated

Contributor

cloud-fan Mar 9, 2022

ditto

Member Author

wangyum Mar 9, 2022

Removed it.

cloud-fan reviewed

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/DistinctKeyVisitor.scala Outdated

Contributor

cloud-fan Mar 9, 2022

isn't p.child.distinctKeys already a Set[ExpressionSet]?

Member Author

wangyum Mar 9, 2022

Yes.

cloud-fan reviewed

.../src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala Outdated

Contributor

cloud-fan Mar 9, 2022

hmm, how does this work? groupBy('a)('a, TrueLiteral) is not group only and not literal-only.

Member Author

wangyum Mar 9, 2022

RemoveRedundantAggregates already supported this case before this PR.

Contributor

cloud-fan Mar 9, 2022

ah this works because these two aggregates are adjacent. If they are not, we have a problem.

I'm thinking that we should refine Aggregate.groupOnly: .groupBy('a)('a, TrueLiteral) is also group only.

Member Author

wangyum Mar 10, 2022

#35795 to address this.
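The suggested refinement can be sketched as follows. This is a toy model of the idea only, not the definition of `Aggregate.groupOnly` in Spark and not necessarily what #35795 implements; `ToyExpr`, `ToyAttr`, `ToyLiteral`, `ToyAggCall`, and `ToyGroupOnly` are hypothetical names, and the `strict` variant is just one possible narrow reading used for contrast.

```scala
// Toy expression model: none of these types exist in Spark.
sealed trait ToyExpr
final case class ToyAttr(name: String) extends ToyExpr
final case class ToyLiteral(value: Any) extends ToyExpr
final case class ToyAggCall(function: String, argument: ToyExpr) extends ToyExpr

object ToyGroupOnly {
  // Narrow reading: every output expression must be a grouping expression.
  def strict(grouping: Seq[ToyExpr], output: Seq[ToyExpr]): Boolean =
    output.forall(e => grouping.contains(e))

  // Relaxed reading: literals are also allowed, because a constant column
  // does not depend on which rows ended up in a group.
  def relaxed(grouping: Seq[ToyExpr], output: Seq[ToyExpr]): Boolean =
    output.forall {
      case _: ToyLiteral => true
      case e => grouping.contains(e)
    }
}
```

Under the relaxed reading, the toy counterpart of `.groupBy('a)('a, TrueLiteral)` counts as group only, while an output list containing an aggregate function call such as `max(c)` still does not.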

cloud-fan reviewed

.../src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala Outdated

Contributor

cloud-fan Mar 9, 2022

what can go wrong if we optimize this case?

Member Author

wangyum Mar 9, 2022

We can optimize this case.

cloud-fan reviewed

sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q38/explain.txt Outdated

Contributor

cloud-fan Mar 9, 2022

Can you summarize the plan changes? Do we have fewer shuffles now?

Member Author

wangyum Mar 9, 2022

q14a/q14b: 1 fewer Exchange and 2 fewer HashAggregates.
q38: 3 fewer Exchanges and 4 fewer HashAggregates.

cloud-fan approved these changes

Contributor

cloud-fan left a comment

LGTM except for some minor comments

Contributor

cloud-fan commented Mar 9, 2022

Do we see regressions in any TPCDS queries?

wangyum added 11 commits

March 9, 2022 21:45


Remove the aggregation from left semi/anti join if the same aggregation has already been done on left side

e5af37c

Add more test

b703ae2

grouping -> groupingExps

6c0bb58

Add DistinctAttributesVisitor

fc52571

Fix test name

8be677b

Improve DistinctAttributesVisitor

16e55c1

Fix test.

e9f28d8

DistinctKeyVisitor

e9bf2fe

Address comments

200e042

Fix scala 2.13

Address comments

e12fd14

wangyum added 2 commits

March 9, 2022 21:45


Address all comments

33db6df

Address all comments

19f7d72

wangyum force-pushed the SPARK-36194 branch from 06bfe26 to 19f7d72 Compare

March 9, 2022 13:46

fix

a7ce14d

Member Author

wangyum commented Mar 10, 2022

> Do we see regressions in any TPCDS queries?

There is no regression.


Merge remote-tracking branch 'upstream/master' into SPARK-36194-1234567

48a9a79

cloud-fan reviewed

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregates.scala Outdated

    
                      if agg.groupOnly && child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExps))) =>

                    Project(agg.aggregateExpressions, child)

                  case agg @ Aggregate(groupingExps, aggregateExps, child)

Contributor

cloud-fan Mar 11, 2022

Do we still need this case? agg.groupOnly should cover it.

Member Author

wangyum Mar 11, 2022

Right, we do not need it.


Remove a case

e662655

cloud-fan approved these changes

wangyum closed this in

c16a66a

Member Author

wangyum commented Mar 14, 2022

Merged to master.

wangyum deleted the SPARK-36194 branch

March 14, 2022 13:59

cloud-fan mentioned this pull request

[SPARK-36194][SQL][FOLLOWUP] Propagate distinct keys more precisely #36100

Closed

wangyum pushed a commit that referenced this pull request


[SPARK-36194][SQL][FOLLOWUP] Propagate distinct keys more precisely

fbe82fb

### What changes were proposed in this pull request?

This PR is a followup of #35779 , to propagate distinct keys more precisely in 2 cases:
1. For `LIMIT 1`, each output attribute is a distinct key, not the entire tuple.
2. For aggregate, we can still propagate distinct keys from child.
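The two cases can be sketched in a Spark-free way. This is one reading of the description, not the follow-up's actual code: attributes are strings, and `ToyFollowup`, `limitOneKeys`, and `aggregateKeys` are hypothetical names.

```scala
object ToyFollowup {
  // LIMIT 1: the result has at most one row, so every single output
  // attribute is trivially unique on its own.
  def limitOneKeys(output: Seq[String]): Set[Set[String]] =
    output.map(attr => Set(attr)).toSet

  // Aggregate: besides the grouping keys, any child key that is still fully
  // visible in the aggregate's output remains a distinct key.
  def aggregateKeys(
      grouping: Set[String],
      output: Set[String],
      childKeys: Set[Set[String]]): Set[Set[String]] = {
    val fromChild = childKeys.filter(_.subsetOf(output))
    // Report the grouping keys only if the output exposes all of them.
    if (grouping.subsetOf(output)) fromChild + grouping else fromChild
  }
}
```

For a child that is already distinct on `a`, grouping by `a, b` reports both `{a}` and `{a, b}` in this sketch, so the narrower key from the child is kept.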

### Why are the changes needed?

make the optimization cover more cases

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #36100 from cloud-fan/followup.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>

wangyum pushed a commit that referenced this pull request


[SPARK-36194][SQL][FOLLOWUP] Propagate distinct keys more precisely

bdf76b6

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit fbe82fb)
Signed-off-by: Yuming Wang <[email protected]>

ulysses-you mentioned this pull request

[SPARK-38932][SQL] Datasource v2 support report distinct keys #36253

Closed

ulysses-you mentioned this pull request

[GLUTEN-4421][VL] Disable flushable aggregate when input is already partitioned by grouping keys apache/incubator-gluten#4443

Merged

Labels

SQL