
Conversation

@wangyum (Member) commented Jul 17, 2021

What changes were proposed in this pull request?

  1. This PR adds a new logical plan visitor, DistinctAttributesVisitor, to find all the distinct attributes in the current logical plan. For example:

    spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
    spark.sql("SELECT a, b, a % 10, a AS aliased_a, max(c), sum(b) FROM t GROUP BY a, b").queryExecution.analyzed.distinctKeys

    The output is: {a#1, b#2}, {b#2, aliased_a#0}.

  2. Enhance RemoveRedundantAggregates to remove the aggregation above a left semi/anti join if the same aggregation has already been performed on the left side (a sketch of the rewrite follows the plans below). For example:

    set spark.sql.autoBroadcastJoinThreshold=-1; -- avoid PushDownLeftSemiAntiJoin
    create table t1 using parquet as select id a, id as b from range(10);
    create table t2 using parquet as select id as a, id as b from range(8);
    select t11.a, t11.b from (select distinct a, b from t1) t11 left semi join t2 on (t11.a = t2.a) group by t11.a, t11.b;

    Before this PR:

    == Optimized Logical Plan ==
    Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    +- Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
       :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
       :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
       :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
       +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
          +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
             +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
    

    After this PR:

    == Optimized Logical Plan ==
    Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
    :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
    :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
    +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
       +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
          +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
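
    The rewrite can be sketched as follows. This is a minimal illustration of the shape of the rule, not the exact patch: the object name is hypothetical, and it assumes the distinctKeys property introduced by this PR.

    import org.apache.spark.sql.catalyst.expressions.ExpressionSet
    import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
    import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi}
    import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical sketch; the real change extends RemoveRedundantAggregates.
    object RemoveAggregateAboveSemiAntiJoinSketch extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
        // A grouping-only Aggregate above a left semi/anti join is redundant when
        // the join's left side already guarantees distinctness on the grouping
        // keys: the join only filters left-side rows and never duplicates them.
        case Aggregate(groupingExprs, aggExprs, join @ Join(left, _, LeftSemi | LeftAnti, _, _))
            if aggExprs.forall(_.collectFirst { case _: AggregateExpression => () }.isEmpty) &&
              left.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExprs))) =>
          Project(aggExprs, join)
      }
    }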
    

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and TPC-DS benchmark runs:

SQL    Before this PR (seconds)    After this PR (seconds)
q14a   174                         165
q38    26                          23
q87    30                          26

@github-actions github-actions bot added the SQL label Jul 17, 2021
@SparkQA commented Jul 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45714/

@dongjoon-hyun (Member)

Thank you, @wangyum!

@SparkQA commented Jul 17, 2021

Test build #141202 has finished for PR 33404 at commit 5486d64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Also, cc @cloud-fan, @maropu, @viirya

@dongjoon-hyun (Member)

Thank you for adding more test cases, @wangyum.

@SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45744/

@SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45744/

@SparkQA commented Jul 19, 2021

Test build #141230 has finished for PR 33404 at commit 86a828a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46080/

@SparkQA commented Jul 23, 2021

Test build #141562 has finished for PR 33404 at commit a004207.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46110/

@SparkQA commented Jul 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46110/

@SparkQA commented Jul 24, 2021

Test build #141593 has finished for PR 33404 at commit 02a0bf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

We should use Set[ExpressionSet]. If we group by a, b and then select a, b, a as c, then the distinct keys should be Set([a, b], [c, b]).
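
For example (a hypothetical check using the distinctKeys API from this PR; the table name is illustrative):

    spark.sql("CREATE TABLE td(a int, b int) USING parquet")
    spark.sql("SELECT a, b, a AS c FROM td GROUP BY a, b").queryExecution.analyzed.distinctKeys
    // Expected: Set(ExpressionSet(a, b), ExpressionSet(c, b)) -- both key sets
    // uniquely identify a row, because c is just an alias of the grouping key a.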

@SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46272/

@SparkQA commented Jul 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46272/

@SparkQA commented Jul 28, 2021

Test build #141760 has finished for PR 33404 at commit 558aa31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46321/

@SparkQA commented Jul 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46321/

@SparkQA commented Jul 29, 2021

Test build #141808 has finished for PR 33404 at commit eb71b8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Aug 3, 2021

Another similar query, where the outer DISTINCT is redundant because the GROUP BY output is already distinct on the first two columns:

select distinct STATUS,RecordTypeId, count(oracle_id) as cnt_id from t group by 1,2

Contributor

Does it have to be join-specific? It looks like it should be able to handle any node. Ideally we could remove the entire case upper @ Aggregate(_, _, lower: Aggregate) branch.

Contributor

Ah, I now noticed that it was already discussed.

Contributor

I think we can generalize it. We can leverage the propagated distinct keys and remove a group-only aggregate (or turn it into a project) if the grouping columns are already distinct. A sketch follows.
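
A minimal sketch of that generalization, assuming the distinctKeys property from this PR (the object name is hypothetical):

    import org.apache.spark.sql.catalyst.expressions.ExpressionSet
    import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
    import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule

    object RemoveGroupOnlyAggregateSketch extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
        // A group-only Aggregate is a no-op when the child already guarantees
        // distinctness on the grouping columns, so it can become a Project.
        case Aggregate(groupingExprs, aggExprs, child)
            if aggExprs.forall(_.collectFirst { case _: AggregateExpression => () }.isEmpty) &&
              child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExprs))) =>
          Project(aggExprs, child)
      }
    }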

Contributor

I'd prefer this:

  /**
   * Add a new ExpressionSet S into the distinct keys D.
   * To minimize the size of D:
   * 1. If D already contains a subset of S, return D unchanged.
   * 2. Otherwise, remove every ExpressionSet in D that contains S, then add S.
   */
  private def addDistinctKey(
      keys: DistinctKeys,
      newExpressionSet: ExpressionSet): DistinctKeys = {
    if (keys.exists(_.subsetOf(newExpressionSet))) {
      keys
    } else {
      keys.filterNot(s => newExpressionSet.subsetOf(s)) + newExpressionSet
    }
  }

  /**
   * Propagate distinct keys through a project list.
   * For each alias in the project list, replace the corresponding expression in distinctKeys.
   * To minimize the size of distinctKeys, drop every ExpressionSet that is not a subset of
   * the project list's output.
   */
  private def projectDistinctKeys(
      keys: DistinctKeys,
      projectList: Seq[NamedExpression]): DistinctKeys = {
    val outputSet = ExpressionSet(projectList.map(_.toAttribute))
    val aliases = projectList.filter(_.isInstanceOf[Alias])
    if (aliases.isEmpty) return keys.filter(_.subsetOf(outputSet))

    val aliasedDistinctKeys = keys.map { expressionSet =>
      expressionSet.map { expression =>
        expression transform {
          case expr: Expression =>
            aliases
              .collectFirst { case a: Alias if a.child.semanticEquals(expr) => a.toAttribute }
              .getOrElse(expr)
        }
      }
    }
    aliasedDistinctKeys.collect {
      case es: ExpressionSet if es.subsetOf(outputSet) => ExpressionSet(es)
    }
  }

  override def visitAggregate(p: Aggregate): Set[ExpressionSet] = {
    val distinctKeysWithGrouping =
      addDistinctKey(p.child.distinctKeys, ExpressionSet(p.groupingExpressions))
    projectDistinctKeys(distinctKeysWithGrouping, p.aggregateExpressions)
  }
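
A hypothetical walk-through of those helpers, to make the intended behavior concrete (attribute names are illustrative):

    // Suppose the child reports distinctKeys = Set([a, b, c]) and we run
    // SELECT a, b ... GROUP BY a, b on top of it:
    //
    // addDistinctKey(Set([a, b, c]), [a, b])
    //   -> no existing key is a subset of [a, b], so drop its supersets
    //      ([a, b, c] is removed) and add it: Set([a, b])
    //
    // projectDistinctKeys(Set([a, b]), Seq(a, b))
    //   -> there are no aliases and [a, b] is a subset of the output {a, b},
    //      so the result is Set([a, b])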

@wangyum (Member Author) Oct 30, 2021

spark.sql("create table t1 (a int, b int, c int) using parquet")
spark.sql("select a, b, a as e, b as f from t1 group by a, b").queryExecution.analyzed.distinctKeys

For such a query, which distinct keys do you prefer?

Set(ExpressionSet(e#0, f#1))

or

Set(ExpressionSet(a#2, b#3), ExpressionSet(a#2, f#1), ExpressionSet(b#3, e#0), ExpressionSet(e#0, f#1))

Contributor

The latter.

Member Author

I updated the code as follows to support it:

  override def visitAggregate(p: Aggregate): Set[ExpressionSet] = {
    val groupingExps = ExpressionSet(p.groupingExpressions) // handle group by a, a
    val aggExpressions = p.aggregateExpressions.filter {
      case _: Attribute | _: Alias => true
      case _ => false
    }

    aggExpressions.toSet.subsets(groupingExps.size).filter { s =>
      groupingExps.subsetOf(ExpressionSet(s.map {
        case a: Alias => a.child
        case o => o
      }))
    }.map(s => ExpressionSet(s.map(_.toAttribute))).toSet
  }
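
Applied to the earlier example query, this enumerates as follows:

    // select a, b, a as e, b as f from t1 group by a, b
    // groupingExps   = ExpressionSet(a, b)        -- size 2
    // aggExpressions = Seq(a, b, a AS e, b AS f)  -- all attributes or aliases
    // 2-element subsets whose unaliased children cover {a, b}:
    //   {a, b}, {a, b AS f}, {b, a AS e}, {a AS e, b AS f}
    // Result: Set(ExpressionSet(a, b), ExpressionSet(a, f),
    //             ExpressionSet(b, e), ExpressionSet(e, f))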

Contributor

Sample can also propagate the distinct keys from its child.

Member Author

Fixed.

Contributor

Why not define this in DistinctKeyVisitor?

Member Author

Moved it to DistinctKeyVisitor.

@SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49282/

@SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49282/

@SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49284/

@SparkQA commented Nov 1, 2021

Test build #144812 has finished for PR 33404 at commit 92be175.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49284/

@SparkQA commented Nov 1, 2021

Test build #144814 has finished for PR 33404 at commit fc11208.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49292/

@SparkQA commented Nov 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49292/

@SparkQA commented Nov 2, 2021

Test build #144823 has finished for PR 33404 at commit f1dec16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 11, 2022
@cloud-fan cloud-fan removed the Stale label Feb 11, 2022
@cloud-fan (Contributor)

@wangyum do you have time to revisit this and make it pass all tests?

Contributor

nit: DistinctKeyVisitor seems a simpler and more general name

Contributor

BTW, I don't think the order of ExpressionSet matters, and Set[ExpressionSet] is better

Contributor

And shall we make it a trait so that LogicalPlan can extend it directly? Then we don't need LogicalPlanDistinctKeys. I think it's simpler if we only have one visitor implementation for distinct keys, which should be true.

Member Author

+1 to make it a trait.

Contributor

I think we should propagate the distinct keys from the child as well. This should be done in other places too, so we need to add a method for it, e.g.

 def addDistinctKey(keys: Set[AttributeSet], newExpressionSet: ExpressionSet): Set[AttributeSet]

The idea is: if keys already implies newExpressionSet (e.g. keys contains [a, b] and newExpressionSet is [a, b, c]), we can ignore newExpressionSet. Otherwise, we should clean up keys and add newExpressionSet: e.g. if keys contains [a, b, c] and newExpressionSet is [a, b], we should remove [a, b, c] from keys.
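
As a stand-alone model of that contract, here is a minimal sketch using plain Sets of attribute names in place of Catalyst's ExpressionSet (illustrative only):

    // keys "implies" newKey when some existing key is a subset of newKey.
    def addDistinctKey(keys: Set[Set[String]], newKey: Set[String]): Set[Set[String]] =
      if (keys.exists(_.subsetOf(newKey))) keys        // already implied: ignore newKey
      else keys.filterNot(newKey.subsetOf(_)) + newKey // drop supersets of newKey, then add it

    assert(addDistinctKey(Set(Set("a", "b")), Set("a", "b", "c")) == Set(Set("a", "b")))
    assert(addDistinctKey(Set(Set("a", "b", "c")), Set("a", "b")) == Set(Set("a", "b")))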

Contributor

Finally, we can simply write val distinctKeys = addDistinctKey(p.child.distinctKeys, ExpressionSet(p.groupingExpressions)) here.

Member Author

Are the distinct attributes related to the child's distinct attributes? For example:

create table t(a int, b int, c int, d int, e int) using parquet;
select a, b, c, sum(d) from (select distinct * from t) t1 group by a, b, c;

Contributor

This is mostly for the project list, so maybe projectDistinctKeys is a better name: def projectDistinctKeys(keys: Set[ExpressionSet], projectList: Seq[NamedExpression]): Set[ExpressionSet]

Member Author

+1

@cloud-fan (Contributor) Feb 21, 2022

Generating all the subsets looks quite inefficient. I think the logic here is:

  1. produce correct distinct keys w.r.t. the alias mapping in the project list;
  2. prune invalid distinct keys that are not output by the project list.
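
A small plain-Scala model of those two steps, with attribute names standing in for expressions (illustrative only):

    // SELECT a, b AS x: the alias map rewrites b to x; the output set is {a, x}.
    val aliasMap  = Map("b" -> "x")
    val output    = Set("a", "x")
    val keys      = Set(Set("a", "b"), Set("a", "c"))
    val projected = keys
      .map(_.map(e => aliasMap.getOrElse(e, e))) // 1. apply the alias mapping
      .filter(_.subsetOf(output))                // 2. prune keys not fully in the output
    // projected == Set(Set("a", "x")): {a, c} is dropped because c is not output.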

Member Author

Add a filter before building the subsets:

    val expressions = keys.flatMap(_.toSet)
    projectList.filter {
      case a: Alias => expressions.exists(_.semanticEquals(a.child))
      case ne: NamedExpression => expressions.exists(_.semanticEquals(ne))
    }.toSet.subsets(keys.map(_.size).min).filter { s =>
      val references = s.map {
        case a: Alias => a.child
        case ne => ne
      }
      keys.exists(_.equals(ExpressionSet(references)))
    }.map(s => AttributeSet(s.map(_.toAttribute))).toSet

import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeSet, ExpressionSet, NamedExpression}


trait QueryPlanDistinctKeys {
Contributor

I prefer something like this:

trait DistinctKeyVisitor extends LogicalPlanVisitor[Set[ExpressionSet]] { self: LogicalPlan =>

  lazy val distinctKeys: DistinctKeys = {
    if (check conf) {
      visit(self)
    } else {
      default(self)
    }
  }

  def visitXXX
}

Contributor

The benefit is that we can centralize the distinct key logic in this file, not in many LogicalPlan classes.

Member Author

New pull request: #35651

This pull request was closed.
