[SPARK-38832][SQL] Remove unnecessary distinct in aggregate expression by distinctKeys #36117
Conversation
```scala
.rebalance().groupBy()(countDistinct($"a") as "x", sumDistinct($"a") as "y").analyze
comparePlans(Optimize.execute(q2), q2)
```

```scala
// avoid remove double data type attr
```
what will go wrong if we optimize this case as well?
Physical Aggregate will wrap NormalizeNaNAndZero for float/double to handle NaN and -0.0, so its result value might differ from the original expression?
Never mind, the child will also do the same thing, so we do not need to consider this here.
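To illustrate why the concern cancels out, here is a hedged sketch (it assumes a SparkSession named `spark` with implicits available; none of this is from the diff): Spark normalizes -0.0 and treats NaN values as equal both in the child's distinct and in the distinct aggregate, so the two forms agree.

```scala
import org.apache.spark.sql.functions.countDistinct
import spark.implicits._  // assumes `spark` is an existing SparkSession

// 0.0 / -0.0 collapse to one value and the two NaNs collapse to one value,
// both in the distinct aggregate and in the child distinct.
val df = Seq(0.0, -0.0, Double.NaN, Double.NaN).toDF("d")

df.agg(countDistinct($"d")).show()             // 2
df.distinct().agg(countDistinct($"d")).show()  // also 2
```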
wangyum left a comment:
It doesn't work because EliminateDistinct is only executed once, and it runs before EliminateSubqueryAliases:
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases ===
 Aggregate [count(distinct id#3L) AS count(DISTINCT id)#5L]   Aggregate [count(distinct id#3L) AS count(DISTINCT id)#5L]
!+- SubqueryAlias __auto_generated_subquery_name              +- Aggregate [id#3L], [id#3L]
!   +- Aggregate [id#3L], [id#3L]                                 +- Relation default.t[id#3L] parquet
!      +- SubqueryAlias spark_catalog.default.t
!         +- Relation default.t[id#3L] parquet
```
@wangyum good catch, seems …
```scala
case ae: AggregateExpression if ae.isDistinct &&
    agg.child.distinctKeys.exists(
      _.subsetOf(ExpressionSet(ae.aggregateFunction.children.filterNot(_.foldable)))) =>
```
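For context, here is a hedged sketch of a complete rule built around this case branch; the object name, the surrounding `transform`, and the rewrite to `isDistinct = false` are my reading of the snippet rather than a verbatim quote of the patch.

```scala
import org.apache.spark.sql.catalyst.expressions.ExpressionSet
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hedged sketch of a rule shaped like the snippet above; the real rule in the
// patch may differ in structure and tree-pattern pruning.
object RemoveRedundantDistinctSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg: Aggregate =>
      agg.transformExpressions {
        // When the child's distinct keys already cover the aggregate function's
        // non-foldable children, the DISTINCT flag carries no extra information.
        case ae: AggregateExpression if ae.isDistinct &&
            agg.child.distinctKeys.exists(
              _.subsetOf(ExpressionSet(ae.aggregateFunction.children.filterNot(_.foldable)))) =>
          ae.copy(isDistinct = false)
      }
  }
}
```

Matching on the enclosing `Aggregate` is what makes `agg.child.distinctKeys` available to the guard.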
Is this correct? If the input plan to this rule is:

```sql
SELECT a, count(distinct c) FROM (
  SELECT distinct a, b, c
  FROM t
)
GROUP BY a
```

Will the added case branch rewrite the plan to

```sql
SELECT a, count(c) FROM (
  SELECT distinct a, b, c
  FROM t
)
GROUP BY a
```

`agg.child.distinctKeys` is {a, b, c}, and `ExpressionSet(ae.aggregateFunction.children.filterNot(_.foldable))` is {c}.
The distinctKeys of `distinct a, b, c` is `ExpressionSet(a, b, c)`, not `ExpressionSet(a)`, `ExpressionSet(b)`, `ExpressionSet(c)`.
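To make the set-of-sets point concrete, here is a minimal sketch with plain Scala sets standing in for `ExpressionSet` (the names are illustrative, not from the patch):

```scala
// distinctKeys is a set of key sets; the rewrite only fires when some whole
// key set is covered by the aggregate function's non-foldable children.
val distinctKeys: Set[Set[String]] = Set(Set("a", "b", "c")) // from SELECT DISTINCT a, b, c
val aggChildren: Set[String] = Set("c")                      // from count(distinct c)

val canRemoveDistinct = distinctKeys.exists(_.subsetOf(aggChildren))
assert(!canRemoveDistinct) // {a, b, c} is not a subset of {c}, so the distinct is kept
```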
You're right. I forgot distinctKeys is a set of sets.
How about:
```scala
agg.child.distinctKeys.exists(
  key => key.nonEmpty &&
    key.subsetOf(ExpressionSet(ae.aggregateFunction.children.filterNot(_.foldable))))
```

Alternatively, we can add a `require` here to make sure that we never return an empty key:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanDistinctKeys.scala#L32
Makes sense, I added a `require` in LogicalPlanDistinctKeys.
thanks, merging to master!
thank you @cloud-fan @wangyum @sigmod |
```diff
-  if (conf.getConf(PROPAGATE_DISTINCT_KEYS_ENABLED)) DistinctKeyVisitor.visit(self) else Set.empty
+  if (conf.getConf(PROPAGATE_DISTINCT_KEYS_ENABLED)) {
+    val keys = DistinctKeyVisitor.visit(self)
+    require(keys.forall(_.nonEmpty))
```
Do we really need this require? It looks fine to have an empty set as the distinct keys, e.g. a global aggregate without grouping keys. It means that the entire data set is distinct (has at most one row), and EliminateDistinct is OK with an empty set in distinct keys.
I think it's more about avoiding unexpected behavior. It would be a correctness issue if other operators returned an empty distinct key. And as you mentioned, the global aggregate is already optimized by EliminateDistinct and OptimizeOneRowPlan, so it's fine?
My point is that DistinctKeyVisitor does not work with global aggregates now, and an empty expression set is still a valid distinct key, so why do we forbid it?
We have already forbidden it inside DistinctKeyVisitor. Do you think we should support that case?

Line 50 in a67acba:

```scala
}.filter(_.nonEmpty)
```
This is only done in the `else` branch, not the `if` branch. I think we have two options:
- keep the requirement, and add the `filter` in the `if` branch as well
- remove the requirement, and remove the `filter` from the `else` branch.
I prefer option 2, as I think an empty expression set does mean something as a distinct key and we should not ignore this information. It also works the same as other distinct keys (see the sketch after this list):
- It can replace all other distinct keys as it's a subset of any expression set
- It can satisfy any distinct key requirement, e.g. remove unnecessary distinct in aggregate functions.
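A small sketch of these two properties, again with plain Scala sets standing in for `ExpressionSet` (names are illustrative):

```scala
val emptyKey: Set[String] = Set.empty      // distinct key of a global aggregate: at most one row
val otherKey: Set[String] = Set("a", "b")  // some other distinct key
val aggChildren: Set[String] = Set("c")    // children of a distinct aggregate function

assert(emptyKey.subsetOf(otherKey))    // it can replace any other distinct key
assert(emptyKey.subsetOf(aggChildren)) // and it satisfies any distinct-key requirement
```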
> This is only done in the `else` branch, not the `if` branch.
It's a good point. I will do a follow-up soon.
…or distinct key

### What changes were proposed in this pull request?

- Improve `DistinctKeyVisitor` to support propagating an empty set
- Small improvement for matching aliases

### Why are the changes needed?

Make distinct keys usable to optimize more cases, see comment #36117 (comment)

### Does this PR introduce _any_ user-facing change?

Improve performance

### How was this patch tested?

add test

Closes #36281 from ulysses-you/SPARK-38832-followup.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Make `EliminateDistinct` support eliminating distinct by the child's distinct keys.

Why are the changes needed?
We can remove the distinct in an aggregate expression if the distinct semantics are guaranteed by the child.
For example:
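A minimal illustration of the idea (not necessarily the PR's original example; it assumes a SparkSession `spark` and a table `t` with a column `a`):

```scala
// The inner SELECT DISTINCT guarantees `a` is already distinct, so the DISTINCT
// inside count() adds no information: the first query can be answered the same
// way as the second, without a separate distinct aggregation.
spark.sql("SELECT count(DISTINCT a) FROM (SELECT DISTINCT a FROM t) AS tmp").show()
spark.sql("SELECT count(a) FROM (SELECT DISTINCT a FROM t) AS tmp").show()
```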
Does this PR introduce any user-facing change?
improve performance
How was this patch tested?
add test in `EliminateDistinctSuite`
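A hedged sketch of what such a test could look like, following the catalyst DSL style visible in the snippets above (the relation, names, and expected plan are illustrative, not taken verbatim from the patch):

```scala
// Illustrative only: the child aggregate groups by `a`, so its distinct keys
// contain {a}; countDistinct(a) on top of it can then drop the DISTINCT.
val relation = LocalRelation($"a".int, $"b".int)

val query = relation
  .groupBy($"a")($"a")
  .groupBy()(countDistinct($"a") as "x")
  .analyze

val expected = relation
  .groupBy($"a")($"a")
  .groupBy()(count($"a") as "x")
  .analyze

comparePlans(Optimize.execute(query), expected)
```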