
Conversation

@ulysses-you
Contributor

What changes were proposed in this pull request?

Make EliminateDistinct support eliminating distinct aggregates using the child's distinct keys.

Why are the changes needed?

We can remove the distinct in an aggregate expression if the distinct semantics are already guaranteed by the child plan.

For example:

SELECT count(distinct c) FROM (
  SELECT c FROM t GROUP BY c
)
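
The rule exploits the fact that the inner GROUP BY c already makes c unique, so count(distinct c) can be rewritten to count(c). A minimal collections-based sketch of that equivalence (illustrative only, not Spark code):

```scala
// Once rows are grouped by c, every value of c appears exactly once,
// so counting distinct values equals counting rows.
object EliminateDistinctIdea extends App {
  val t = Seq(1, 1, 2, 3, 3)   // column c of table t
  val grouped = t.distinct     // stands in for: SELECT c FROM t GROUP BY c
  // count(distinct c) over the grouped rows == count(c) over them:
  assert(grouped.distinct.size == grouped.size)
  println("ok")
}
```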

Does this PR introduce any user-facing change?

No, it only improves performance.

How was this patch tested?

Added a test in EliminateDistinctSuite.

@github-actions github-actions bot added the SQL label Apr 8, 2022
@ulysses-you
Contributor Author

cc @wangyum @cloud-fan

.rebalance().groupBy()(countDistinct($"a") as "x", sumDistinct($"a") as "y").analyze
comparePlans(Optimize.execute(q2), q2)

// avoid removing double data type attrs
Contributor

what will go wrong if we optimize this case as well?

Contributor Author

@ulysses-you ulysses-you Apr 11, 2022

The physical Aggregate wraps float/double grouping keys in NormalizeNaNAndZero to handle NaN and -0.0, so its result might differ from the original expression?
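
For background on why the normalization exists at all, here is an illustrative JVM-level sketch (not Spark code): primitive comparison treats 0.0 and -0.0 as equal, but boxed equality, which hash-based grouping structures rely on, distinguishes them by bit pattern.

```scala
// Why -0.0 needs normalization before hash-based grouping:
// primitive == and java.lang.Double.equals disagree about 0.0 vs -0.0.
object NegZeroDemo extends App {
  val pos = 0.0
  val neg = -0.0
  assert(pos == neg)        // primitive comparison: equal
  assert(!pos.equals(neg))  // boxed equality uses doubleToLongBits: not equal
  // A hash set keyed on boxed doubles therefore sees two distinct keys:
  assert(Set[java.lang.Double](pos, neg).size == 2)
  println("ok")
}
```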

Contributor Author

Never mind, the child will do the same normalization. We do not need to consider this here.

Member

@wangyum wangyum left a comment

It doesn't work because EliminateDistinct is only executed once, and it runs before EliminateSubqueryAliases:

=== Applying Rule org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases ===
 Aggregate [count(distinct id#3L) AS count(DISTINCT id)#5L]   Aggregate [count(distinct id#3L) AS count(DISTINCT id)#5L]
!+- SubqueryAlias __auto_generated_subquery_name              +- Aggregate [id#3L], [id#3L]
!   +- Aggregate [id#3L], [id#3L]                                +- Relation default.t[id#3L] parquet
!      +- SubqueryAlias spark_catalog.default.t               
!         +- Relation default.t[id#3L] parquet   

@ulysses-you
Contributor Author

@wangyum good catch, it seems EliminateDistinct should be moved to after the Finish Analysis batch. The reason EliminateDistinct was placed before the Finish Analysis batch can be traced back to #18429 (comment). And after #29673, RewriteDistinctAggregates was moved out of the Finish Analysis batch. So I think it's fine to move this rule now.


case ae: AggregateExpression if ae.isDistinct &&
    agg.child.distinctKeys.exists(
      _.subsetOf(ExpressionSet(ae.aggregateFunction.children.filterNot(_.foldable)))) =>
Contributor

@sigmod sigmod Apr 12, 2022

Is this correct?

If input plan to this rule is:

SELECT a, count(distinct c) FROM (
   SELECT distinct a, b, c 
   FROM t
)
GROUP BY a

Will the added case branch rewrite the plan to

SELECT a, count(c) FROM (
   SELECT distinct a, b, c 
   FROM t
)
GROUP BY a

agg.child.distinctKeys is {a, b, c}
ExpressionSet(ae.aggregateFunction.children.filterNot(_.foldable)) is {c}.

Contributor Author

The distinctKeys of `distinct a, b, c` is the single combined key ExpressionSet(a, b, c), not three separate keys ExpressionSet(a), ExpressionSet(b), ExpressionSet(c).
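
To make the set-of-sets semantics concrete, here is a small Scala sketch where plain Sets of attribute names stand in for ExpressionSet (names are illustrative, not Spark code):

```scala
// distinctKeys is a set of key sets; a distinct aggregate can be eliminated
// only if SOME key set is fully covered by the aggregate's inputs.
object DistinctKeysDemo extends App {
  // SELECT DISTINCT a, b, c FROM t  =>  one combined key {a, b, c}
  val distinctKeys: Set[Set[String]] = Set(Set("a", "b", "c"))
  // count(distinct c)  =>  non-foldable aggregate inputs {c}
  val aggInputs: Set[String] = Set("c")

  val canEliminate = distinctKeys.exists(_.subsetOf(aggInputs))
  assert(!canEliminate) // {a, b, c} is not a subset of {c}: keep the DISTINCT

  // SELECT c FROM t GROUP BY c  =>  key {c}; now elimination is safe.
  assert(Set(Set("c")).exists(_.subsetOf(aggInputs)))
  println("ok")
}
```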

Contributor

You're right. I forgot distinctKeys is a set of sets.

How about:

agg.child.distinctKeys.exists(
  key => key.nonEmpty &&
    key.subsetOf(ExpressionSet(ae.aggregateFunction.children.filterNot(_.foldable))))

Alternatively, we can do a require here to make sure that we never return an empty key:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanDistinctKeys.scala#L32

Contributor Author

Makes sense, I added a require in LogicalPlanDistinctKeys.


@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ee74bd0 Apr 13, 2022
@ulysses-you
Contributor Author

thank you @cloud-fan @wangyum @sigmod

@ulysses-you ulysses-you deleted the remove-distinct branch April 13, 2022 10:40
if (conf.getConf(PROPAGATE_DISTINCT_KEYS_ENABLED)) DistinctKeyVisitor.visit(self) else Set.empty
if (conf.getConf(PROPAGATE_DISTINCT_KEYS_ENABLED)) {
  val keys = DistinctKeyVisitor.visit(self)
  require(keys.forall(_.nonEmpty))
Contributor

Do we really need this require? It looks fine to have an empty set as the distinct keys, e.g. a global aggregate without grouping keys. It means that the entire data set is distinct (it has at most one row), and EliminateDistinct is OK with an empty set in the distinct keys.

Contributor Author

@ulysses-you ulysses-you Apr 20, 2022

I think it's more about avoiding unexpected behavior. It would be a correctness issue if other operators returned an empty distinct key. And as you mentioned, the global aggregate is already optimized by EliminateDistinct and OptimizeOneRowPlan, so it's fine?

Contributor

My point is that DistinctKeyVisitor does not handle the global aggregate now, and an empty expression set is still a valid distinct key, so why do we forbid it?

Contributor Author

We have already forbidden it inside DistinctKeyVisitor. Do you think we should support that case?

Contributor

This is only done in the else branch, not the if branch. I think we have two options:

  1. keep the requirement, and add the filter in the if branch as well
  2. remove the requirement, and remove the filter from the else branch.

I prefer option 2, as I think an empty expression set does mean something as a distinct key; we should not ignore this information. It also works the same as other distinct keys:

  1. It can replace all other distinct keys as it's a subset of any expression set
  2. It can satisfy any distinct key requirement, e.g. remove unnecessary distinct in aggregate functions.
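
Both properties follow from basic set algebra; a quick Scala check (illustrative, using plain Sets in place of ExpressionSet):

```scala
// The empty set as a distinct key: it subsumes every other key and
// satisfies every distinct-key requirement.
object EmptyKeyDemo extends App {
  val empty = Set.empty[String]
  // 1. The empty set is a subset of any expression set, so it can
  //    replace all other distinct keys.
  assert(empty.subsetOf(Set("a", "b")))
  // 2. It satisfies any distinct-key requirement: exists(_.subsetOf(...))
  //    holds for every aggregate input set, even an empty one.
  assert(Set(empty).exists(_.subsetOf(Set("c"))))
  assert(Set(empty).exists(_.subsetOf(Set.empty[String])))
  println("ok")
}
```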

Contributor Author

This is only done at the else branch, not the if branch

It's a good point. I will do a follow-up soon.

cloud-fan pushed a commit that referenced this pull request Apr 22, 2022
…or distinct key

### What changes were proposed in this pull request?

- Improve `DistinctKeyVisitor` to support propagating the empty set
- Small improvement for matching aliases

### Why are the changes needed?

Make distinct keys usable to optimize more cases, see comment #36117 (comment)

### Does this PR introduce _any_ user-facing change?

Improve performance

### How was this patch tested?

Added a test.

Closes #36281 from ulysses-you/SPARK-38832-followup.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
