
Conversation


@wangyum wangyum commented Feb 24, 2022

What changes were proposed in this pull request?

  1. This PR adds a new logical plan visitor, DistinctKeyVisitor, to find all the distinct attributes in the current logical plan. For example:

    spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
    spark.sql("SELECT a, b, a % 10, a AS aliased_a, max(c), sum(b) FROM t GROUP BY a, b").queryExecution.analyzed.distinctKeys

    The output is: {a#1, b#2}, {b#2, aliased_a#0}.

  2. Enhance RemoveRedundantAggregates to remove a group-only aggregation when the child already guarantees the grouping keys are distinct (a minimal sketch of the new rule case follows the plans below). For example:

    set spark.sql.autoBroadcastJoinThreshold=-1; -- avoid PushDownLeftSemiAntiJoin
    create table t1 using parquet as select id a, id as b from range(10);
    create table t2 using parquet as select id as a, id as b from range(8);
    select t11.a, t11.b from (select distinct a, b from t1) t11 left semi join t2 on (t11.a = t2.a) group by t11.a, t11.b;

    Before this PR:

    == Optimized Logical Plan ==
    Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    +- Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
       :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
       :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
       :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
       +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
          +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
             +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
    

    After this PR:

    == Optimized Logical Plan ==
    Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
    :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
    :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
    +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
       +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
          +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
    
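A minimal sketch of the new RemoveRedundantAggregates case (names follow the diffs discussed below; the exact final shape may differ):

    // A group-only Aggregate is redundant when some distinct key of the child
    // is a subset of the grouping expressions: the child's rows are already
    // distinct on those keys, so only a Project is needed.
    case agg @ Aggregate(groupingExps, _, child)
        if agg.groupOnly && child.deterministic &&
          child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExps))) =>
      Project(agg.aggregateExpressions, child)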

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and a TPC-DS benchmark:

| SQL  | Before this PR (seconds) | After this PR (seconds) |
|------|--------------------------|-------------------------|
| q14a | 206                      | 193                     |
| q38  | 59                       | 41                      |
| q87  | 127                      | 113                     |

@github-actions github-actions bot added the SQL label Feb 24, 2022
@wangyum wangyum changed the title from "[SPARK-36194][SQL] Remove the aggregation from left semi/anti join if the same aggregation has already been done on left side" to "[SPARK-36194][SQL] Add A logical plan visitor to propagate the distinct attributes" Feb 24, 2022
* }}}
*/
trait LogicalPlanDistinctKeys { self: LogicalPlan =>
lazy val distinctKeys: Set[AttributeSet] = DistinctKeyVisitor.visit(self)
Contributor

can we add a config for this feature? If the config is off, here we just return Set.empty

Contributor

e.g. spark.sql.optimizer.propagateDistinctKeys.enabled
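For reference, a sketch of what such a definition could look like in SQLConf (the key name comes from this thread; .internal(), the doc text, and the default are assumptions):

    val PROPAGATE_DISTINCT_KEYS_ENABLED =
      buildConf("spark.sql.optimizer.propagateDistinctKeys.enabled")
        .internal()
        .doc("When true, the query optimizer will propagate a set of distinct " +
          "attributes from the current node and use it to optimize queries.")
        .booleanConf
        .createWithDefault(true)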

Member Author

OK
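With such a config, the trait shown above could gate the visitor like this (a sketch; the final code may differ):

    trait LogicalPlanDistinctKeys { self: LogicalPlan =>
      lazy val distinctKeys: Set[AttributeSet] = {
        if (conf.getConf(SQLConf.PROPAGATE_DISTINCT_KEYS_ENABLED)) {
          DistinctKeyVisitor.visit(self)
        } else {
          Set.empty
        }
      }
    }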

case ne => ne
}
keys.exists(_.equals(ExpressionSet(references)))
}.map(s => AttributeSet(s.map(_.toAttribute))).toSet
Contributor

how about

    val outputSet = ExpressionSet(projectList.map(_.toAttribute))
    val aliases = projectList.filter(_.isInstanceOf[Alias])
    if (aliases.isEmpty) return keys.filter(_.subsetOf(outputSet))

    val aliasedDistinctKeys = keys.map { expressionSet =>
      expressionSet.map { expression =>
        expression transform {
          case expr: Expression =>
            aliases
              .collectFirst { case a: Alias if a.child.semanticEquals(expr) => a.toAttribute }
              .getOrElse(expr)
        }
      }
    }
    aliasedDistinctKeys.collect {
      case es: ExpressionSet if es.subsetOf(outputSet) => ExpressionSet(es)
    }

Contributor

If one expression has multiple aliases, we need to further expand the distinct keys set. We can do it later as it's rather a corner case.
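An illustration of the corner case (hypothetical query, reusing table t from the description):

    // The child is distinct on {a}, and both aliases of a are distinct keys of
    // the projection's output, so {x} and {y} should both be propagated; a
    // single-alias substitution would only find one of them.
    spark.sql("SELECT a AS x, a AS y FROM (SELECT DISTINCT a FROM t) s")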

Member Author

OK


override def visitGenerate(p: Generate): Set[AttributeSet] = default(p)

override def visitGlobalLimit(p: GlobalLimit): Set[AttributeSet] = p.child.distinctKeys
Contributor

if the limit value is 1 or 0, all output columns are distinct.

Member Author

+1
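A sketch of that suggestion (hedged; the IntegerLiteral extraction is an assumption):

    override def visitGlobalLimit(p: GlobalLimit): Set[AttributeSet] = p.limitExpr match {
      // At most one row can be produced, so every single output attribute is
      // a distinct key on its own.
      case IntegerLiteral(value) if value <= 1 =>
        p.output.map(attr => AttributeSet(attr)).toSet
      case _ => p.child.distinctKeys
    }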


override def visitJoin(p: Join): Set[AttributeSet] = {
  p.joinType match {
    case LeftExistence(_) => p.left.distinctKeys
Contributor

shall we exclude ExistenceJoin?

Member Author

OK
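A sketch of the agreed narrowing: match LeftSemi/LeftAnti explicitly instead of the broader LeftExistence extractor, since ExistenceJoin appends an extra boolean attribute to the left output.

    case LeftSemi | LeftAnti => p.left.distinctKeys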

override def visitRepartitionByExpr(p: RepartitionByExpression): Set[AttributeSet] =
p.child.distinctKeys

override def visitSample(p: Sample): Set[AttributeSet] = default(p)
Contributor

For Sample without replacement, we can propagate the distinct keys from child.

Member Author

+1
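A sketch of the suggested Sample handling (hedged):

    override def visitSample(p: Sample): Set[AttributeSet] = {
      // Sampling without replacement only drops rows, so the child's distinct
      // keys remain valid; with replacement a row may be emitted repeatedly.
      if (!p.withReplacement) p.child.distinctKeys else default(p)
    }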

override def visitRebalancePartitions(p: RebalancePartitions): Set[AttributeSet] =
p.child.distinctKeys

override def visitWithCTE(p: WithCTE): Set[AttributeSet] = default(p)
Contributor

CTE can also propagate distinct keys from child.

Member Author

+1
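A sketch of the suggested WithCTE handling (hedged; assumes the main query is WithCTE's plan field):

    // WithCTE only wraps the main query plan, so distinct keys pass through.
    override def visitWithCTE(p: WithCTE): Set[AttributeSet] = p.plan.distinctKeys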

@cloud-fan
Contributor

Due to the significant plan changes caused by this PR, can we verify the TPC-DS performance?

@wangyum
Member Author

wangyum commented Feb 25, 2022

Due to the significant plan changes caused by this PR, can we verify the TPC-DS performance?

OK

@wangyum wangyum changed the title from "[SPARK-36194][SQL] Add A logical plan visitor to propagate the distinct attributes" to "[SPARK-36194][SQL] Add a logical plan visitor to propagate the distinct attributes" Feb 25, 2022
newAggregate
}

case agg @ Aggregate(groupingExps, _, child) if agg.groupOnly && child.deterministic &&
Contributor

does child.deterministic matter here?

Member Author

We had a test like this before:

test("Remove redundant aggregate with non-deterministic upper") {
val query = relation
.groupBy('a)('a)
.groupBy('a)('a, rand(0) as 'c)
.analyze
val expected = relation
.groupBy('a)('a, rand(0) as 'c)
.analyze
val optimized = Optimize.execute(query)
comparePlans(optimized, expected)
}

Contributor

which means child.deterministic doesn't matter? The test you posted did optimize out one aggregate.

Member Author

Sorry. This test:

test("Keep non-redundant aggregate - upper references non-deterministic non-grouping") {
val query = relation
.groupBy('a)('a, ('a + rand(0)) as 'c)
.groupBy('a, 'c)('a, 'c)
.analyze
val optimized = Optimize.execute(query)
comparePlans(optimized, query)
}

case Inner =>
  p match {
    case ExtractEquiJoinKeys(_, leftKeys, rightKeys, _, _, _, _, _)
      if p.left.distinctKeys.exists(_.subsetOf(ExpressionSet(leftKeys))) &&
Contributor

@cloud-fan cloud-fan Feb 28, 2022

I think we should use || here. If only one side has valid distinct keys, we should still propagate that side.

Member Author

spark.sql("create table t1(a int, b int) using parquet")
spark.sql("create table t2(x int, y int) using parquet")

spark.sql("insert into t1 values(1, 1), (2, 2)")
spark.sql("insert into t2 values(1, 1), (1, 1), (2, 2), (2, 2)")

spark.sql("select * from (select distinct * from t1 )t1 join t2 on t1.a = t2.x and t1.b = t2.y").show

The output is:

+---+---+---+---+
|  a|  b|  x|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  1|  1|  1|
|  2|  2|  2|  2|
|  2|  2|  2|  2|
+---+---+---+---+

The output is not distinct on (a, b) even though the left side is, so we can't simply propagate the distinct side's own keys.

      Set(ExpressionSet(leftKeys), ExpressionSet(rightKeys))
    case _ => default(p)
  }
case _ => default(p)
Contributor

for left outer, we can propagate from right side.

Member Author

spark.sql("create table t1(a int, b int) using parquet")
spark.sql("create table t2(x int, y int) using parquet")

spark.sql("insert into t1 values(1, 1), (2, 2)")
spark.sql("insert into t2 values(3, 3), (4, 4)")

spark.sql("select * from t1 left join (select distinct * from t2)t2 on t1.a = t2.x and t1.b = t2.y").show

The output is:

+---+---+----+----+
|  a|  b|   x|   y|
+---+---+----+----+
|  2|  2|null|null|
|  1|  1|null|null|
+---+---+----+----+

The (x, y) columns are (null, null) in both rows, so the right side's distinct keys are not distinct in the output of a left outer join.

Contributor

Sorry, my fault. We can propagate the left side's distinct keys if p.right.distinctKeys.exists(_.subsetOf(rightJoinKeySet)).

Member Author

+1
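A sketch of the corrected LeftOuter condition (hedged; rightJoinKeySet is assumed to be ExpressionSet(rightKeys) from the surrounding equi-join extraction):

    // If the right side is unique on its join keys, each left row appears at
    // most once in the output, so the left child's distinct keys still hold.
    case LeftOuter if p.right.distinctKeys.exists(_.subsetOf(rightJoinKeySet)) =>
      p.left.distinctKeys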


def constraintPropagationEnabled: Boolean = getConf(CONSTRAINT_PROPAGATION_ENABLED)

def propagateDistinctKeysEnabled: Boolean = getConf(PROPAGATE_DISTINCT_KEYS_ENABLED)
Contributor

it's only called once; we can inline it.

Member Author

OK

comparePlans(optimized, correctAnswer)
}

test("SPARK-36194: Negative case: The grouping expressions not same") {
Contributor

we don't require them to be the same. We need the child distinct keys to be a subset of the required grouping keys.

Member Author

Yes, updated the test name.

Seq(LeftSemi, LeftAnti).foreach { joinType =>
  val originalQuery = x.groupBy('a, 'b)('a, 'b)
    .join(y, joinType, Some("x.a".attr === "y.a".attr && "x.b".attr === "y.b".attr))
    .groupBy("x.a".attr, "x.b".attr)(TrueLiteral)
Contributor

hmmm why can't we optimize this query?

Member Author

Yes, updated RemoveRedundantAggregates to support this case:

case agg @ Aggregate(groupingExps, aggregateExps, child)
    if aggregateExps.forall(a => a.isInstanceOf[Alias] && a.children.forall(_.foldable)) &&
      child.deterministic &&
      child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExps))) =>
  Project(agg.aggregateExpressions, child)

}
}

test("SPARK-36194: Negative case: The aggregate expressions not same") {
Contributor

@cloud-fan cloud-fan Feb 28, 2022

the test name is a bit misleading. I don't think the aggregate expressions matter (as long as it's a group-only aggregate); the grouping expressions matter.

Member Author

Fixed the test name.

Project(agg.aggregateExpressions, child)

case agg @ Aggregate(groupingExps, aggregateExps, child)
if aggregateExps.forall(a => a.isInstanceOf[Alias] && a.children.forall(_.foldable)) &&
Contributor

how about a mix? e.g. SELECT a, 1, b FROM ... GROUP BY a, b, c

Member Author

case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, _, _, _, _, _)
    if p.left.distinctKeys.exists(_.subsetOf(ExpressionSet(leftKeys))) &&
      p.right.distinctKeys.exists(_.subsetOf(ExpressionSet(rightKeys))) =>
  Set(ExpressionSet(leftKeys), ExpressionSet(rightKeys))
Contributor

if p.left.distinctKeys.exists(_.subsetOf(ExpressionSet(leftKeys))), we can propagate the right side distinct keys, right?

Member Author

+1
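Putting the thread together, a sketch of where the Inner case could land (hedged; the exact shape is an assumption based on this discussion, using the ExpressionSet-based variant seen in the later diffs):

    case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, _, _, _, _, _) =>
      val leftUnique = p.left.distinctKeys.exists(_.subsetOf(ExpressionSet(leftKeys)))
      val rightUnique = p.right.distinctKeys.exists(_.subsetOf(ExpressionSet(rightKeys)))
      if (leftUnique && rightUnique) {
        // One-to-one join: both sides' join keys stay distinct in the output.
        Set(ExpressionSet(leftKeys), ExpressionSet(rightKeys))
      } else if (leftUnique) {
        // Each right row matches at most one left row, so the right side's
        // distinct keys remain distinct.
        p.right.distinctKeys
      } else if (rightUnique) {
        p.left.distinctKeys
      } else {
        default(p)
      }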

@wangyum
Member Author

wangyum commented Mar 9, 2022

New PR: #35779

This pull request was closed.