[SPARK-36194][SQL] Add a logical plan visitor to propagate the distinct attributes #35651
Conversation
…on has already been done on left side
```scala
 * }}}
 */
trait LogicalPlanDistinctKeys { self: LogicalPlan =>
  lazy val distinctKeys: Set[AttributeSet] = DistinctKeyVisitor.visit(self)
```
can we add a config for this feature? If the config is off, here we just return Set.empty
e.g. `spark.sql.optimizer.propagateDistinctKeys.enabled`
OK
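For reference, the agreed gate could look roughly like this (a sketch, not the final diff; it assumes a `propagateDistinctKeysEnabled` accessor for the proposed `spark.sql.optimizer.propagateDistinctKeys.enabled` flag, and that the plan can reach the session conf via `SQLConfHelper`):

```scala
trait LogicalPlanDistinctKeys { self: LogicalPlan =>
  lazy val distinctKeys: Set[AttributeSet] = {
    // With the flag off, report no distinct keys so the optimizer rules
    // that consume them (e.g. RemoveRedundantAggregates) stay inert.
    if (conf.propagateDistinctKeysEnabled) DistinctKeyVisitor.visit(self) else Set.empty
  }
}
```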
```scala
    case ne => ne
  }
  keys.exists(_.equals(ExpressionSet(references)))
}.map(s => AttributeSet(s.map(_.toAttribute))).toSet
```
how about

```scala
val outputSet = ExpressionSet(projectList.map(_.toAttribute))
val aliases = projectList.filter(_.isInstanceOf[Alias])
if (aliases.isEmpty) return keys.filter(_.subsetOf(outputSet))
val aliasedDistinctKeys = keys.map { expressionSet =>
  expressionSet.map { expression =>
    expression transform {
      case expr: Expression =>
        aliases
          .collectFirst { case a: Alias if a.child.semanticEquals(expr) => a.toAttribute }
          .getOrElse(expr)
    }
  }
}
aliasedDistinctKeys.collect {
  case es: ExpressionSet if es.subsetOf(outputSet) => ExpressionSet(es)
}
```
If one expression has multiple aliases, we need to further expand the distinct keys set. We can do it later as it's rather a corner case.
OK
```scala
override def visitGenerate(p: Generate): Set[AttributeSet] = default(p)

override def visitGlobalLimit(p: GlobalLimit): Set[AttributeSet] = p.child.distinctKeys
```
if the limit value is 1 or 0, all output columns are distinct.
+1
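The follow-up could special-case small limits along these lines (a sketch only; it assumes Catalyst's `IntegerLiteral` extractor and `GlobalLimit.limitExpr`):

```scala
override def visitGlobalLimit(p: GlobalLimit): Set[AttributeSet] = {
  p.limitExpr match {
    // With at most one output row, any set of output columns is trivially
    // distinct; record the full output set as a distinct key.
    case IntegerLiteral(value) if value <= 1 => Set(p.outputSet)
    case _ => p.child.distinctKeys
  }
}
```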
```scala
override def visitJoin(p: Join): Set[AttributeSet] = {
  p.joinType match {
    case LeftExistence(_) => p.left.distinctKeys
```
shall we exclude ExistenceJoin?
OK
```scala
override def visitRepartitionByExpr(p: RepartitionByExpression): Set[AttributeSet] =
  p.child.distinctKeys

override def visitSample(p: Sample): Set[AttributeSet] = default(p)
```
For Sample without replacement, we can propagate the distinct keys from child.
+1
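A sketch of the suggested change (hedged; it assumes Catalyst's `Sample.withReplacement` field):

```scala
override def visitSample(p: Sample): Set[AttributeSet] = {
  // Sampling without replacement only drops rows, so any key set that was
  // distinct in the child stays distinct; with replacement rows can repeat.
  if (!p.withReplacement) p.child.distinctKeys else default(p)
}
```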
```scala
override def visitRebalancePartitions(p: RebalancePartitions): Set[AttributeSet] =
  p.child.distinctKeys

override def visitWithCTE(p: WithCTE): Set[AttributeSet] = default(p)
```
CTE can also propagate distinct keys from child.
+1
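A sketch of the suggested change (assuming `WithCTE.plan` is the main query plan, as in Catalyst):

```scala
override def visitWithCTE(p: WithCTE): Set[AttributeSet] =
  // WithCTE only attaches CTE definitions; its output rows come from the
  // main query plan, so that plan's distinct keys carry over unchanged.
  p.plan.distinctKeys
```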
Due to the significant plan changes caused by this PR, can we verify the TPC-DS performance?
OK
```scala
    newAggregate
  }

case agg @ Aggregate(groupingExps, _, child) if agg.groupOnly && child.deterministic &&
```
does child.deterministic matter here?
We had a test like this before (Lines 104 to 114 in a34e2af):

```scala
test("Remove redundant aggregate with non-deterministic upper") {
  val query = relation
    .groupBy('a)('a)
    .groupBy('a)('a, rand(0) as 'c)
    .analyze
  val expected = relation
    .groupBy('a)('a, rand(0) as 'c)
    .analyze
  val optimized = Optimize.execute(query)
  comparePlans(optimized, expected)
}
```
which means child.deterministic doesn't matter? The test you posted did optimize out one aggregate.
Sorry. This test (Lines 164 to 171 in f0db8ec):

```scala
test("Keep non-redundant aggregate - upper references non-deterministic non-grouping") {
  val query = relation
    .groupBy('a)('a, ('a + rand(0)) as 'c)
    .groupBy('a, 'c)('a, 'c)
    .analyze
  val optimized = Optimize.execute(query)
  comparePlans(optimized, query)
}
```
```scala
case Inner =>
  p match {
    case ExtractEquiJoinKeys(_, leftKeys, rightKeys, _, _, _, _, _)
        if p.left.distinctKeys.exists(_.subsetOf(ExpressionSet(leftKeys))) &&
```
I think we should use `||` here. If only one side has valid distinct keys, we should still propagate that side.
```scala
spark.sql("create table t1(a int, b int) using parquet")
spark.sql("create table t2(x int, y int) using parquet")
spark.sql("insert into t1 values(1, 1), (2, 2)")
spark.sql("insert into t2 values(1, 1), (1, 1), (2, 2), (2, 2)")
spark.sql("select * from (select distinct * from t1 )t1 join t2 on t1.a = t2.x and t1.b = t2.y").show
```

The output is:

```
+---+---+---+---+
|  a|  b|  x|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  1|  1|  1|
|  2|  2|  2|  2|
|  2|  2|  2|  2|
+---+---+---+---+
```

We can't distinguish the distinct keys.
```scala
      Set(ExpressionSet(leftKeys), ExpressionSet(rightKeys))
    case _ => default(p)
  }
case _ => default(p)
```
for left outer, we can propagate from right side.
```scala
spark.sql("create table t1(a int, b int) using parquet")
spark.sql("create table t2(x int, y int) using parquet")
spark.sql("insert into t1 values(1, 1), (2, 2)")
spark.sql("insert into t2 values(3, 3), (4, 4)")
spark.sql("select * from t1 left join (select distinct * from t2)t2 on t1.a = t2.x and t1.b = t2.y").show
```

The output is:

```
+---+---+----+----+
|  a|  b|   x|   y|
+---+---+----+----+
|  2|  2|null|null|
|  1|  1|null|null|
+---+---+----+----+
```

We can't distinguish the distinct keys.
Sorry, my fault. We can propagate the left side's distinct keys if `p.right.distinctKeys.exists(_.subsetOf(rightJoinKeySet))`.
+1
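The agreed left-outer handling could be sketched like this (hedged; the `ExtractEquiJoinKeys` pattern mirrors the Inner case earlier in this thread):

```scala
case LeftOuter =>
  p match {
    case ExtractEquiJoinKeys(_, _, rightKeys, _, _, _, _, _)
        if p.right.distinctKeys.exists(_.subsetOf(ExpressionSet(rightKeys))) =>
      // The right side is distinct on its join keys, so each left row
      // matches at most one right row and the left-side keys stay distinct.
      p.left.distinctKeys
    case _ => default(p)
  }
```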
```scala
def constraintPropagationEnabled: Boolean = getConf(CONSTRAINT_PROPAGATION_ENABLED)

def propagateDistinctKeysEnabled: Boolean = getConf(PROPAGATE_DISTINCT_KEYS_ENABLED)
```
It's only called once; we can inline it.
OK
```scala
  comparePlans(optimized, correctAnswer)
}

test("SPARK-36194: Negative case: The grouping expressions not same") {
```
we don't require them to be the same. We need the child distinct keys to be a subset of the required grouping keys.
Yes, updated the test name.
```scala
Seq(LeftSemi, LeftAnti).foreach { joinType =>
  val originalQuery = x.groupBy('a, 'b)('a, 'b)
    .join(y, joinType, Some("x.a".attr === "y.a".attr && "x.b".attr === "y.b".attr))
    .groupBy("x.a".attr, "x.b".attr)(TrueLiteral)
```
hmmm why can't we optimize this query?
Yes. Updated RemoveRedundantAggregates to support this case (Lines 56 to 60 in f0db8ec):

```scala
case agg @ Aggregate(groupingExps, aggregateExps, child)
    if aggregateExps.forall(a => a.isInstanceOf[Alias] && a.children.forall(_.foldable)) &&
      child.deterministic &&
      child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExps))) =>
  Project(agg.aggregateExpressions, child)
```
```scala
  }
}

test("SPARK-36194: Negative case: The aggregate expressions not same") {
```
the test name is a bit misleading. I don't think aggregate expressions matter (as long as it's a group only aggregate), grouping expressions matter.
Fixed the test name.
```scala
  Project(agg.aggregateExpressions, child)

case agg @ Aggregate(groupingExps, aggregateExps, child)
    if aggregateExps.forall(a => a.isInstanceOf[Alias] && a.children.forall(_.foldable)) &&
```
how about a mix? e.g. SELECT a, 1, b FROM ... GROUP BY a, b, c
We already support this case; added a new test case:
https://github.com/apache/spark/pull/35779/files#diff-7cd933ffc7b9ce86d5973bee80f4a5bd4a021c0f0ff81defe1f020bcb55b4b3bR153-R159
```scala
case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, _, _, _, _, _)
    if p.left.distinctKeys.exists(_.subsetOf(ExpressionSet(leftKeys))) &&
      p.right.distinctKeys.exists(_.subsetOf(ExpressionSet(rightKeys))) =>
  Set(ExpressionSet(leftKeys), ExpressionSet(rightKeys))
```
if `p.left.distinctKeys.exists(_.subsetOf(ExpressionSet(leftKeys)))`, we can propagate the right side distinct keys, right?
+1
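Combining both directions, the relaxed Inner case could be sketched as follows (a sketch, not the merged diff; it assumes `distinctKeys` is typed `Set[ExpressionSet]`, consistent with the `Set(ExpressionSet(leftKeys), ExpressionSet(rightKeys))` result above):

```scala
case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, _, _, _, _, _) =>
  val leftJoinKeySet = ExpressionSet(leftKeys)
  val rightJoinKeySet = ExpressionSet(rightKeys)
  var keys = Set.empty[ExpressionSet]
  // A side that is distinct on its join keys matches each row of the other
  // side at most once, so the other side's distinct keys are preserved.
  if (p.left.distinctKeys.exists(_.subsetOf(leftJoinKeySet))) {
    keys ++= p.right.distinctKeys
  }
  if (p.right.distinctKeys.exists(_.subsetOf(rightJoinKeySet))) {
    keys ++= p.left.distinctKeys
  }
  if (keys.nonEmpty) keys else default(p)
```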
New PR: #35779
What changes were proposed in this pull request?
This PR adds a new logical plan visitor named `DistinctKeyVisitor` to find all the distinct attributes in the current logical plan. For example, the output is: {a#1, b#2}, {b#2, aliased_a#0}.
It also enhances `RemoveRedundantAggregates` to remove an aggregation if it is group-only and the child can already guarantee distinctness.
Why are the changes needed?
Improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests and TPC-DS benchmark tests.