
Conversation

@wangyum (Member) commented Jul 17, 2021

What changes were proposed in this pull request?

  1. This PR adds a new logical plan visitor, DistinctAttributesVisitor, to find all the distinct attributes in the current logical plan. For example:

    spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
    spark.sql("SELECT a, b, a % 10, a AS aliased_a, max(c), sum(b) FROM t GROUP BY a, b").queryExecution.analyzed.distinctKeys

    The output is: {a#1, b#2}, {b#2, aliased_a#0}.

  2. Enhance RemoveRedundantAggregates to remove the aggregation above a left semi/anti join if the same aggregation has already been performed on the left side (a sketch of the rewrite follows the plans below). For example:

    set spark.sql.autoBroadcastJoinThreshold=-1; -- avoid PushDownLeftSemiAntiJoin
    create table t1 using parquet as select id a, id as b from range(10);
    create table t2 using parquet as select id as a, id as b from range(8);
    select t11.a, t11.b from (select distinct a, b from t1) t11 left semi join t2 on (t11.a = t2.a) group by t11.a, t11.b;

    Before this PR:

    == Optimized Logical Plan ==
    Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    +- Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
       :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
       :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
       :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
       +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
          +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
             +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
    

    After this PR:

    == Optimized Logical Plan ==
    Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
    :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
    :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
    :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
    +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
       +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
          +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
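
    The rewrite can be sketched as follows. This is a minimal illustration of the shape of the rule, not the exact patch: the object name is hypothetical, and it assumes the distinctKeys property introduced by this PR.

    import org.apache.spark.sql.catalyst.expressions.ExpressionSet
    import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
    import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi}
    import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical sketch; the real change extends RemoveRedundantAggregates.
    object RemoveAggregateAboveSemiAntiJoinSketch extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
        // A grouping-only Aggregate above a left semi/anti join is redundant when
        // the join's left side already guarantees distinctness on the grouping
        // keys: the join only filters left-side rows and never duplicates them.
        case Aggregate(groupingExprs, aggExprs, join @ Join(left, _, LeftSemi | LeftAnti, _, _))
            if aggExprs.forall(_.collectFirst { case _: AggregateExpression => () }.isEmpty) &&
              left.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExprs))) =>
          Project(aggExprs, join)
      }
    }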
    

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and TPC-DS benchmark runs:

SQL    Before this PR (seconds)    After this PR (seconds)
q14a   174                         165
q38    26                          23
q87    30                          26

@github-actions github-actions bot added the SQL label Jul 17, 2021
@SparkQA commented Jul 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45714/

@dongjoon-hyun (Member)

Thank you, @wangyum!

@SparkQA commented Jul 17, 2021

Test build #141202 has finished for PR 33404 at commit 5486d64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Also, cc @cloud-fan, @maropu, @viirya

@dongjoon-hyun (Member)

Thank you for adding more test cases, @wangyum.

@SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45744/

@SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45744/

@SparkQA commented Jul 19, 2021

Test build #141230 has finished for PR 33404 at commit 86a828a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46080/

@SparkQA commented Jul 23, 2021

Test build #141562 has finished for PR 33404 at commit a004207.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46110/

@SparkQA commented Jul 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46110/

@SparkQA commented Jul 24, 2021

Test build #141593 has finished for PR 33404 at commit 02a0bf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

We should use Set[ExpressionSet]. If we group by a, b and then select a, b, a as c, then the distinct keys should be Set([a, b], [c, b]).
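
For example (a hypothetical check using the distinctKeys API from this PR; the table name is illustrative):

    spark.sql("CREATE TABLE td(a int, b int) USING parquet")
    spark.sql("SELECT a, b, a AS c FROM td GROUP BY a, b").queryExecution.analyzed.distinctKeys
    // Expected: Set(ExpressionSet(a, b), ExpressionSet(c, b)) -- both key sets
    // uniquely identify a row, because c is just an alias of the grouping key a.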

@SparkQA commented Jul 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46272/

@SparkQA commented Jul 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46272/

@SparkQA commented Jul 28, 2021

Test build #141760 has finished for PR 33404 at commit 558aa31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46321/

@SparkQA commented Jul 29, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46321/

@SparkQA commented Jul 29, 2021

Test build #141808 has finished for PR 33404 at commit eb71b8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Aug 3, 2021

Another similar query, where the outer DISTINCT is redundant because the GROUP BY output is already distinct on the first two columns:

select distinct STATUS,RecordTypeId, count(oracle_id) as cnt_id from t group by 1,2

Contributor

Does it have to be join-specific? It looks like it should be able to handle any node. Ideally we could remove the entire case upper @ Aggregate(_, _, lower: Aggregate) branch.

Contributor

Ah, I now noticed that it was already discussed.

Contributor

I think we can generalize it. We can leverage the propagated distinct keys and remove a group-only aggregate (or turn it into a project) if the grouping columns are already distinct. A sketch follows.
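
A minimal sketch of that generalization, assuming the distinctKeys property from this PR (the object name is hypothetical):

    import org.apache.spark.sql.catalyst.expressions.ExpressionSet
    import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
    import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.rules.Rule

    object RemoveGroupOnlyAggregateSketch extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
        // A group-only Aggregate is a no-op when the child already guarantees
        // distinctness on the grouping columns, so it can become a Project.
        case Aggregate(groupingExprs, aggExprs, child)
            if aggExprs.forall(_.collectFirst { case _: AggregateExpression => () }.isEmpty) &&
              child.distinctKeys.exists(_.subsetOf(ExpressionSet(groupingExprs))) =>
          Project(aggExprs, child)
      }
    }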

Contributor

I'd prefer this:

  /**
   * Add a new ExpressionSet S into the distinct keys D.
   * To minimize the size of D:
   * 1. If D already contains a subset of S, return D unchanged.
   * 2. Otherwise, remove every ExpressionSet in D that contains S, then add S.
   */
  private def addDistinctKey(
      keys: DistinctKeys,
      newExpressionSet: ExpressionSet): DistinctKeys = {
    if (keys.exists(_.subsetOf(newExpressionSet))) {
      keys
    } else {
      keys.filterNot(s => newExpressionSet.subsetOf(s)) + newExpressionSet
    }
  }

  /**
   * Propagate distinct keys through a project list.
   * For each alias in the project list, replace the corresponding expression in distinctKeys.
   * To minimize the size of distinctKeys, drop every ExpressionSet that is not a subset of
   * the project list's output.
   */
  private def projectDistinctKeys(
      keys: DistinctKeys,
      projectList: Seq[NamedExpression]): DistinctKeys = {
    val outputSet = ExpressionSet(projectList.map(_.toAttribute))
    val aliases = projectList.filter(_.isInstanceOf[Alias])
    if (aliases.isEmpty) return keys.filter(_.subsetOf(outputSet))

    val aliasedDistinctKeys = keys.map { expressionSet =>
      expressionSet.map { expression =>
        expression transform {
          case expr: Expression =>
            aliases
              .collectFirst { case a: Alias if a.child.semanticEquals(expr) => a.toAttribute }
              .getOrElse(expr)
        }
      }
    }
    aliasedDistinctKeys.collect {
      case es: ExpressionSet if es.subsetOf(outputSet) => ExpressionSet(es)
    }
  }

  override def visitAggregate(p: Aggregate): Set[ExpressionSet] = {
    val distinctKeysWithGrouping =
      addDistinctKey(p.child.distinctKeys, ExpressionSet(p.groupingExpressions))
    projectDistinctKeys(distinctKeysWithGrouping, p.aggregateExpressions)
  }
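
A hypothetical walk-through of those helpers, to make the intended behavior concrete (attribute names are illustrative):

    // Suppose the child reports distinctKeys = Set([a, b, c]) and we run
    // SELECT a, b ... GROUP BY a, b on top of it:
    //
    // addDistinctKey(Set([a, b, c]), [a, b])
    //   -> no existing key is a subset of [a, b], so drop its supersets
    //      ([a, b, c] is removed) and add it: Set([a, b])
    //
    // projectDistinctKeys(Set([a, b]), Seq(a, b))
    //   -> there are no aliases and [a, b] is a subset of the output {a, b},
    //      so the result is Set([a, b])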

@wangyum (Member Author) Oct 30, 2021

spark.sql("create table t1 (a int, b int, c int) using parquet")
spark.sql("select a, b, a as e, b as f from t1 group by a, b").queryExecution.analyzed.distinctKeys

For such a query, which distinct keys do you prefer?

Set(ExpressionSet(e#0, f#1))

or

Set(ExpressionSet(a#2, b#3), ExpressionSet(a#2, f#1), ExpressionSet(b#3, e#0), ExpressionSet(e#0, f#1))

Contributor

The latter.

Member Author

I updated the code as follows to support it:

  override def visitAggregate(p: Aggregate): Set[ExpressionSet] = {
    val groupingExps = ExpressionSet(p.groupingExpressions) // handle group by a, a
    val aggExpressions = p.aggregateExpressions.filter {
      case _: Attribute | _: Alias => true
      case _ => false
    }

    aggExpressions.toSet.subsets(groupingExps.size).filter { s =>
      groupingExps.subsetOf(ExpressionSet(s.map {
        case a: Alias => a.child
        case o => o
      }))
    }.map(s => ExpressionSet(s.map(_.toAttribute))).toSet
  }
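
Applied to the earlier example query, this enumerates as follows:

    // select a, b, a as e, b as f from t1 group by a, b
    // groupingExps   = ExpressionSet(a, b)        -- size 2
    // aggExpressions = Seq(a, b, a AS e, b AS f)  -- all attributes or aliases
    // 2-element subsets whose unaliased children cover {a, b}:
    //   {a, b}, {a, b AS f}, {b, a AS e}, {a AS e, b AS f}
    // Result: Set(ExpressionSet(a, b), ExpressionSet(a, f),
    //             ExpressionSet(b, e), ExpressionSet(e, f))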

Contributor

Sample can also propagate the distinct keys from its child.

Member Author

Fixed.

Contributor

Why not define this in DistinctKeyVisitor?

Member Author

Moved it to DistinctKeyVisitor.

@SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49282/

@SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49282/

@SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49284/

@SparkQA commented Nov 1, 2021

Test build #144812 has finished for PR 33404 at commit 92be175.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49284/

@SparkQA commented Nov 1, 2021

Test build #144814 has finished for PR 33404 at commit fc11208.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49292/

@SparkQA commented Nov 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49292/

@SparkQA commented Nov 2, 2021

Test build #144823 has finished for PR 33404 at commit f1dec16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 11, 2022
@cloud-fan cloud-fan removed the Stale label Feb 11, 2022
@cloud-fan (Contributor)

@wangyum do you have time to revisit this and make it pass all tests?

Contributor

nit: DistinctKeyVisitor seems a simpler and more general name

Contributor

BTW, I don't think the order of ExpressionSet matters, and Set[ExpressionSet] is better

Contributor

And shall we make it a trait so that LogicalPlan can extend it directly? Then we don't need LogicalPlanDistinctKeys. I think it's simpler if we only have one visitor implementation for distinct keys, which should be true.

Member Author

+1 to make it a trait.

Contributor

I think we should propagate the distinct keys from the child as well. This should be done in other places too, so we need to add a method for it, e.g.

 def addDistinctKey(keys: Set[AttributeSet], newExpressionSet: ExpressionSet): Set[AttributeSet]

The idea is: if keys already implies newExpressionSet (e.g. keys contains [a, b] and newExpressionSet is [a, b, c]), we can ignore newExpressionSet. Otherwise, we should clean up keys and add newExpressionSet: e.g. if keys contains [a, b, c] and newExpressionSet is [a, b], we should remove [a, b, c] from keys.
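
As a stand-alone model of that contract, here is a minimal sketch using plain Sets of attribute names in place of Catalyst's ExpressionSet (illustrative only):

    // keys "implies" newKey when some existing key is a subset of newKey.
    def addDistinctKey(keys: Set[Set[String]], newKey: Set[String]): Set[Set[String]] =
      if (keys.exists(_.subsetOf(newKey))) keys        // already implied: ignore newKey
      else keys.filterNot(newKey.subsetOf(_)) + newKey // drop supersets of newKey, then add it

    assert(addDistinctKey(Set(Set("a", "b")), Set("a", "b", "c")) == Set(Set("a", "b")))
    assert(addDistinctKey(Set(Set("a", "b", "c")), Set("a", "b")) == Set(Set("a", "b")))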

Contributor

Finally, we can simply write val distinctKeys = addDistinctKey(p.child.distinctKeys, ExpressionSet(p.groupingExpressions)) here.

Member Author

Are the distinct attributes related to the child's distinct attributes? For example:

create table t(a int, b int, c int, d int, e int) using parquet;
select a, b, c, sum(d) from (select distinct * from t) t1 group by a, b, c;

Contributor

This is mostly for the project list, so maybe projectDistinctKeys is a better name: def projectDistinctKeys(keys: Set[ExpressionSet], projectList: Seq[NamedExpression]): Set[ExpressionSet]

Member Author

+1

@cloud-fan (Contributor) Feb 21, 2022

Generating all the subsets looks quite inefficient. I think the logic here is:

  1. produce correct distinct keys w.r.t. the alias mapping in the project list;
  2. prune invalid distinct keys that are not output by the project list.
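
A small plain-Scala model of those two steps, with attribute names standing in for expressions (illustrative only):

    // SELECT a, b AS x: the alias map rewrites b to x; the output set is {a, x}.
    val aliasMap  = Map("b" -> "x")
    val output    = Set("a", "x")
    val keys      = Set(Set("a", "b"), Set("a", "c"))
    val projected = keys
      .map(_.map(e => aliasMap.getOrElse(e, e))) // 1. apply the alias mapping
      .filter(_.subsetOf(output))                // 2. prune keys not fully in the output
    // projected == Set(Set("a", "x")): {a, c} is dropped because c is not output.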

Member Author

Add a filter before building the subsets:

    val expressions = keys.flatMap(_.toSet)
    projectList.filter {
      case a: Alias => expressions.exists(_.semanticEquals(a.child))
      case ne: NamedExpression => expressions.exists(_.semanticEquals(ne))
    }.toSet.subsets(keys.map(_.size).min).filter { s =>
      val references = s.map {
        case a: Alias => a.child
        case ne => ne
      }
      keys.exists(_.equals(ExpressionSet(references)))
    }.map(s => AttributeSet(s.map(_.toAttribute))).toSet

import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeSet, ExpressionSet, NamedExpression}


trait QueryPlanDistinctKeys {
Contributor

I prefer something like this:

trait DistinctKeyVisitor extends LogicalPlanVisitor[Set[ExpressionSet]] { self: LogicalPlan =>

  lazy val distinctKeys: DistinctKeys = {
    if (check conf) {
      visit(self)
    } else {
      default(self)
    }
  }

  def visitXXX
}

Contributor

The benefit is that we can centralize the distinct key logic in this file, not in many LogicalPlan classes.

Member Author

New pull request: #35651

This pull request was closed.
