Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Jun 29, 2021

What changes were proposed in this pull request?

This PR uses 2 ideas to make EquivalentExpressions more efficient:

  1. do not keep all the equivalent expressions, we only need a count
  2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in addCommonExprs

This PR also fixes several small bugs (exposed by the refactoring), please see PR comments.

Why are the changes needed?

code cleanup and small perf improvement

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@github-actions github-actions bot added the SQL label Jun 29, 2021
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This partially solves the perf issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455

By filtering with height first, we can reduce the data to iterate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #33281 to improve it further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we can trigger this bug with some real queries, but it's an obvious bug to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we wrongly recurse into the children of CodegenFallback, it only produces unused subexpressions. Some redundant generated codes, i.e..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to backport this part into branch-3.1/3.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@cloud-fan cloud-fan Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes #30245 (comment)

Basically it takes all the conditions as the commonChildrenToRecurse, so that we only get the common expressions that appear in all the conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kimahriman I think this fix works? The only drawback is, if there are common subexpressions among the conditions, they will always be counted as "appear twice" and gets codegened into methods.

I think the perf overhead is really small, and if the first condition is false, we evaluate the next condition which gives perf improvement because of common subexpressions elimination.

For the value branches of CaseWhen, I don't touch them in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this definitely fixes a potential bug of creating subexpressions for things that are never evaluated, same with the coalesce update. I think the values are already handled fine, it's just the conditionals that had an issue with short circuiting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixed #30245 (comment).

The only drawback is, if there are common subexpressions among the conditions, they will always be counted as "appear twice" and gets codegened into methods.

I just don't get this. You mean for If(a + b > 1, 1, a + b + c > 1, 2, a + b + c > 2, 3), a + b + c will be counted twice and considered as common subexpression?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think he means in CaseWhen(a + b > 1, 1, a + b + c > 1, 2), a + b will be a subexpression even though it might only be executed once.

Copy link
Contributor

@Kimahriman Kimahriman Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But CaseWhen(a + b > 1, 1, a + b + c > 1, 2, a + b + c > 0, 3), a + b + c won't even be considered for a subexpression if it's seen elsewhere, which was the bug if CaseWhen supports short circuiting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because the first condition of CaseWhen is in both childrenToRecurse and commonChildrenToRecurse

@cloud-fan
Copy link
Contributor Author

cc @viirya @maropu

@SparkQA
Copy link

SparkQA commented Jun 29, 2021

Test build #140399 has finished for PR 33142 at commit e1362da.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExpressionEquals(e: Expression)
  • case class ExpressionStats(expr: Expression)(

@viirya
Copy link
Member

viirya commented Jun 29, 2021

track the "height" of common subexpressions, to quickly do child-parent sort.

About this, I think the sorting is not reliable as it is hard to do child-parent sort. I have another proposal to get rid of the sort as I mentioned before.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Jun 30, 2021

Can you briefly introduce your idea? Sorting by height is stable and fast now.

And I need the height anyway in #33142 (comment)

@viirya
Copy link
Member

viirya commented Jun 30, 2021

Can you briefly introduce your idea? Sorting by height is stable and fast now.

Basically, the steps are:

  1. Propagate the SubExprEliminationState map for all subexprs (no needed to be sorted). Only create the value and isNull variables, don't do codegen yet.
  2. Iterate all subexprs to do codegen. Because expression codegen will look at the map to replace subexprs, any subexpr in children will be replaced and chained. So we don't need to sort subexprs in advance.

@SparkQA
Copy link

SparkQA commented Jun 30, 2021

Test build #140441 has finished for PR 33142 at commit f925cc9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExpressionEquals(e: Expression)
  • case class ExpressionStats(expr: Expression)(
  • case class SubExprEliminationState(eval: ExprCode, children: Seq[SubExprEliminationState])

@SparkQA
Copy link

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44955/

@viirya
Copy link
Member

viirya commented Jun 30, 2021

Can you briefly introduce your idea? Sorting by height is stable and fast now.

I've not looked in the details yet. Is sorting by height guaranteed to sort expressions by child-parent? I said current sorting is not reliable because it might miss some cases probably. It is because two expressions with no child-parent relation has no clear comparison order. So sorting is somehow unreliable for expressions. Does sorting by height solve it?

* Instead of appending to a mutable list/buffer of Expressions, just update the "flattened"
* useCount in this wrapper in-place.
*
* This also tracks the "height" of the expression, so that we can return expressions with smaller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "height" of the expression -> track the "height" of common subexpressions?

if (!skip && !addExprToMap(expr, map)) {
val height = childrenToRecurse(expr).map(addExprTree0(_, map))
.reduceOption(_ max _).map(_ + 1).getOrElse(0)
map(ExpressionEquals(expr)).height = height
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(My comment is the same with the @viirya one) we can always judge if an expr is a parent of another expr or not from this height? It seems this height depends on a map state, so a true height value can change after the assignment? For this purpose, we cannot simply use the height of an expression instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that @viirya will remove the sort, I think this issue doesn't matter now (not worse than before). In addCommonExprs, I only use this height to do filtering, which should be OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this purpose, we cannot simply use the height of an expression instead?

makes sense. A true height (from root to the furthest leaf) is good enough to quickly check child-parent relationship.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to backport this part into branch-3.1/3.0?

@cloud-fan
Copy link
Contributor Author

So sorting is somehow unreliable for expressions. Does sorting by height solve it?

As long as the sorting algorithm is stable (i.e. retain the original input order), it's stable. It's similar to sort strings by length.

assert(a.hashCode != b3.hashCode)
assert(a.semanticEquals(b3))
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file is to adapt the new API in EquivalentExpressions. e.g.
getAllEquivalentExprs() -> getAllExprStates
getEquivalentExprs(oneA).size == 1 -> getExprState(oneA).get.useCount == 1
getEquivalentExprs(oneA).exists(_ eq oneA) -> getExprState(oneA).exists(_.expr eq oneA)

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44992/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44992/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Test build #140480 has finished for PR 33142 at commit 92786a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExpressionStats(expr: Expression)(var useCount: Int = 1)

exprSet.intersect(otherExprSet)
map: mutable.HashMap[ExpressionEquals, ExpressionStats]): Unit = {
assert(exprs.length > 1)
var localEquivalenceMap = mutable.HashMap.empty[ExpressionEquals, ExpressionStats]
Copy link
Contributor

@cfmcgrady cfmcgrady Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also fixed that, previously, for Or(Coalesce(expr1, expr2, expr2), Coalesce(expr1, expr2, expr2)), expr2 will be extracted and considered as a common subexpression. Currently, no subexpression will be extracted.

@viirya
Copy link
Member

viirya commented Jul 3, 2021

@maropu Any more comments? Otherwise I will merge this tomorrow. Thanks.

@viirya
Copy link
Member

viirya commented Jul 3, 2021

Thanks! Merging to master/branch-3.2.

viirya pushed a commit that referenced this pull request Jul 3, 2021
…icient

### What changes were proposed in this pull request?

This PR uses 2 ideas to make `EquivalentExpressions` more efficient:
1. do not keep all the equivalent expressions, we only need a count
2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in `addCommonExprs`

This PR also fixes several small bugs (exposed by the refactoring), please see PR comments.

### Why are the changes needed?

code cleanup and small perf improvement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33142 from cloud-fan/codegen.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit e6ce220)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
@viirya viirya closed this in e6ce220 Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants