@peter-toth
Contributor

@peter-toth peter-toth commented Jul 9, 2021

What changes were proposed in this pull request?

This PR:

  • Fixes the performance issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455, which was only partially fixed in https://github.com/apache/spark/pull/33142/files#r660897248, by adding a new option to remove expressions from equivalence maps.
  • Fixes a bug with identifying common expressions in conditional expressions (a side effect of the new approach above). After this PR, add is identified as a common subexpression in the following example:
    val add = Add(Literal(1), Literal(2))
    val ifExpr1 = If(Literal(true), add, Literal(3))
    val ifExpr3 = If(GreaterThan(add, Literal(4)), Add(ifExpr1, add), Multiply(ifExpr1, add))
    val equivalence = new EquivalentExpressions
    equivalence.addExprTree(ifExpr3)
    
  • Fixes a bug where transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions.
    After this PR, transparent is not reported as a common subexpression in the following example:
    val add = Add(Literal(1), Literal(2))
    val transparent = PromotePrecision(add)
    val equivalence = new EquivalentExpressions
    equivalence.addExprTree(transparent)
    

Why are the changes needed?

Bugfix + performance improvement.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing + new UTs.

@github-actions github-actions bot added the SQL label Jul 9, 2021
@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45363/

@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45363/

@peter-toth peter-toth changed the title [SPARK-36073][SQL] SubExpr elimination should include common child exprs of conditional expressions [WIP][SPARK-36073][SQL] SubExpr elimination should include common child exprs of conditional expressions Jul 9, 2021
@SparkQA

SparkQA commented Jul 9, 2021

Test build #140857 has started for PR 33281 at commit 7786c0c.

@SparkQA

SparkQA commented Jul 9, 2021

Test build #140852 has finished for PR 33281 at commit 102942c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExpressionStats(expr: Expression)(var useCount: Int)

@SparkQA

SparkQA commented Jul 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45368/

@peter-toth peter-toth changed the title [WIP][SPARK-36073][SQL] SubExpr elimination should include common child exprs of conditional expressions [SPARK-36073][SQL] EquivalentExpressions fixes and improvement Jul 10, 2021
@peter-toth peter-toth changed the title [SPARK-36073][SQL] EquivalentExpressions fixes and improvement [SPARK-36073][SQL] EquivalentExpressions fixes and improvements Jul 10, 2021
@peter-toth
Contributor Author

cc @cloud-fan, @Kimahriman, @maropu, @viirya

Contributor

What was the behavior of this?

Contributor Author

@peter-toth peter-toth Jul 10, 2021

This test actually shows that the new updateExprTree(stats.expr, localEquivalenceMap, -stats.useCount) approach fixes a bug as well.
Before this PR, equivalence.getAllExprStates(1) didn't return anything because the notChild filter at https://github.com/apache/spark/pull/33281/files#diff-4d8c210a38fc808fef3e5c966b438591f225daa3c9fd69359446b94c351aa11eL90-L92 filtered out all child expressions (including add) of the common expression ifExpr1. But it should filter out only the children "defined by" childrenToRecurse() and commonChildrenToRecurse(). In this PR, when we remove ifExpr1 from localEquivalenceMap, we keep add in localEquivalenceMap. We then add add to map (in the 2nd iteration of the loop), so add ends up with useCount = 2.
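
The counting described in this comment can be sketched with a small, self-contained toy model. This is not Spark's code: the Expr classes, the Counts map and demo() are hypothetical stand-ins, case-class equality stands in for semantic equality, and the function names only mirror the ones mentioned in the discussion. Under those assumptions, the sketch walks the ifExpr3 example from the description and shows add ending up with a use count of 2.

```scala
import scala.collection.mutable

object ConditionalCseSketch {
  // Toy expression tree (hypothetical); case-class equality stands in for semantic equality.
  sealed trait Expr {
    def children: Seq[Expr]
    lazy val height: Int = 1 + children.map(_.height).foldLeft(0)(_ max _)
  }
  case class Lit(value: Any) extends Expr { def children: Seq[Expr] = Nil }
  case class Add(l: Expr, r: Expr) extends Expr { def children: Seq[Expr] = Seq(l, r) }
  case class Mul(l: Expr, r: Expr) extends Expr { def children: Seq[Expr] = Seq(l, r) }
  case class Gt(l: Expr, r: Expr) extends Expr { def children: Seq[Expr] = Seq(l, r) }
  case class If(cond: Expr, thenE: Expr, elseE: Expr) extends Expr {
    def children: Seq[Expr] = Seq(cond, thenE, elseE)
  }

  type Counts = mutable.HashMap[Expr, Int]

  // Children that are always evaluated.
  def childrenToRecurse(e: Expr): Seq[Expr] = e match {
    case If(cond, _, _) => Seq(cond)
    case other => other.children
  }

  // Groups of branches of which only one is evaluated.
  def commonChildrenToRecurse(e: Expr): Seq[Seq[Expr]] = e match {
    case If(_, t, f) => Seq(Seq(t, f))
    case _ => Nil
  }

  // Applies a signed delta; returns true if e was already known and is still in the map.
  def updateExprInMap(e: Expr, map: Counts, useCount: Int): Boolean = {
    val known = map.contains(e)
    val updated = map.getOrElse(e, 0) + useCount
    if (updated < 0) throw new IllegalStateException(s"Negative use count for $e")
    if (updated == 0) map -= e else map(e) = updated
    known && updated > 0
  }

  // Adds (useCount > 0) or removes (useCount < 0) a tree; leaves are skipped.
  def updateExprTree(e: Expr, map: Counts, useCount: Int = 1): Unit = {
    val skip = useCount == 0 || e.children.isEmpty || updateExprInMap(e, map, useCount)
    if (!skip) {
      val uc = math.signum(useCount)
      childrenToRecurse(e).foreach(updateExprTree(_, map, uc))
      commonChildrenToRecurse(e).filter(_.nonEmpty).foreach(updateCommonExprs(_, map, uc))
    }
  }

  // Finds expressions present in every branch and moves them into `map`.
  def updateCommonExprs(branches: Seq[Expr], map: Counts, useCount: Int): Unit = {
    var local = mutable.HashMap.empty[Expr, Int]
    updateExprTree(branches.head, local)
    branches.tail.foreach { b =>
      val other = mutable.HashMap.empty[Expr, Int]
      updateExprTree(b, other)
      local = local.filter { case (k, _) => other.contains(k) }
    }
    // Take the highest common expression, remove its tree from `local` (the
    // reverse of adding it), then add it to `map`. Repeat until `local` is empty.
    while (local.nonEmpty) {
      val (top, count) = local.maxBy(_._1.height)
      updateExprTree(top, local, -count)
      updateExprTree(top, map, useCount)
    }
  }

  // The ifExpr3 example from the PR description, expressed in the toy model.
  def demo(): Map[Expr, Int] = {
    val add = Add(Lit(1), Lit(2))
    val ifExpr1 = If(Lit(true), add, Lit(3))
    val ifExpr3 = If(Gt(add, Lit(4)), Add(ifExpr1, add), Mul(ifExpr1, add))
    val map = mutable.HashMap.empty[Expr, Int]
    updateExprTree(ifExpr3, map)
    map.toMap
  }
}
```

In this sketch, removing ifExpr1 from the local map does not touch add, because add sits in a branch of ifExpr1 rather than among its always-evaluated children. The second loop iteration then moves add into the outer map, where the predicate of ifExpr3 has already recorded it once.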

Contributor

Does transparently canonicalized mean PromotePrecision(add).canonicalized == add.canonicalized?

Contributor Author

Yes, maybe I could rephrase this if it doesn't make sense.
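
A toy illustration of why such a wrapper gets double counted may help here. Everything in it is a hypothetical stand-in: Transparent plays the role of a wrapper like PromotePrecision, canonicalized is reduced to the bare minimum, and the skipTransparent guard is only one possible way to avoid the double count, not necessarily what this PR does.

```scala
import scala.collection.mutable

object TransparentSketch {
  sealed trait Expr {
    def children: Seq[Expr]
    // Canonical form used as the equivalence key; by default the node itself.
    def canonicalized: Expr = this
  }
  case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
  case class Add(l: Expr, r: Expr) extends Expr { def children: Seq[Expr] = Seq(l, r) }

  // Hypothetical wrapper whose canonical form is that of its child.
  case class Transparent(child: Expr) extends Expr {
    def children: Seq[Expr] = Seq(child)
    override def canonicalized: Expr = child.canonicalized
  }

  def isTransparent(e: Expr): Boolean =
    e.children.size == 1 && e.canonicalized == e.children.head.canonicalized

  // Counts uses keyed by canonical form. With skipTransparent = false the wrapper
  // and its child land on the same key, so one evaluation is counted twice.
  def count(root: Expr, skipTransparent: Boolean): Map[Expr, Int] = {
    val counts = mutable.HashMap.empty[Expr, Int]
    def visit(e: Expr): Unit = if (e.children.nonEmpty) {
      if (skipTransparent && isTransparent(e)) visit(e.children.head)
      else {
        val key = e.canonicalized
        val seen = counts.contains(key)
        counts(key) = counts.getOrElse(key, 0) + 1
        if (!seen) e.children.foreach(visit)
      }
    }
    visit(root)
    counts.toMap
  }
}
```

With add = Add(Lit(1), Lit(2)), count(Transparent(add), skipTransparent = false) records add twice even though it is evaluated once, while skipTransparent = true records it once.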

@viirya
Member

viirya commented Jul 10, 2021

Fixes the performance issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455 and partially fixed in https://github.com/apache/spark/pull/33142/files#r660897248 using a new option to remove expressions from equivalence maps.
Fixes an issue with identifying common expressions in conditional expressions (a side effect of the above).
Fixes the issue where transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions.

Could you explain more about the three points listed in the description?

Comment on lines +119 to +120
Member

Oh, I got your idea here. For performance, is there a significant difference? The current approach is simpler. The filtering in localEquivalenceMap is based on height, so it should be very fast. Is this still a performance bottleneck?

Member

My concern is that this change complicates the computation of useCount, which will make it harder to debug in the future. Until we are certain that this is a real performance bottleneck, this may look like a premature optimization.

Contributor Author

@peter-toth peter-toth Jul 10, 2021

Honestly, I'm not sure how significant the difference is. I think if we have deep expressions in localEquivalenceMap, then filtering by height (as before this PR) might not help a lot.
The new code in this PR might look a bit complicated at first, but it is actually very simple: we just remove expressions from localEquivalenceMap by running the reverse of addExprTree().

This new approach also fixes a bug tested here: https://github.com/apache/spark/pull/33281/files#r667343014
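
The "reverse of addExprTree()" idea can be sketched in isolation as a signed use-count update. This is a minimal toy model under stated assumptions: Expr, Lit, Add, Counts and roundTrip() are hypothetical, case-class equality stands in for semantic equality, and throwing IllegalStateException on an impossible count is an illustrative choice.

```scala
import scala.collection.mutable

object SignedUpdateSketch {
  // Toy expression tree (hypothetical); case-class equality stands in for
  // semantic equality of expressions.
  sealed trait Expr { def children: Seq[Expr] }
  case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  type Counts = mutable.HashMap[Expr, Int]

  // Applies a signed delta to one expression's use count. Returns true when the
  // expression was already known and is still in the map, in which case its
  // children are not visited.
  def updateExprInMap(expr: Expr, map: Counts, useCount: Int): Boolean =
    map.get(expr) match {
      case Some(old) if old + useCount > 0 => map(expr) = old + useCount; true
      case Some(old) if old + useCount == 0 => map -= expr; false
      case None if useCount > 0 => map(expr) = useCount; false
      case _ => throw new IllegalStateException(s"Cannot apply $useCount to $expr")
    }

  // Adds (useCount > 0) or removes (useCount < 0) a whole tree; leaves are skipped.
  def updateExprTree(expr: Expr, map: Counts, useCount: Int = 1): Unit = {
    val skip = expr.children.isEmpty || updateExprInMap(expr, map, useCount)
    if (!skip) expr.children.foreach(updateExprTree(_, map, math.signum(useCount)))
  }

  // Adds a tree and then removes it again, returning both snapshots.
  def roundTrip(): (Map[Expr, Int], Map[Expr, Int]) = {
    val inner = Add(Lit(1), Lit(2))
    val tree = Add(inner, inner)
    val map = mutable.HashMap.empty[Expr, Int]
    updateExprTree(tree, map)
    val afterAdd = map.toMap
    updateExprTree(tree, map, -1)
    (afterAdd, map.toMap)
  }
}
```

In roundTrip(), adding the tree records the outer Add once and the shared inner Add twice; applying the same walk with -1 leaves the map empty, which is the symmetry the removal step relies on.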

Member

I see. I'd rather consider it an improvement, as it doesn't cause a query failure or codegen failure, though it fails to identify a common subexpression.

That said, we don't need to hurry on this for 3.2.

Contributor Author

All right, I've changed the ticket type to improvement.

@peter-toth
Contributor Author

Fixes the performance issue mentioned in https://github.com/apache/spark/pull/32559/files#r633488455 and partially fixed in https://github.com/apache/spark/pull/33142/files#r660897248 using a new option to remove expressions from equivalence maps.
Fixes an issue with identifying common expressions in conditional expressions (a side effect of the above).
Fixes the issue where transparently canonicalized expressions (like PromotePrecision) are considered common subexpressions.

Could you explain more about the three points listed in the description?

Sure. I've updated the description and added some examples.
If we consider this PR too complex, then maybe I should split the 2 commits into 2 separate PRs. The first commit covers the first 2 points; the second commit covers the 3rd.

@peter-toth
Contributor Author

Gentle ping @viirya, @cloud-fan.

@peter-toth
Contributor Author

@viirya, @cloud-fan I wonder if we can move forward with this?

@SparkQA

SparkQA commented Oct 29, 2021

Test build #144726 has finished for PR 33281 at commit 7786c0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExpressionStats(expr: Expression)(var useCount: Int)

Contributor

Let's follow SPARK-33539 and add a new method in QueryExecutionErrors.

Contributor Author

Thanks @allisonwang-db. Added it in 227cad1

@peter-toth peter-toth force-pushed the SPARK-36073-sub-expr-common-children branch from 7786c0c to 227cad1 Compare November 9, 2021 15:38
@SparkQA

SparkQA commented Nov 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49506/

@SparkQA

SparkQA commented Nov 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49506/

@SparkQA

SparkQA commented Nov 9, 2021

Test build #145034 has finished for PR 33281 at commit 227cad1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

map -= wrapper
false
} else {
// Should not happen
Contributor

If this should not happen, I think throwing IllegalStateException is better as it's a bug. QueryExecutionErrors is for user-facing errors.

Contributor Author

@peter-toth peter-toth Nov 10, 2021

Throwing IllegalStateException looks reasonable to me. Fixed in c7c7016

map.put(wrapper, ExpressionStats(expr)(useCount))
} else {
// Should not happen
throw QueryExecutionErrors.updateEquivalentExpressionsError(expr, map, useCount)
Contributor

ditto

Contributor Author

@peter-toth peter-toth Nov 10, 2021

Fixed in c7c7016

@peter-toth peter-toth force-pushed the SPARK-36073-sub-expr-common-children branch from 7c4d506 to c7c7016 Compare November 10, 2021 15:57
@SparkQA

SparkQA commented Nov 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49538/

@SparkQA

SparkQA commented Nov 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49538/

@SparkQA

SparkQA commented Nov 10, 2021

Test build #145069 has finished for PR 33281 at commit c7c7016.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f153029 Nov 11, 2021
@peter-toth
Contributor Author

Thanks for the review @cloud-fan!
