
Conversation

@dbtsai (Member) commented Jul 23, 2018

What changes were proposed in this pull request?

Similar to SPARK-24890, if all the outputs of a CaseWhen are semantically equivalent, the CaseWhen can be removed.
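
As an illustration (not code from this patch), a minimal DataFrame sketch of the effect; the local session setup and the column name a are made up for the example, and the exact optimized plan may vary:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// All branch values and the else value are the same literal and the conditions are
// deterministic, so the whole CASE WHEN can be folded to that literal.
val spark = SparkSession.builder().master("local[*]").appName("casewhen-demo").getOrCreate()
val df = spark.range(5).toDF("a")
val q = df.select(
  when(col("a") > 1, lit(1))
    .when(col("a") < 0, lit(1))
    .otherwise(lit(1))
    .as("x"))
q.explain() // with this rule, the optimized plan should project the constant 1 as x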

How was this patch tested?

Tests added.

@kiszk (Member) commented Jul 23, 2018

This PR also has a similar issue if a condition has a side effect.

@SparkQA commented Jul 23, 2018

Test build #93460 has finished for PR 21852 at commit dc8de5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

CaseWhen( h :+ t.head, None)

case e @ CaseWhen(branches, Some(elseValue)) if {
// With previous rules, it's guaranteed that there must be one branch.
Member:
Is this comment correct?

Member Author:
You're right. I removed the comment. Thanks.

CaseWhen(normalBranch :: trueBranch :: Nil, None))
}

test("remove entire CaseWhen if all the outputs are semantic equivalence") {
Member:
We may need a test case that includes a non-deterministic condition.

Member Author:
Yes, I plan to add a couple more tests tonight.

// For non-deterministic conditions with side effect, we can not remove it.
// Since the output of all the branches are semantic equivalence, `elseValue`
// is picked for all the branches.
val newBranches = branches.map(_._1).filter(!_.deterministic).map(cond => (cond, elseValue))
Member:
All conditions must be deterministic; otherwise, a non-deterministic condition that was not evaluated before this rule could end up being evaluated after it.
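
As a sketch of that concern (hypothetical example, assuming a SparkSession named spark; not code from this PR): all branch values match the else value, but one condition calls rand(), so the rule must not change whether or how often that condition is evaluated.

import org.apache.spark.sql.functions._

val df = spark.range(5).toDF("a")
val guarded = df.select(
  when(col("a") > 0, lit(1))
    .when(rand() > 0.5, lit(1)) // non-deterministic condition
    .otherwise(lit(1))
    .as("x"))
guarded.explain() // the non-deterministic condition should still appear in the optimized plan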

@SparkQA commented Jul 26, 2018

Test build #93576 has finished for PR 21852 at commit 0b67e2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2018

Test build #93578 has finished for PR 21852 at commit 4acda6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} =>
// For non-deterministic conditions with side effect, we can not remove it, or change
// the ordering. As a result, we try to remove the deterministic conditions from the tail.
val newBranches = branches.map(_._1)
Member Author:
@viirya I think this can address your concern. Thanks.

@SparkQA commented Jul 27, 2018

Test build #93627 has finished for PR 21852 at commit dde7959.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2018

Test build #93628 has finished for PR 21852 at commit 4d1e55e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2018

Test build #93643 has finished for PR 21852 at commit 9171773.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member Author) commented Jul 27, 2018

+cc @cloud-fan and @gatorsmile

val (h, t) = branches.span(_._1 != TrueLiteral)
CaseWhen( h :+ t.head, None)

case e @ CaseWhen(branches, Some(elseValue)) if {
@cloud-fan (Contributor), Jul 28, 2018:
can we apply this optimization when there is no elseValue?

Member Author:
We can not. When there is no elseValue, all the conditions have to be evaluated before falling through to the default elseValue, which is null.
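
A small sketch of why (hypothetical example, assuming a SparkSession named spark; not code from this PR): without an otherwise/ELSE, unmatched rows produce null, so the expression cannot be folded even though both branches return 1.

import org.apache.spark.sql.functions._

val df = spark.range(5).toDF("a")
val noElse = df.select(
  when(col("a") > 10, lit(1))
    .when(col("a") < -10, lit(1))
    .as("x")) // rows with -10 <= a <= 10 yield null, not 1, so folding to 1 would be wrong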

Contributor:

Ah, I see. Another optimization: we can remove branches that have the same condition. We can do it in the next PR.
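
To illustrate that follow-up idea (hypothetical example, assuming a SparkSession named spark; not part of this PR): with a repeated deterministic condition, the later duplicate branch can never be reached and could be dropped by such a rule.

import org.apache.spark.sql.functions._

val dup = spark.range(5).toDF("a").select(
  when(col("a") > 1, lit("first"))
    .when(col("a") > 1, lit("second")) // unreachable: same deterministic condition as above
    .otherwise(lit("other"))
    .as("x"))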

Member:
@cloud-fan It sounds like that is what #21904 proposes?


case e @ CaseWhen(branches, Some(elseValue)) if {
val values = branches.map(_._2) :+ elseValue
values.tail.forall(values.head.semanticEquals)
@cloud-fan (Contributor), Jul 30, 2018:
I think the case is: remove branches that have the same value as the else:

elseValue.deterministic && branches.exists(_._2.semanticEquals(elseValue))

@dbtsai (Member Author), Jul 30, 2018:
For this rule, all the output values have to be the same, so exists is not strong enough.

Member Author:
I replaced the condition with branches.forall(_._2.semanticEquals(elseValue)), which is simpler.

@SparkQA commented Jul 30, 2018

Test build #93800 has finished for PR 21852 at commit 584ec81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) left a comment:

LGTM

CaseWhen( h :+ t.head, None)

case e @ CaseWhen(branches, Some(elseValue)) if branches
.forall(_._2.semanticEquals(elseValue)) =>
Contributor:

code style nit:

case e @ CaseWhen(branches, Some(elseValue))
    if branches.forall(_._2.semanticEquals(elseValue))

case e @ CaseWhen(branches, Some(elseValue)) if branches
.forall(_._2.semanticEquals(elseValue)) =>
// For non-deterministic conditions with side effect, we can not remove it, or change
// the ordering. As a result, we try to remove the deterministic conditions from the tail.
Contributor:

I think it's more readable to write Java-style code here:

var hitNonDetermin = false
var i = branches.length 
while (i > 0 && !hitNonDetermin) {
  hitNonDetermin = !branches(i - 1).deterministic
  i -= 1
}
if (i == 0) {
  elseValue
} else {
  e.copy(branches = branches.take(i))
}

Member Author:

Should be

  hitNonDetermin = !branches(i - 1).deterministic
  if (!hitNonDetermin) {
    i -= 1
  }

Personally, I like functional style more, but it's more efficient to use Java style here. I updated as you suggested.
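
For comparison, a functional-style sketch of the same tail trimming (illustrative only, not the merged code; it reuses the branches, elseValue, and e bindings from the pattern above):

// Keep everything up to and including the last non-deterministic condition;
// if every condition is deterministic, the whole expression folds to elseValue.
val kept = branches.reverse.dropWhile(_._1.deterministic).reverse
if (kept.isEmpty) elseValue else e.copy(branches = kept)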

@viirya (Member) commented Jul 31, 2018

LGTM

@SparkQA commented Aug 1, 2018

Test build #93845 has finished for PR 21852 at commit 65fb8c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 1, 2018

Test build #93847 has finished for PR 21852 at commit b3d5a4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

var i = branches.length
while (i > 0 && !hitNonDeterministicCond) {
hitNonDeterministicCond = !branches(i - 1)._1.deterministic
if (!hitNonDeterministicCond) {
Contributor:

nit: we can avoid this per-iteration if check by updating the final step

if (i == 0 && !hitNonDeterministicCond) {
  elseValue
} else {
  e.copy(branches = branches.take(i + 1).map(branch => (branch._1, elseValue)))
}

feel free to change it in your next PR.

@asfgit asfgit closed this in 5f3441e Aug 1, 2018
@wangyum (Member) commented Nov 6, 2020

It seems we simplified non-deterministic expressions with aliases. For example:

CREATE TABLE t(a int, b int, c int) using parquet
SELECT CASE                          
    WHEN rand(100) > 1 THEN 1        
    WHEN rand(100) + 1 > 1000 THEN 1 
    WHEN rand(100) + 2 < 100 THEN 1  
    ELSE 1                           
END AS x                             
FROM t                                                        

The plan is:

== Physical Plan ==
*(1) Project [CASE WHEN (rand(100) > 1.0) THEN 1 WHEN ((rand(100) + 1.0) > 1000.0) THEN 1 WHEN ((rand(100) + 2.0) < 100.0) THEN 1 ELSE 1 END AS x#6]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

SELECT CASE                                  
    WHEN rd > 1 THEN 1                       
    WHEN rd + 1 > 1000 THEN 1                
    WHEN rd + 2 < 100 THEN 1                 
    ELSE 1                                   
END AS x                                     
FROM (SELECT *, rand(100) as rd FROM t) t1                                   

The plan is:

== Physical Plan ==
*(1) Project [1 AS x#1]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

@HyukjinKwon (Member)

Hm, seems like it'd be a correctness bug. @wangyum would you mind filing a JIRA and setting it as a blocker?

@cloud-fan (Contributor)

@wangyum this looks correct. After SELECT *, rand(100) as rd FROM t, the output column rd is deterministic, as the result of rand(100) is materialized. We can treat it like writing SELECT *, rand(100) as rd FROM t to a table and reading it back.

However, I do see a problem:

scala> sql("select 1 FROM (SELECT *, rand(100) as rd FROM t) t1").explain
== Physical Plan ==
*(1) Project [1 AS 1#4]
+- FileScan json default.t[] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/cloud0fan/dev/spark/assembly/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

Seems we shouldn't eliminate the project with rand(100). But it's not related to this PR.

@viirya (Member) commented Nov 6, 2020

I agree with @cloud-fan. The rd in the CaseWhen is already deterministic. If it is embedded directly in the CaseWhen, this PR should not remove it.
