[SPARK-33315][SQL] Simplify CaseWhen with EqualTo #30222

wangyum · 2020-11-02T07:02:10Z

What changes were proposed in this pull request?

This PR simplifies CaseWhen with EqualTo when all values are Literal. This is a real case from production:

create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1
union all
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2;

explain select * from v1 where event_type = 'a';

Before this PR:

== Physical Plan ==
Union
:- *(1) Project [a AS event_type#30533, id#30535L]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet
+- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L]
   +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet

After this PR:

== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

SparkQA · 2020-11-02T07:47:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35118/

SparkQA · 2020-11-02T08:05:02Z

Test build #130518 has finished for PR 30222 at commit 3a1cd10.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-02T08:16:58Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35118/

wangyum · 2020-11-02T08:29:50Z

retest this please.

SparkQA · 2020-11-02T09:16:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35121/

SparkQA · 2020-11-02T09:44:59Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35121/

SparkQA · 2020-11-02T12:52:59Z

Test build #130521 has finished for PR 30222 at commit 3a1cd10.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala

dongjoon-hyun · 2020-11-02T18:13:31Z

Also, cc @cloud-fan and @sunchao

SparkQA · 2020-11-03T02:44:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35141/

SparkQA · 2020-11-03T03:12:06Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35141/

SparkQA · 2020-11-03T06:51:29Z

Test build #130541 has finished for PR 30222 at commit 593678c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-11-04T02:20:49Z

Hive optimizes it to the predicate CASE WHEN ((a = 100)) THEN (false) WHEN ((b > 1000)) THEN (true) WHEN (c is not null) THEN (false) ELSE (null) END (type: boolean), but this condition cannot be pushed down. We could optimize it to b > 1000 and push that down.

hive> explain SELECT *
    > FROM   (SELECT CASE
    >                  WHEN a = 100 THEN 1
    >                  WHEN b > 1000 THEN 2
    >                  WHEN c IS NOT NULL THEN 3
    >                END AS x
    >         FROM   t) tmp
    > WHERE  x = 2;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: t
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Filter Operator
            predicate: CASE WHEN ((a = 100)) THEN (false) WHEN ((b > 1000)) THEN (true) WHEN (c is not null) THEN (false) ELSE (null) END (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: CASE WHEN ((a = 100)) THEN (1) WHEN ((b > 1000)) THEN (2) WHEN (c is not null) THEN (3) ELSE (null) END (type: int)
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              ListSink

cloud-fan · 2020-11-04T04:40:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

        }
+
+      case EqualTo(CaseWhen(branches, _), right)
+          if branches.count(_._2.semanticEquals(right)) == 1 =>


If there is more than one match, shall we combine the conditions with Or?

cloud-fan · 2020-11-04T04:45:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

          e.copy(branches = branches.take(i).map(branch => (branch._1, elseValue)))
        }
+
+      case EqualTo(CaseWhen(branches, _), right)


I'm a bit worried about dropping other branches in CASE WHEN. a.semanticEquals(b) means a is always equal to b. But !a.semanticEquals(b) doesn't mean that a will never be equal to b.

As an example, (CASE WHEN a=1 THEN 1 ELSE b END) = 1 can be true if a=1 or b=1.
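
The concern above can be checked with a small model. This is a minimal sketch in plain Python, not Catalyst code: `case_when`, `original` and `rewritten` are hypothetical helpers, rows are plain dicts, and NULL handling is ignored. It compares the original predicate with a rewrite that keeps only the branch whose value matches the literal.

```python
def case_when(branches, else_value, row):
    """Return the value of the first branch whose condition holds, else the ELSE value."""
    for cond, value in branches:
        if cond(row):
            return value(row)
    return else_value(row)


def original(row):
    # (CASE WHEN a = 1 THEN 1 ELSE b END) = 1
    result = case_when(
        [(lambda r: r["a"] == 1, lambda r: 1)],
        lambda r: r["b"],
        row,
    )
    return result == 1


def rewritten(row):
    # Hypothetical rewrite: keep only the condition of the branch whose value is the literal 1.
    return row["a"] == 1


rows = [{"a": 1, "b": 5}, {"a": 2, "b": 1}, {"a": 2, "b": 2}]

# Rows on which the two predicates disagree.
mismatches = [r for r in rows if original(r) != rewritten(r)]
print(mismatches)
```

The row with a=2 and b=1 satisfies the original predicate through the ELSE value but is dropped by the rewrite, so the skipped branch cannot be discarded just because it is not semantically equal to the literal.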

SparkQA · 2020-11-05T02:39:55Z

Test build #130629 has finished for PR 30222 at commit ee5e6dd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-05T02:40:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35235/

SparkQA · 2020-11-05T03:01:24Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35235/

SparkQA · 2020-11-05T03:06:55Z

Test build #130630 has finished for PR 30222 at commit b611659.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-05T03:14:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35237/

SparkQA · 2020-11-05T03:45:48Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35237/

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

cloud-fan

LGTM except one minor comment.

SparkQA · 2020-11-05T21:07:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35276/

SparkQA · 2020-11-05T21:34:21Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35276/

wangyum · 2020-11-06T01:37:10Z

It seems this is caused by how deterministic is computed. cc @viirya

== Analyzed Logical Plan ==
label: double, features: vector, fold: int
Filter (UDF(fold#14) AND NOT (fold#14 = 2))
+- Repartition 2, true
   +- Project [label#3, features#4, fold#14]
      +- Project [label#3, features#4, random#10, CASE WHEN (random#10 < 0.33) THEN 0 WHEN (random#10 < 0.66) THEN 1 ELSE 2 END AS fold#14]
         +- Project [label#3, features#4, rand(100) AS random#10]
            +- Repartition 1, true
               +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.ml.feature.LabeledPoint, true])).label AS label#3, newInstance(class org.apache.spark.ml.linalg.VectorUDT).serialize AS features#4]
                  +- ExternalRDD [obj#2]

== Optimized Logical Plan ==
LocalRelation <empty>, [label#3, features#4, fold#14]

HyukjinKwon · 2020-11-06T04:28:58Z

@wangyum, it's #21852 right? Can you file a blocker JIRA?

cloud-fan · 2020-11-06T05:19:25Z

@wangyum do you know, step by step, how we end up optimizing the plan wrongly?

wangyum · 2020-11-06T05:52:49Z

We can reproduce it by:

spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
spark.sql(
  """
    |SELECT *
    |  FROM   (SELECT CASE
    |    WHEN rd > 1 THEN 1
    |    WHEN b > 1000 THEN 2
    |    WHEN c < 100 THEN 3
    |    ELSE 4
    |END AS x
    |FROM (SELECT *, rand(100) as rd FROM t) t1) t2
    |WHERE  x = 2
    |""".stripMargin).explain

Alias.toAttribute constructs an AttributeReference with the default deterministic value, which is true:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala

Line 181 in ca2cfd4

AttributeReference(name, child.dataType, child.nullable, metadata)(exprId, qualifier)
Therefore, deterministic is true, and SimplifyConditionals can simplify it.
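
A toy model of the effect described here, assuming only that an expression counts as deterministic when it and all of its children are deterministic, and that an attribute reference is a leaf with no children. The `Expr` class and its field names are hypothetical and are not Catalyst's classes.

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    name: str
    children: list = field(default_factory=list)
    self_deterministic: bool = True

    @property
    def deterministic(self):
        # Deterministic only if this node and every child are deterministic.
        return self.self_deterministic and all(c.deterministic for c in self.children)


rand = Expr("rand(100)", self_deterministic=False)
alias = Expr("rd", [rand])   # models Alias(rand(100), "rd")
attr = Expr("rd")            # models the attribute that refers to the alias: a leaf

cond_inline = Expr(">", [rand, Expr("1")])    # rand(100) > 1
cond_via_attr = Expr(">", [attr, Expr("1")])  # rd > 1, seen from the outer query

print(cond_inline.deterministic, cond_via_attr.deterministic)
```

In this model the condition written directly over rand(100) is flagged as non-deterministic, while the same condition written over the attribute rd looks deterministic, which is why a check on the CASE WHEN expression alone does not catch it.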

cloud-fan · 2020-11-06T05:59:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

        }
+
+      case EqualTo(c @ CaseWhen(branches, elseValue), right)
+          if c.deterministic &&


More precisely, I think we only need to make sure the skipped branches are all deterministic.

val (picked, skipped) = branches.partition(_._2.equals(right))
if (skipped.forall(_._1.deterministic)) { ... } else { original }

SparkQA · 2020-11-06T06:45:51Z

Test build #130694 has finished for PR 30222 at commit 5a90bfc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-06T06:51:31Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35304/

dongjoon-hyun · 2020-11-06T06:59:10Z

This seems to fail still.

SparkQA · 2020-11-06T07:12:52Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35304/

wangyum · 2020-11-06T08:48:39Z

Sorry, this change has a logic issue. For example:

spark.sql("CREATE TABLE t using parquet AS SELECT if(id % 2 = 7, null, id) AS a FROM range(7)")
spark.sql(
  """
    |SELECT *
    |  FROM   (SELECT CASE
    |    WHEN a > 1 THEN 1
    |    WHEN a > 3 THEN 3
    |    WHEN a > 5 THEN 5
    |    ELSE 6
    |END AS x
    |FROM t ) t1
    |WHERE x = 3
    |""".stripMargin).show

Before this PR the result is empty; after this PR the result is not empty.
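
The difference can be reproduced without Spark. This is a minimal sketch in plain Python, assuming the table holds the values 0 through 6 (the `id % 2 = 7` test in the CREATE TABLE statement never holds, so no NULLs are produced). `case_x` is a hypothetical helper that mirrors the CASE expression, and the last filter is only an illustration of what a rewrite would have to preserve, not something proposed in this PR.

```python
def case_x(a):
    # CASE WHEN a > 1 THEN 1 WHEN a > 3 THEN 3 WHEN a > 5 THEN 5 ELSE 6 END
    if a > 1:
        return 1
    if a > 3:
        return 3
    if a > 5:
        return 5
    return 6


values = list(range(7))

# Original query: WHERE x = 3. The first branch shadows the second one.
before = [a for a in values if case_x(a) == 3]

# Naive rewrite: keep only the condition of the branch that yields 3.
after_naive = [a for a in values if a > 3]

# Illustration only: the earlier condition must also be known not to hold.
after_guarded = [a for a in values if not (a > 1) and a > 3]

print(before, after_naive, after_guarded)
```

Every value greater than 3 is also greater than 1, so the first branch always wins and x = 3 is never produced; replacing the predicate with a > 3 alone returns rows that the original query does not.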

cloud-fan · 2020-11-06T09:17:31Z

I see, the CASE WHEN conditions are not orthogonal, so we can't skip any of them.

dongjoon-hyun · 2020-11-06T20:45:10Z

Thank you for your decision, @wangyum and @cloud-fan .

SparkQA · 2020-12-11T15:53:41Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37273/

SparkQA · 2020-12-11T16:24:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37273/

SparkQA · 2020-12-11T18:35:48Z

Test build #132669 has finished for PR 30222 at commit 312c613.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-12-12T08:48:28Z

@cloud-fan @dongjoon-hyun We can improve the following case to eliminate the Union operator:

create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as                                                             
select 'a' as event_type, * from t1                                                
union all                                                                          
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2;

explain select * from v1 where event_type = 'a';
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#8, id#10L]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[id#10L] Batched: true, DataFilters: [], Format: Parquet,
+- *(2) Project [CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END AS event_type#9, id#11L]
   +- *(2) Filter (CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = a)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t2[id#11L] Batched: true, DataFilters: [(CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = a)], Format: Parquet


explain select * from v1 where event_type = 'b';
== Physical Plan ==
*(1) Project [CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END AS event_type#8, id#11L AS id#10L]
+- *(1) Filter (CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = b)
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t2[id#11L] Batched: true, DataFilters: [(CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = b)], Format: Parquet

cloud-fan · 2020-12-14T08:36:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+          if c.deterministic &&
+            right.isInstanceOf[Literal] && branches.forall(_._2.isInstanceOf[Literal]) &&
+            elseValue.forall(_.isInstanceOf[Literal]) =>
+        if ((branches.map(_._2) ++ elseValue).forall(!_.equals(right))) {


Can we use an EqualTo expression to compare literals? And how about null semantics?
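
To make the null question concrete, here is a minimal sketch in plain Python of SQL three-valued equality, with None standing in for NULL. `sql_eq` and `case_event_type` are hypothetical helpers that model the v1 example from this thread; they are not Spark APIs.

```python
def sql_eq(left, right):
    """SQL '=': the result is NULL (None) when either operand is NULL."""
    if left is None or right is None:
        return None
    return left == right


def case_event_type(id_):
    # CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' END (no ELSE, so NULL otherwise)
    if id_ == 1:
        return "b"
    if id_ == 3:
        return "c"
    return None


# Value of the predicate (CASE ... END = 'a') for a few ids.
results = {id_: sql_eq(case_event_type(id_), "a") for id_ in (1, 2, 3)}

# A filter keeps a row only when the predicate is true, so NULL and false both drop it.
kept = [id_ for id_, r in results.items() if r is True]

print(results, kept, sql_eq(None, None))
```

For id = 2 the predicate is NULL rather than false. Both drop the row when used as a filter, but they are different values if the predicate is projected or negated, and a structural equality check on two NULL literals would report a match where SQL equality yields NULL.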

cloud-fan · 2020-12-14T08:37:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+            right.isInstanceOf[Literal] && branches.forall(_._2.isInstanceOf[Literal]) &&
+            elseValue.forall(_.isInstanceOf[Literal]) =>
+        if ((branches.map(_._2) ++ elseValue).forall(!_.equals(right))) {
+          FalseLiteral


Let's update the JIRA/PR title, as it's a different optimization now.

https://github.com/apache/spark/pull/30790/files

simplify CaseWhen with EqualTo

3a1cd10

dongjoon-hyun reviewed Nov 2, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

dongjoon-hyun reviewed Nov 2, 2020

...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala

Address comment

593678c

cloud-fan reviewed Nov 4, 2020

wangyum added 3 commits November 5, 2020 09:57

fix

ee5e6dd

Add more test

5af6ab3

simplify test

b611659

cloud-fan reviewed Nov 5, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

cloud-fan approved these changes Nov 5, 2020

Address comments

7d7eca3

Add deterministic check

5a90bfc

cloud-fan reviewed Nov 6, 2020

wangyum closed this Nov 6, 2020

wangyum deleted the SPARK-33315 branch November 6, 2020 09:37

wangyum restored the SPARK-33315 branch December 11, 2020 10:22

wangyum added 2 commits December 11, 2020 19:11

Merge remote-tracking branch 'upstream/master' into SPARK-33315

0099d2a

Another case

312c613

wangyum reopened this Dec 11, 2020

github-actions bot added the SQL label Dec 11, 2020

cloud-fan reviewed Dec 14, 2020

wangyum closed this Dec 16, 2020

wangyum deleted the SPARK-33315 branch December 16, 2020 01:37
