-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-33315][SQL] Simplify CaseWhen with EqualTo #30222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Test build #130518 has finished for PR 30222 at commit
|
|
Kubernetes integration test status success |
|
retest this please. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #130521 has finished for PR 30222 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Outdated
Show resolved
Hide resolved
...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala
Outdated
Show resolved
Hide resolved
|
Also, cc @cloud-fan and @sunchao |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #130541 has finished for PR 30222 at commit
|
|
Hive optimized it to |
| } | ||
|
|
||
| case EqualTo(CaseWhen(branches, _), right) | ||
| if branches.count(_._2.semanticEquals(right)) == 1 => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are more than one matches, shall we combine the conditions with Or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
| e.copy(branches = branches.take(i).map(branch => (branch._1, elseValue))) | ||
| } | ||
|
|
||
| case EqualTo(CaseWhen(branches, _), right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about dropping other branches in CASE WHEN. a.semanticEquals(b) means a is always equal to b. But !a.semanticEquals(b) doesn't mean that a will never be equal to b.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an example (CASE WHEN a=1 THEN 1 ELSE b) = 1 can be true if a=1 or b=1.
|
Test build #130629 has finished for PR 30222 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #130630 has finished for PR 30222 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
Outdated
Show resolved
Hide resolved
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except one minor comment.
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
It seems it is caused by deterministic. cc @viirya |
|
@wangyum do you know how we optimize the plan wrongly step by step? |
|
We can reproduce it by: spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
spark.sql(
"""
|SELECT *
| FROM (SELECT CASE
| WHEN rd > 1 THEN 1
| WHEN b > 1000 THEN 2
| WHEN c < 100 THEN 3
| ELSE 4
|END AS x
|FROM (SELECT *, rand(100) as rd FROM t) t1) t2
|WHERE x = 2
|""".stripMargin).explain
|
| } | ||
|
|
||
| case EqualTo(c @ CaseWhen(branches, elseValue), right) | ||
| if c.deterministic && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More precisely, I think we only need to make sure the skipped branches are all deterministic.
val (picked, skipped) = branches.partition(_._2.equals(right))
if (skipped.forall(_._1.determinisitc)) {
...
} else {
original
}
|
Test build #130694 has finished for PR 30222 at commit
|
|
Kubernetes integration test starting |
|
This seems to fail still. |
|
Kubernetes integration test status success |
|
Sorry. This change has logic issue, for example: spark.sql("CREATE TABLE t using parquet AS SELECT if(id % 2 = 7, null, id) AS a FROM range(7)")
spark.sql(
"""
|SELECT *
| FROM (SELECT CASE
| WHEN a > 1 THEN 1
| WHEN a > 3 THEN 3
| WHEN a > 5 THEN 5
| ELSE 6
|END AS x
|FROM t ) t1
|WHERE x = 3
|""".stripMargin).showBefore this pr, the result is empty, after this pr, the result is not empty. |
|
I see, the case when conditions are not orthogonal. We can't skip any of them. |
|
Thank you for your decision, @wangyum and @cloud-fan . |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #132669 has finished for PR 30222 at commit
|
|
@cloud-fan @dongjoon-hyun We can improve the following case to reduce create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);
create temp view v1 as
select 'a' as event_type, * from t1
union all
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2;
explain select * from v1 where event_type = 'a';
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#8, id#10L]
: +- *(1) ColumnarToRow
: +- FileScan parquet default.t1[id#10L] Batched: true, DataFilters: [], Format: Parquet,
+- *(2) Project [CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END AS event_type#9, id#11L]
+- *(2) Filter (CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = a)
+- *(2) ColumnarToRow
+- FileScan parquet default.t2[id#11L] Batched: true, DataFilters: [(CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = a)], Format: Parquet
explain select * from v1 where event_type = 'b';
== Physical Plan ==
*(1) Project [CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END AS event_type#8, id#11L AS id#10L]
+- *(1) Filter (CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = b)
+- *(1) ColumnarToRow
+- FileScan parquet default.t2[id#11L] Batched: true, DataFilters: [(CASE WHEN (id#11L = 1) THEN b WHEN (id#11L = 3) THEN c END = b)], Format: Parquet
|
| if c.deterministic && | ||
| right.isInstanceOf[Literal] && branches.forall(_._2.isInstanceOf[Literal]) && | ||
| elseValue.forall(_.isInstanceOf[Literal]) => | ||
| if ((branches.map(_._2) ++ elseValue).forall(!_.equals(right))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use an EqualTo expression to compare literals? and how about the null semantic?
| right.isInstanceOf[Literal] && branches.forall(_._2.isInstanceOf[Literal]) && | ||
| elseValue.forall(_.isInstanceOf[Literal]) => | ||
| if ((branches.map(_._2) ++ elseValue).forall(!_.equals(right))) { | ||
| FalseLiteral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's update the JIRA/PR title, as it's a different optimization now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.

What changes were proposed in this pull request?
This pr simplify
CaseWhenwithEqualToif all values areLiteral, this is a real case from production:Before this PR:
After this PR:
Why are the changes needed?
Improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test.