-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24893] [SQL] Remove the entire CaseWhen if all the outputs are semantic equivalence #21852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR also has the similar issue if a condition has a side effect. |
|
Test build #93460 has finished for PR 21852 at commit
|
| CaseWhen( h :+ t.head, None) | ||
|
|
||
| case e @ CaseWhen(branches, Some(elseValue)) if { | ||
| // With previous rules, it's guaranteed that there must be one branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I removed the comment. Thanks.
| CaseWhen(normalBranch :: trueBranch :: Nil, None)) | ||
| } | ||
|
|
||
| test("remove entire CaseWhen if all the outputs are semantic equivalence") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need test case including non deterministic cond.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I plan to add couple more tests tonight.
| // For non-deterministic conditions with side effect, we can not remove it. | ||
| // Since the output of all the branches are semantic equivalence, `elseValue` | ||
| // is picked for all the branches. | ||
| val newBranches = branches.map(_._1).filter(!_.deterministic).map(cond => (cond, elseValue)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All conds must be deterministic, otherwise a non deterministic one not run before can be run after this rule.
|
Test build #93576 has finished for PR 21852 at commit
|
|
Test build #93578 has finished for PR 21852 at commit
|
| } => | ||
| // For non-deterministic conditions with side effect, we can not remove it, or change | ||
| // the ordering. As a result, we try to remove the deterministic conditions from the tail. | ||
| val newBranches = branches.map(_._1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya I think this can address your concern. Thanks.
|
Test build #93627 has finished for PR 21852 at commit
|
|
Test build #93628 has finished for PR 21852 at commit
|
|
Test build #93643 has finished for PR 21852 at commit
|
|
+cc @cloud-fan and @gatorsmile |
| val (h, t) = branches.span(_._1 != TrueLiteral) | ||
| CaseWhen( h :+ t.head, None) | ||
|
|
||
| case e @ CaseWhen(branches, Some(elseValue)) if { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we apply this optimization when there is no elseValue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can not. When no elseValue, all the conditions are required to evaluated before hitting the default elseValue which is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see. Another optimization is: we can remove branches that have the same the condition. We can do it in next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan It sounds like what this #21904 proposes for?
|
|
||
| case e @ CaseWhen(branches, Some(elseValue)) if { | ||
| val values = branches.map(_._2) :+ elseValue | ||
| values.tail.forall(values.head.semanticEquals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the case is: remove branches that have the same value of else:
elseValue.deterministic && branches.exists(_._2.semanticEquals(elseValue))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this rule, all the output values have to be the same, so exits is not strong enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the cond by branches.forall(_._2.semanticEquals(elseValue)) which is simpler.
|
Test build #93800 has finished for PR 21852 at commit
|
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| CaseWhen( h :+ t.head, None) | ||
|
|
||
| case e @ CaseWhen(branches, Some(elseValue)) if branches | ||
| .forall(_._2.semanticEquals(elseValue)) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code style nit:
case e @ CaseWhen(branches, Some(elseValue))
if branches.forall(_._2.semanticEquals(elseValue))
| case e @ CaseWhen(branches, Some(elseValue)) if branches | ||
| .forall(_._2.semanticEquals(elseValue)) => | ||
| // For non-deterministic conditions with side effect, we can not remove it, or change | ||
| // the ordering. As a result, we try to remove the deterministic conditions from the tail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more readable to write java style code here
var hitNonDetermin = false
var i = branches.length
while (i > 0 && !hitNonDetermin) {
hitNonDetermin = !branches(i - 1).deterministic
i -= 1
}
if (i == 0) {
elseValue
} else {
e.copy(branches = branches.take(i))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be
hitNonDetermin = !branches(i - 1).deterministic
if (!hitNonDetermin) {
i -= 1
}Personally, I like functional style more, but it's more efficient to use Java style here. I updated as you suggested.
|
LGTM |
|
Test build #93845 has finished for PR 21852 at commit
|
|
Test build #93847 has finished for PR 21852 at commit
|
|
thanks, merging to master! |
| var i = branches.length | ||
| while (i > 0 && !hitNonDeterministicCond) { | ||
| hitNonDeterministicCond = !branches(i - 1)._1.deterministic | ||
| if (!hitNonDeterministicCond) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can avoid this per-iteration if check by updating the final step
if (i == 0 && !hitNonDeterministicCond) {
elseValue
} else {
e.copy(branches = branches.take(i + 1).map(branch => (branch._1, elseValue)))
}
feel free to change it in your next PR.
|
It seems we simplified non-deterministic expressions with aliases. for example: CREATE TABLE t(a int, b int, c int) using parquetSELECT CASE
WHEN rand(100) > 1 THEN 1
WHEN rand(100) + 1 > 1000 THEN 1
WHEN rand(100) + 2 < 100 THEN 1
ELSE 1
END AS x
FROM t The plan is: SELECT CASE
WHEN rd > 1 THEN 1
WHEN rd + 1 > 1000 THEN 1
WHEN rd + 2 < 100 THEN 1
ELSE 1
END AS x
FROM (SELECT *, rand(100) as rd FROM t) t1 The plan is: |
|
Hm, seems like it'd be a correctness bug. @wangyum would you mind filing a JIRA and set it as a blocker? |
|
@wangyum this looks correct. After However, I do see a problem: Seems we shouldn't eliminate the project with |
|
I agree with @cloud-fan. The |
What changes were proposed in this pull request?
Similar to SPARK-24890, if all the outputs of
CaseWhenare semantic equivalence,CaseWhencan be removed.How was this patch tested?
Tests added.