
Conversation

@dbtsai (Member) commented Jul 23, 2018

What changes were proposed in this pull request?

Similar to SPARK-24890, if all the outputs of a CaseWhen are semantically equivalent, the CaseWhen can be removed.
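
As an illustration (not code from this patch), a minimal DataFrame sketch of the effect; the local session setup and the column name a are made up for the example, and the exact optimized plan may vary:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// All branch values and the else value are the same literal and the conditions are
// deterministic, so the whole CASE WHEN can be folded to that literal.
val spark = SparkSession.builder().master("local[*]").appName("casewhen-demo").getOrCreate()
val df = spark.range(5).toDF("a")
val q = df.select(
  when(col("a") > 1, lit(1))
    .when(col("a") < 0, lit(1))
    .otherwise(lit(1))
    .as("x"))
q.explain() // with this rule, the optimized plan should project the constant 1 as x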

How was this patch tested?

Tests added.

@kiszk (Member) commented Jul 23, 2018

This PR also has a similar issue if a condition has a side effect.

@SparkQA commented Jul 23, 2018

Test build #93460 has finished for PR 21852 at commit dc8de5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

CaseWhen( h :+ t.head, None)

case e @ CaseWhen(branches, Some(elseValue)) if {
// With previous rules, it's guaranteed that there must be one branch.
Member:
Is this comment correct?

Member Author:
You're right. I removed the comment. Thanks.

CaseWhen(normalBranch :: trueBranch :: Nil, None))
}

test("remove entire CaseWhen if all the outputs are semantic equivalence") {
Member:
We may need a test case that includes a non-deterministic condition.

Member Author:
Yes, I plan to add a couple more tests tonight.

// For non-deterministic conditions with side effect, we can not remove it.
// Since the output of all the branches are semantic equivalence, `elseValue`
// is picked for all the branches.
val newBranches = branches.map(_._1).filter(!_.deterministic).map(cond => (cond, elseValue))
Member:
All conditions must be deterministic; otherwise, a non-deterministic condition that was not evaluated before this rule could end up being evaluated after it.
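
As a sketch of that concern (hypothetical example, assuming a SparkSession named spark; not code from this PR): all branch values match the else value, but one condition calls rand(), so the rule must not change whether or how often that condition is evaluated.

import org.apache.spark.sql.functions._

val df = spark.range(5).toDF("a")
val guarded = df.select(
  when(col("a") > 0, lit(1))
    .when(rand() > 0.5, lit(1)) // non-deterministic condition
    .otherwise(lit(1))
    .as("x"))
guarded.explain() // the non-deterministic condition should still appear in the optimized plan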

@SparkQA commented Jul 26, 2018

Test build #93576 has finished for PR 21852 at commit 0b67e2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2018

Test build #93578 has finished for PR 21852 at commit 4acda6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} =>
// For non-deterministic conditions with side effect, we can not remove it, or change
// the ordering. As a result, we try to remove the deterministic conditions from the tail.
val newBranches = branches.map(_._1)
Member Author:
@viirya I think this can address your concern. Thanks.

@SparkQA commented Jul 27, 2018

Test build #93627 has finished for PR 21852 at commit dde7959.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2018

Test build #93628 has finished for PR 21852 at commit 4d1e55e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2018

Test build #93643 has finished for PR 21852 at commit 9171773.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member Author) commented Jul 27, 2018

+cc @cloud-fan and @gatorsmile

val (h, t) = branches.span(_._1 != TrueLiteral)
CaseWhen( h :+ t.head, None)

case e @ CaseWhen(branches, Some(elseValue)) if {
@cloud-fan (Contributor), Jul 28, 2018:
can we apply this optimization when there is no elseValue?

Member Author:
We can not. When there is no elseValue, all the conditions have to be evaluated before falling through to the default elseValue, which is null.
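
A small sketch of why (hypothetical example, assuming a SparkSession named spark; not code from this PR): without an otherwise/ELSE, unmatched rows produce null, so the expression cannot be folded even though both branches return 1.

import org.apache.spark.sql.functions._

val df = spark.range(5).toDF("a")
val noElse = df.select(
  when(col("a") > 10, lit(1))
    .when(col("a") < -10, lit(1))
    .as("x")) // rows with -10 <= a <= 10 yield null, not 1, so folding to 1 would be wrong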

Contributor:

Ah, I see. Another optimization: we can remove branches that have the same condition. We can do it in the next PR.
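
To illustrate that follow-up idea (hypothetical example, assuming a SparkSession named spark; not part of this PR): with a repeated deterministic condition, the later duplicate branch can never be reached and could be dropped by such a rule.

import org.apache.spark.sql.functions._

val dup = spark.range(5).toDF("a").select(
  when(col("a") > 1, lit("first"))
    .when(col("a") > 1, lit("second")) // unreachable: same deterministic condition as above
    .otherwise(lit("other"))
    .as("x"))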

Member:
@cloud-fan It sounds like that is what #21904 proposes?


case e @ CaseWhen(branches, Some(elseValue)) if {
val values = branches.map(_._2) :+ elseValue
values.tail.forall(values.head.semanticEquals)
@cloud-fan (Contributor), Jul 30, 2018:
I think the case is: remove branches that have the same value as the else:

elseValue.deterministic && branches.exists(_._2.semanticEquals(elseValue))

@dbtsai (Member Author), Jul 30, 2018:
For this rule, all the output values have to be the same, so exists is not strong enough.

Member Author:
I replaced the condition with branches.forall(_._2.semanticEquals(elseValue)), which is simpler.

@SparkQA commented Jul 30, 2018

Test build #93800 has finished for PR 21852 at commit 584ec81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) left a comment:

LGTM

CaseWhen( h :+ t.head, None)

case e @ CaseWhen(branches, Some(elseValue)) if branches
.forall(_._2.semanticEquals(elseValue)) =>
Contributor:

code style nit:

case e @ CaseWhen(branches, Some(elseValue))
    if branches.forall(_._2.semanticEquals(elseValue))

case e @ CaseWhen(branches, Some(elseValue)) if branches
.forall(_._2.semanticEquals(elseValue)) =>
// For non-deterministic conditions with side effect, we can not remove it, or change
// the ordering. As a result, we try to remove the deterministic conditions from the tail.
Contributor:

I think it's more readable to write Java-style code here:

var hitNonDetermin = false
var i = branches.length 
while (i > 0 && !hitNonDetermin) {
  hitNonDetermin = !branches(i - 1).deterministic
  i -= 1
}
if (i == 0) {
  elseValue
} else {
  e.copy(branches = branches.take(i))
}

Member Author:

Should be

  hitNonDetermin = !branches(i - 1).deterministic
  if (!hitNonDetermin) {
    i -= 1
  }

Personally, I like functional style more, but it's more efficient to use Java style here. I updated as you suggested.
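
For comparison, a functional-style sketch of the same tail trimming (illustrative only, not the merged code; it reuses the branches, elseValue, and e bindings from the pattern above):

// Keep everything up to and including the last non-deterministic condition;
// if every condition is deterministic, the whole expression folds to elseValue.
val kept = branches.reverse.dropWhile(_._1.deterministic).reverse
if (kept.isEmpty) elseValue else e.copy(branches = kept)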

@viirya (Member) commented Jul 31, 2018

LGTM

@SparkQA commented Aug 1, 2018

Test build #93845 has finished for PR 21852 at commit 65fb8c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 1, 2018

Test build #93847 has finished for PR 21852 at commit b3d5a4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

var i = branches.length
while (i > 0 && !hitNonDeterministicCond) {
hitNonDeterministicCond = !branches(i - 1)._1.deterministic
if (!hitNonDeterministicCond) {
Contributor:

nit: we can avoid this per-iteration if check by updating the final step

if (i == 0 && !hitNonDeterministicCond) {
  elseValue
} else {
  e.copy(branches = branches.take(i + 1).map(branch => (branch._1, elseValue)))
}

feel free to change it in your next PR.

@asfgit asfgit closed this in 5f3441e Aug 1, 2018
@wangyum (Member) commented Nov 6, 2020

It seems we simplified non-deterministic expressions with aliases. For example:

CREATE TABLE t(a int, b int, c int) using parquet
SELECT CASE                          
    WHEN rand(100) > 1 THEN 1        
    WHEN rand(100) + 1 > 1000 THEN 1 
    WHEN rand(100) + 2 < 100 THEN 1  
    ELSE 1                           
END AS x                             
FROM t                                                        

The plan is:

== Physical Plan ==
*(1) Project [CASE WHEN (rand(100) > 1.0) THEN 1 WHEN ((rand(100) + 1.0) > 1000.0) THEN 1 WHEN ((rand(100) + 2.0) < 100.0) THEN 1 ELSE 1 END AS x#6]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

SELECT CASE                                  
    WHEN rd > 1 THEN 1                       
    WHEN rd + 1 > 1000 THEN 1                
    WHEN rd + 2 < 100 THEN 1                 
    ELSE 1                                   
END AS x                                     
FROM (SELECT *, rand(100) as rd FROM t) t1                                   

The plan is:

== Physical Plan ==
*(1) Project [1 AS x#1]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/sql/core/spark-warehouse/org.apache.spark...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

@HyukjinKwon (Member)

Hm, seems like it'd be a correctness bug. @wangyum would you mind filing a JIRA and setting it as a blocker?

@cloud-fan (Contributor)

@wangyum this looks correct. After SELECT *, rand(100) as rd FROM t, the output column rd is deterministic, as the result of rand(100) is materialized. We can treat it like writing SELECT *, rand(100) as rd FROM t to a table and reading it back.

However, I do see a problem:

scala> sql("select 1 FROM (SELECT *, rand(100) as rd FROM t) t1").explain
== Physical Plan ==
*(1) Project [1 AS 1#4]
+- FileScan json default.t[] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex[file:/Users/cloud0fan/dev/spark/assembly/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

Seems we shouldn't eliminate the project with rand(100). But it's not related to this PR.

@viirya (Member) commented Nov 6, 2020

I agree with @cloud-fan. The rd in the CaseWhen is already deterministic. If it is embedded directly in the CaseWhen, this PR should not remove it.
