[SPARK-28441][SQL][Python] Fix error when non-foldable expression is used in correlated scalar subquery #25204

viirya · 2019-07-19T14:36:44Z

What changes were proposed in this pull request?

In SPARK-15370, we check the expression at the root of the correlated subquery in order to fix the COUNT bug. If a PythonUDF is in the checking path, evaluating it causes a failure, because we can't statically evaluate a PythonUDF. The Python UDF test added in SPARK-28277 shows this issue.

If we can statically evaluate the expression, we intercept NULL values coming from the outer join and replace them with the value that the subquery's expression would return on empty input, as before. If we cannot, we replace them with the PythonUDF expression, with statically evaluated parameters.

After this, the last query in udf-except.sql, which previously threw java.lang.UnsupportedOperationException, can be run:

SELECT t1.k
FROM   t1
WHERE  t1.v <= (SELECT   udf(max(udf(t2.v)))
                FROM     t2
                WHERE    udf(t2.k) = udf(t1.k))
MINUS
SELECT t1.k
FROM   t1
WHERE  udf(t1.v) >= (SELECT   min(udf(t2.v))
                FROM     t2
                WHERE    t2.k = t1.k)
-- !query 2 schema
struct<k:string>
-- !query 2 output
two

Note that this issue also affects other non-foldable expressions, such as rand. As with PythonUDF, we can't call eval on this kind of expression during optimization; the evaluation has to be deferred to query runtime.

How was this patch tested?

Added tests.

SparkQA · 2019-07-19T19:00:11Z

Test build #107915 has finished for PR 25204 at commit 725304c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-07-20T07:18:47Z

This makes sense to me from a cursory look. I will take a closer look if this doesn't get merged or reviewed. cc @cloud-fan too.

cloud-fan · 2019-07-22T02:23:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

    }
-    Option(rewrittenExpr.eval())
+    if (rewrittenExpr.find(_.isInstanceOf[PythonUDF]).isDefined) {
+      // SPARK-28441: `PythonUDF` can't be statically evaluated.


many expressions can't be statically evaluated, why only special-case python udf?

This issue was found with PythonUDF. I thought about covering all unevaluable expressions here, but I'm not sure whether that is too aggressive.

Do you think we should cover all unevaluable?

How do you define "can't be statically evaluated"? Do you mean !expr.foldable?

PythonUDF is Unevaluable. So you can't call eval on it.

Here it fakes the empty input case and evaluates the expressions in the subquery, so it doesn't require foldable.

We can't call Expression.eval(null) if the expression is not foldable, otherwise an exception may be thrown:

1. AttributeReference.eval(null) fails with an NPE.

2. Nondeterministic.eval(null) fails because it needs to be initialized first.

Whatever hack we use, I'd expect it makes the expression foldable.

For 1, the AttributeReference is replaced with a pre-evaluated value. If it comes from an aggregate function, the default value is used, which fakes the empty input case; if it does not, null is used.

For 2, I think it is a potential issue.

Yeah, here the hack makes it look like a foldable expression. It simulates empty input.

It seems necessary to me to check foldable before calling .eval(), otherwise there is no guarantee that .eval() can succeed.

Good point. Use foldable here to check.
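To illustrate the guard agreed on in this exchange, here is a minimal, self-contained sketch. It does not use Spark's Catalyst classes: `Expr`, `Lit`, `Add`, `RuntimeOnly` and `tryEval` are hypothetical names for a toy expression tree, and `RuntimeOnly` merely stands in for something like PythonUDF or rand. It only shows the idea of checking foldable before calling eval during optimization.

```scala
// Toy model, not Spark's Catalyst classes: all names here are illustrative.
sealed trait Expr {
  def foldable: Boolean
  def eval(): Any
}

// A constant is always safe to evaluate during optimization.
final case class Lit(value: Any) extends Expr {
  val foldable = true
  def eval(): Any = value
}

// Foldable only when both children are foldable.
final case class Add(left: Expr, right: Expr) extends Expr {
  def foldable: Boolean = left.foldable && right.foldable
  def eval(): Any = (left.eval(), right.eval()) match {
    case (a: Int, b: Int) => a + b
    case _ => null
  }
}

// Stands in for PythonUDF or rand(): it can only run at query runtime.
final case class RuntimeOnly(name: String, child: Expr) extends Expr {
  val foldable = false
  def eval(): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $name")
}

object FoldableGuard {
  // Evaluate now if that is safe; otherwise keep the expression for runtime.
  def tryEval(e: Expr): Expr = if (e.foldable) Lit(e.eval()) else e
}
```

With this guard, a foldable tree collapses to a literal, while a tree containing a runtime-only node is returned unchanged instead of throwing.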

cloud-fan · 2019-07-22T09:37:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

        }
    }
-    Option(rewrittenExpr.eval())
+    if (!rewrittenExpr.foldable) {


shall we apply the check in more places? evalAggOnZeroTups also calls eval() directly.

Yes. This can't happen for PythonUDF, but it is possible for other non-foldable expressions.

So it is not covered by the added test. Let me add a test for it...

cloud-fan · 2019-07-22T09:40:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

-            .asInstanceOf[Boolean]
-          if (exprResult) bindings else Map.empty
-        }
+        evalPlan(child)


shouldn't we evaluate the filter condition?

SparkQA · 2019-07-22T13:00:33Z

Test build #108005 has finished for PR 25204 at commit 7972d7c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-07-23T14:21:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

-    Option(rewrittenExpr.eval())
+
+    // Removes Alias over given expression, because Alias is not foldable.
+    if (!removeAlias(rewrittenExpr).foldable) {


seems like we can move the following code into a common method?

cloud-fan · 2019-07-23T14:27:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

+    } else {
+      val exprVal = rewrittenExpr.eval()
+      if (exprVal == null) {
+        None


Do you know why we need to return None here instead of a null literal?

I think it uses None to make checking bindings easier.

Alternatively, to use a null literal, Option[Expression] can be changed to Expression in methods like evalSubqueryOnZeroTups and evalPlan. Then we check bindings against a literal instead of None. The good thing is that we can write Literal.create(rewrittenExpr.eval(), expr.dataType) instead of checking for null. It looks like just a matter of choice.
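The two styles discussed here can be contrasted in a small sketch. `TypedLit`, `asOption` and `asLiteral` are hypothetical stand-ins, not Spark's `Literal` API; the data type is modelled as a plain string for brevity.

```scala
// Hypothetical stand-in; Spark's real Literal and DataType are richer.
final case class TypedLit(value: Any, dataType: String)

object NullHandling {
  // Style 1: a null evaluation result is signalled with None.
  def asOption(result: Any, dataType: String): Option[TypedLit] =
    if (result == null) None else Some(TypedLit(result, dataType))

  // Style 2: always return a literal; a null result becomes a null literal,
  // so no null check is needed at the construction site.
  def asLiteral(result: Any, dataType: String): TypedLit =
    TypedLit(result, dataType)
}
```

Style 2 moves the null handling to the places that consume the literal, which is the trade-off being weighed in the comment.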

cloud-fan · 2019-07-23T14:58:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

tryEvalExpr?

cloud-fan · 2019-07-23T15:01:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

+          bindings
+        } else {
+          val bindExpr = bindingExpr(condition, bindings)
+            .getOrElse(Literal.create(false, BooleanType))


For filter condition, null is the same as false. This is one place that makes me think bindingExpr should return Option[Expression].

If this is the only place, I think it's simpler to always return an expression, and handle null specially here.

Ok. I may try this way tomorrow.

Yeah, this works. Looks good as it's simple.
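The point that a null filter condition behaves like false can be shown with a tiny sketch. `keepsRow` is a hypothetical helper, with `None` modelling SQL NULL.

```scala
object FilterOnNull {
  // SQL three-valued logic: a predicate yields true, false, or NULL (None here).
  // A Filter keeps a row only when the predicate is exactly true.
  def keepsRow(condition: Option[Boolean]): Boolean = condition.getOrElse(false)
}
```

This is why mapping a null condition to a false literal in that one place does not change which rows pass the filter.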

cloud-fan · 2019-07-23T15:04:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

+   * This replaces original expression id used in attributes and aliases in expression.
+   */
+  private def replaceOldExprId(
+      orgExprId: ExprId,


orgExprId -> oldExprId ?

cloud-fan · 2019-07-23T15:12:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

+            // We replace original expression id with a new one. The added Alias column
+            // must use expr id of original output. If we don't replace old expr id in the
+            // query, the added Project in potential Project-Filter-Project can be removed
+            // by removeProjectBeforeFilter in ColumnPruning.


shall we fix removeProjectBeforeFilter to only remove attribute-only projects?

Worth trying. Right now I'm not sure whether anything else will be affected.

Tried locally. The added subquery tests pass. We can see if Jenkins passes.

Ok. Seems fine. Jenkins passes.

SparkQA · 2019-07-23T18:22:48Z

Test build #108059 has finished for PR 25204 at commit 110a39e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-23T18:27:45Z

Test build #108057 has finished for PR 25204 at commit 33441a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-24T20:20:33Z

Test build #108109 has finished for PR 25204 at commit 0158d85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-24T20:53:08Z

Test build #108112 has finished for PR 25204 at commit 2dd29c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-07-25T08:46:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

+  /**
+   * This replaces original expression id used in attributes and aliases in expression.
+   */
+  private def replaceOldExprId(


we can remove this.

cloud-fan · 2019-07-25T08:50:56Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+      Row(1) :: Row(1) ::Row(null) :: Row(null) :: Row(6) :: Nil)
+  }
+
+  test("SPARK-28441: COUNT bug in subquery in subquery in subquery with non-foldable expr") {


mgaido91 · 2019-07-25T08:50:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-      if p2.outputSet.subsetOf(child.outputSet) =>
+      if p2.outputSet.subsetOf(child.outputSet) &&
+        // We only remove attribute-only project.
+        p2.projectList.forall(_.isInstanceOf[AttributeReference]) =>


I am not sure about this change. This may cause serious perf regression

How can we remove project that's not attribute-only?

I'd say it was wrong previously, but if a project's output has the same expr IDs as its child, it's usually attribute-only.

Mmmmh... I may be missing something, but I'd imagine a case like this:

select a, b from (select a, b, very_expensive_operation as c from ... where a = 1)

Before this change, would be optimized as:

select a, b from (select a, b from ... where a = 1)

while after it is not. Am I wrong?

In the above case, there is an Alias in the project list, so it's not an attribute-only project. I think it also creates a new attribute c, so p2.outputSet.subsetOf(child.outputSet) is not satisfied either.

I think the rules in ColumnPruning will trim very_expensive_operation in the end.

I see now, sorry. Why do we need this? It seems unrelated to the fix in this PR, doesn't it?

Oh, the issue was seen in the previous commit 33441a3. It has been overwritten now.

We added a column to handle the COUNT bug. The column checks an always-true leading column, alwaysTrueExpr, and returns a special value if alwaysTrueExpr is null, to simulate the empty-input case.

This column reuses the expr id of the original output in the subquery. In the non-foldable expression case, the added column, sitting in a potential Project-Filter-Project, is trimmed by removeProjectBeforeFilter, because the second project satisfies p2.outputSet.subsetOf(child.outputSet).

My original fix was to create a new expr id and replace the original expr id with it in the subquery, which looked complicated. This seems like a simpler fix, and it looks reasonable.
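The difference between the old and new removal conditions can be sketched with a toy model. `NamedExpr`, `Attr`, `AliasOf` and `ProjectCheck` are hypothetical names, and expr ids are plain `Long` values; this is not the Catalyst implementation, only an illustration of why an output-set check alone can wrongly drop a project that computes something under a reused expr id.

```scala
// Toy model of project-list entries; hypothetical names, not Catalyst classes.
sealed trait NamedExpr { def exprId: Long }
final case class Attr(name: String, exprId: Long) extends NamedExpr
final case class AliasOf(computed: String, name: String, exprId: Long) extends NamedExpr

object ProjectCheck {
  // Old condition: every output expr id already exists in the child's output.
  def outputSubsetOf(projectList: Seq[NamedExpr], childOutput: Set[Long]): Boolean =
    projectList.forall(e => childOutput.contains(e.exprId))

  // New condition: additionally require plain attribute references only.
  def safeToRemove(projectList: Seq[NamedExpr], childOutput: Set[Long]): Boolean =
    outputSubsetOf(projectList, childOutput) &&
      projectList.forall(_.isInstanceOf[Attr])
}
```

An alias that reuses an existing expr id passes the subset check but fails the attribute-only check, so the project that carries the computed column is kept.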

mgaido91 · 2019-07-25T08:51:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala

    newExpression.asInstanceOf[E]
  }

+  private def removeAlias(expr: Expression): Expression = expr match {


what if there are several aliases? Shall we use CleanupAliases instead?

We track expressions with the aggregate expressions as the root, so I think aliases should be contiguous at the top. Using CleanupAliases is also good; at least we don't need to add a new method.

yes, sorry, this is recursive too, but I think it is good to avoid a new method. Thanks.
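For the question about several stacked aliases, a recursive strip handles them naturally. The sketch below is a guess at the shape of such a helper on a toy tree: `Node`, `Leaf` and `Aliased` are hypothetical, and the body of the PR's actual removeAlias is not shown in this thread.

```scala
// Toy expression tree; hypothetical names, not Catalyst classes.
sealed trait Node
final case class Leaf(name: String) extends Node
final case class Aliased(child: Node, name: String) extends Node

object AliasStrip {
  // Peel every consecutive alias off the top of the expression.
  @annotation.tailrec
  def removeAlias(n: Node): Node = n match {
    case Aliased(child, _) => removeAlias(child)
    case other => other
  }
}
```

Because the function recurses on the child, any number of aliases stacked at the top is removed in one call.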

cloud-fan · 2019-07-25T08:51:55Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+    checkAnswer(
+      sql("select l.a from l where " +
+        "(select case when udf(count(*)) = 1 then null else udf(count(*)) end as cnt " +
+        "from r where l.a = r.c) = 0"),


can we use multi-line string to write long SQL? Let's also upper case the keywords.

cloud-fan · 2019-07-25T08:53:53Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+        """
+          |select l.b, (select (r.c + udf(count(*))) is null
+          |from r
+          |where l.a = r.c group by r.c) from l


let's format the SQL in a more readable way. For this particular example

    select
      l.b,
      (
        select (r.c + udf(count(*))) is null
        from r
        where l.a = r.c group by r.c
      )
    from l

cloud-fan · 2019-07-25T08:54:45Z

the fix looks good, some comments about the tests. Thanks for catching and fixing this nasty bug!

cloud-fan · 2019-07-25T10:49:58Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+    registerTestUDF(pythonTestUDF, spark)
+
+    checkAnswer(
+      sql("""SELECT


nit: AFAIK the multi-line string should be written as

    """
      |line1
      |line2
    """

not

    """line1
      |line2
    """
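As a concrete illustration of the preferred layout, here is a minimal sketch using `stripMargin` from the Scala standard library. The object name `SqlText` and the query text are made up for the example.

```scala
object SqlText {
  // Every line starts with '|', so stripMargin removes the indentation uniformly
  // and the first SQL line is aligned with the rest.
  val query: String =
    """
      |SELECT l.a
      |FROM l
      |WHERE l.a > 0
    """.stripMargin
}
```

Starting the text on its own line keeps the first line under the same margin rule as the others, which is the point of the nit.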

SparkQA · 2019-07-25T14:39:18Z

Test build #108165 has finished for PR 25204 at commit 9aea844.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-25T14:55:57Z

Test build #108166 has finished for PR 25204 at commit 1f6b717.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-25T19:27:46Z

Test build #108175 has finished for PR 25204 at commit d7d023d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-07-26T02:08:58Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+            |FROM l WHERE
+            |  (
+            |    SELECT udf(count(*)) + udf(sum(r.d)
+            |  )


The indentation is wrong, it should be

    WHERE
      (
        SELECT udf(count(*)) + udf(sum(r.d))
        FROM r
        WHERE l.a = r.c
      ) = 0

cloud-fan · 2019-07-26T02:09:44Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+    val df = sql("""
+                   |SELECT
+                   |  l.a
+                   |  FROM l WHERE


no indentation here.

cloud-fan · 2019-07-26T02:10:10Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+      sql("""
+            |SELECT l.a FROM l
+            |WHERE (
+            |    SELECT cntPlusOne + 1 AS cntPlusTwo FROM (


shall we be consistent with 2 space indentation?

viirya · 2019-07-26T13:29:58Z

@cloud-fan thanks for identifying the style issue. Fixed.

SparkQA · 2019-07-26T18:51:07Z

Test build #108215 has finished for PR 25204 at commit fd29677.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-07-27T02:39:52Z

thanks, merging to master!

viirya · 2019-07-27T02:51:14Z

thanks @cloud-fan @HyukjinKwon @mgaido91

HyukjinKwon · 2019-07-27T03:32:46Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+    import IntegratedUDFTestUtils._
+
+    val pythonTestUDF = TestPythonUDF(name = "udf")
+    registerTestUDF(pythonTestUDF, spark)


BTW, we should add assume(shouldTestPythonUDFs). Maybe it's not a big deal in general, but it can matter in other vendors' test environments. For instance, if somebody launches a test in a minimal Docker image, it might make the tests fail unexpectedly.

This skipping stuff isn't completely new in our test base. See TestUtils.testCommandAvailable for instance.

HyukjinKwon · 2019-07-27T05:41:06Z

@huaxingao, can you make a followup of SPARK-28277 to re-enable?

spark/sql/core/src/test/resources/sql-tests/inputs/udf/udf-except.sql

Lines 47 to 59 in cd676e9

    -- Except operation that will be replaced by left anti join
    --- [SPARK-28441] udf(max(udf(column))) throws java.lang.UnsupportedOperationException: Cannot evaluate expression: udf(null)
    --- SELECT t1.k
    --- FROM   t1
    --- WHERE  t1.v <= (SELECT   udf(max(udf(t2.v)))
    ---                 FROM     t2
    ---                 WHERE    udf(t2.k) = udf(t1.k))
    --- MINUS
    --- SELECT t1.k
    --- FROM   t1
    --- WHERE  udf(t1.v) >= (SELECT   min(udf(t2.v))
    ---                 FROM     t2
    ---                 WHERE    t2.k = t1.k);

HyukjinKwon · 2019-07-27T05:58:25Z

LGTM too

Fix error when PythonUDF is used in outer plan.

725304c

viirya mentioned this pull request Jul 19, 2019

[SPARK-28277][SQL][PYTHON][TESTS] Convert and port 'except.sql' into UDF test base #25101

Closed

dongjoon-hyun added PYSPARK SQL labels Jul 20, 2019

cloud-fan reviewed Jul 22, 2019

Address comment.

7972d7c

cloud-fan reviewed Jul 22, 2019

Fix non-foldable expression other than PythonUDF.

33441a3

cloud-fan reviewed Jul 23, 2019

viirya changed the title ~~[SPARK-28441][SQL][Python] Fix error when PythonUDF is used in correlated scalar subquery~~ [SPARK-28441][SQL][Python] Fix error when non-foldable expression is used in correlated scalar subquery Jul 23, 2019

cloud-fan reviewed Jul 23, 2019

viirya force-pushed the SPARK-28441 branch from 118d6f3 to b2a947e Compare July 23, 2019 14:57

cloud-fan reviewed Jul 23, 2019

Add a common method.

110a39e

viirya force-pushed the SPARK-28441 branch from b2a947e to 110a39e Compare July 23, 2019 14:59

cloud-fan reviewed Jul 23, 2019

viirya added 3 commits July 24, 2019 23:13

Rename method and variable.

a7803f5

Use Expression instead of Option[Expression].

0158d85

Try to only remove attribute-only projects in removeProjectBeforeFilter.

2dd29c1

cloud-fan reviewed Jul 25, 2019

mgaido91 reviewed Jul 25, 2019

cloud-fan reviewed Jul 25, 2019

viirya added 2 commits July 25, 2019 17:57

Cleanup.

9aea844

Better to check result.

1f6b717

cloud-fan reviewed Jul 25, 2019

Fix style.

d7d023d

cloud-fan reviewed Jul 26, 2019

cloud-fan mentioned this pull request Jul 26, 2019

[SPARK-19712][SQL] Move subquery rewrite to beginning of optimizer #25258

Closed

Fix few styles.

fd29677

cloud-fan approved these changes Jul 26, 2019

cloud-fan closed this in 558dd23 Jul 27, 2019

HyukjinKwon reviewed Jul 27, 2019

dongjoon-hyun mentioned this pull request Jul 28, 2019

[SPARK-28277][SQL][PYTHON][TESTS][FOLLOW-UP] Re-enable commented out test #25278

Closed

maropu mentioned this pull request Aug 1, 2019

[SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" #24938

Closed

viirya deleted the SPARK-28441 branch December 27, 2023 18:22
