-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28441][SQL][Python] Fix error when non-foldable expression is used in correlated scalar subquery #25204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #107915 has finished for PR 25204 at commit
|
|
Looks making sense to me from a cursory look. I will take a closer look if this doesn't get merged or reviewed. cc @cloud-fan too. |
| } | ||
| Option(rewrittenExpr.eval()) | ||
| if (rewrittenExpr.find(_.isInstanceOf[PythonUDF]).isDefined) { | ||
| // SPARK-28441: `PythonUDF` can't be statically evaluated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
many expressions can't be statically evaluated, why only special-case python udf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue was found by PythonUDF. I think of covering all unevaluable expressions here, but not sure if it is too aggressive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should cover all unevaluable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you define "can't be statically evaluated"? Do you mean !expr.foldable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PythonUDF is Unevaluable. So you can't call eval on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here it fakes the empty input case, and evaluate the expressions in subquery. So it doesn't require foldable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't call Expression.eval(null) if it's not foldable, otherwise exception may be thrown:
AttributeReference.eval(null)fails with NPENondeterministic.eval(null)fails because it needs to be initialized first
Whatever hack we use, I'd expect it makes the expression foldable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 1, AttributeReference was replaced with pre-evaluated value, if it comes from aggregate function. It uses default value. It fakes empty input case. Or null, if it is not.
For 2, I think it is potential issue.
Yeah, here the hack looks like foldable expression. It simulates empty input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems necessary to me to check foldable before calling .eval(), otherwise there is no guarantee that .eval() can success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Use foldable here to check.
| } | ||
| } | ||
| Option(rewrittenExpr.eval()) | ||
| if (!rewrittenExpr.foldable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we apply the check in more places? evalAggOnZeroTups also calls eval() directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. this is not possible for PythonUDF, but it is potential for other not foldable expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it is not covered by added test. Let me add test for it...
| .asInstanceOf[Boolean] | ||
| if (exprResult) bindings else Map.empty | ||
| } | ||
| evalPlan(child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we evaluate the filter condition?
|
Test build #108005 has finished for PR 25204 at commit
|
| Option(rewrittenExpr.eval()) | ||
|
|
||
| // Removes Alias over given expression, because Alias is not foldable. | ||
| if (!removeAlias(rewrittenExpr).foldable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like we can move the following code into a common method?
| } else { | ||
| val exprVal = rewrittenExpr.eval() | ||
| if (exprVal == null) { | ||
| None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why we need to return None here instead of a null literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it uses None to make checking bindings easier.
In other way, to use null literal, Option[Expression] can be changed to Expression in methods like evalSubqueryOnZeroTups, evalPlan. Then we check bindings by literal instead of None. Good thing is we can write Literal.create(rewrittenExpr.eval(), expr.dataType), instead of checking null. Looks like just a choice problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tryEvalExpr?
| bindings | ||
| } else { | ||
| val bindExpr = bindingExpr(condition, bindings) | ||
| .getOrElse(Literal.create(false, BooleanType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For filter condition, null is the same as false. This is one place that makes me think bindingExpr should return Option[Expression].
If this is the only place, I think it's simpler to always return expression, and handle null especially here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I may try this way tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this works. Looks good as it's simple.
| * This replaces original expression id used in attributes and aliases in expression. | ||
| */ | ||
| private def replaceOldExprId( | ||
| orgExprId: ExprId, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
orgExprId -> oldExprId ?
| // We replace original expression id with a new one. The added Alias column | ||
| // must use expr id of original output. If we don't replace old expr id in the | ||
| // query, the added Project in potential Project-Filter-Project can be removed | ||
| // by removeProjectBeforeFilter in ColumnPruning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we fix removeProjectBeforeFilter to only remove attribute-only projects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth trying, right now not sure if any other thing will be affected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried locally. Added subquery tests are passed. We can see if Jenkins passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Seems fine. Jenkins passes.
|
Test build #108059 has finished for PR 25204 at commit
|
|
Test build #108057 has finished for PR 25204 at commit
|
|
Test build #108109 has finished for PR 25204 at commit
|
|
Test build #108112 has finished for PR 25204 at commit
|
| /** | ||
| * This replaces original expression id used in attributes and aliases in expression. | ||
| */ | ||
| private def replaceOldExprId( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this.
| Row(1) :: Row(1) ::Row(null) :: Row(null) :: Row(6) :: Nil) | ||
| } | ||
|
|
||
| test("SPARK-28441: COUNT bug in subquery in subquery in subquery with non-foldable expr") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| if p2.outputSet.subsetOf(child.outputSet) => | ||
| if p2.outputSet.subsetOf(child.outputSet) && | ||
| // We only remove attribute-only project. | ||
| p2.projectList.forall(_.isInstanceOf[AttributeReference]) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this change. This may cause serious perf regression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we remove project that's not attribute-only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it was wrong previously, but if a project's output has same expr IDs with its child, it's usually attribute-only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmmmh... I may be missing something, but I'd imagine a case like this:
select a, b from
(select a, b, very_expensive_operation as c from ... where a = 1)
Before this change, would be optimized as:
select a, b from
(select a, b from ... where a = 1)
while after it is not. Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In above case, it has a Alias in project list, so it's not an attribute-only project. And I think it also create new attr c, so p2.outputSet.subsetOf(child.outputSet) is not met too.
I think the rules in ColumnPruning will trim very_expensive_operation in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now, sorry. Why do we need this? Seems an unrelated change to the fix in this PR, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, the issue was seen in previous comment 33441a3. It was overwritten now.
We added a column for count bug. The column checks a always-true leading column alwaysTrueExpr, returns special value if alwaysTrueExpr is null, to simulate empty input case.
This column reuses expr id of original output in the subquery. In non-foldable expression case, the added column in a potential Project-Filter-Project, will be trimmed by removeProjectBeforeFilter, because the second project meets p2.outputSet.subsetOf(child.outputSet).
My original fix is to create an expr id. Replace original expr id with new one in the subquery. Looks complicated. This seems a simple fix, and looks reasonable.
| newExpression.asInstanceOf[E] | ||
| } | ||
|
|
||
| private def removeAlias(expr: Expression): Expression = expr match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if there are several aliases? Shall we use CleanupAliases instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We track expressions from aggregate expressions as root. I think aliases should be continuous on top. Using CleanupAliases is also good, at least we don't need adding new method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sorry, this is recursive too, but I think it is good to avoid a new method. Thanks.
| checkAnswer( | ||
| sql("select l.a from l where " + | ||
| "(select case when udf(count(*)) = 1 then null else udf(count(*)) end as cnt " + | ||
| "from r where l.a = r.c) = 0"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use multi-line string to write long SQL? Let's also upper case the keywords.
| """ | ||
| |select l.b, (select (r.c + udf(count(*))) is null | ||
| |from r | ||
| |where l.a = r.c group by r.c) from l |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's format the SQL in a more readable way. For this particular example
select
l.b,
(
select (r.c + udf(count(*))) is null
from r
where l.a = r.c
group by r.c
)
from l
|
the fix looks good, some comments about the tests. Thanks for catching and fixing this nasty bug! |
| registerTestUDF(pythonTestUDF, spark) | ||
|
|
||
| checkAnswer( | ||
| sql("""SELECT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: AFAIK the multi-line string should be written as
"""
|line1
|line2
"""
not
"""line1
|line2
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
|
Test build #108165 has finished for PR 25204 at commit
|
|
Test build #108166 has finished for PR 25204 at commit
|
|
Test build #108175 has finished for PR 25204 at commit
|
| |FROM l WHERE | ||
| | ( | ||
| | SELECT udf(count(*)) + udf(sum(r.d) | ||
| | ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentation is wrong, it should be
WHERE
(
SELECT udf(count(*)) + udf(sum(r.d))
FROM r WHERE l.a = r.c
) = 0
| val df = sql(""" | ||
| |SELECT | ||
| | l.a | ||
| | FROM l WHERE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no indentation here.
| sql(""" | ||
| |SELECT l.a FROM l | ||
| |WHERE ( | ||
| | SELECT cntPlusOne + 1 AS cntPlusTwo FROM ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we be consistent with 2 space indentation?
|
@cloud-fan thanks for identifying the style issue. Fixed. |
|
Test build #108215 has finished for PR 25204 at commit
|
|
thanks, merging to master! |
|
thanks @cloud-fan @HyukjinKwon @mgaido91 |
| import IntegratedUDFTestUtils._ | ||
|
|
||
| val pythonTestUDF = TestPythonUDF(name = "udf") | ||
| registerTestUDF(pythonTestUDF, spark) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, we should add assume(shouldTestPythonUDFs). Maybe it's not a biggie in general but it can matter in other venders' testing base. For instance, if somebody launches a test in a minimal docker image, it might make the tests failed suddenly.
This skipping stuff isn't completely new in our test base. See TestUtils.testCommandAvailable for instance.
|
@huaxingao, can you make a followup of SPARK-28277 to re-enable? spark/sql/core/src/test/resources/sql-tests/inputs/udf/udf-except.sql Lines 47 to 59 in cd676e9
|
|
LGTM too |
What changes were proposed in this pull request?
In SPARK-15370, We checked the expression at the root of the correlated subquery, in order to fix count bug. If a
PythonUDFin in the checking path, evaluating it causes the failure as we can't statically evaluatePythonUDF. The Python UDF test added at SPARK-28277 shows this issue.If we can statically evaluate the expression, we intercept NULL values coming from the outer join and replace them with the value that the subquery's expression like before, if it is not, we replace them with the
PythonUDFexpression, with statically evaluated parameters.After this, the last query in
udf-except.sqlwhich throwsjava.lang.UnsupportedOperationExceptioncan be run:Note that this issue is also for other non-foldable expressions, like rand. As like PythonUDF, we can't call
evalon this kind of expressions in optimization. The evaluation needs to defer to query runtime.How was this patch tested?
Added tests.