[SPARK-35580][SQL] Implement canonicalized method for HigherOrderFunction #32735
Kubernetes integration test starting
assert(!sort1_1.semanticEquals(sort1_3))
}

test("semanticEquals between MapFilter") {
Note that I didn't add a `semanticEquals` test for every higher-order function, only the few you can see here. Covering all of them seemed too verbose, but let me know if you would prefer that.
Kubernetes integration test status success
HashedRelation(rows, key, numRows.toInt, isNullAware = isNullAware)
case None =>
HashedRelation(rows, canonicalized.key, isNullAware = isNullAware)
HashedRelation(rows, key, isNullAware = isNullAware)
I'm not sure why we use the canonicalized key here. We don't do any comparison with it; we only use the key to project key rows later.
Kubernetes integration test status success
Test build #139206 has finished for PR 32735 at commit
val argumentMap = functions.flatMap(_.collect {
case l: NamedLambdaVariable =>
currExprId += 1
l.name -> currExprId
Are you sure the name is unique? It looks safer to use l.exprId as the key.
After #32424, I think it should be unique, but I agree that l.exprId is safer.
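To make the concern about the map key concrete, here is a minimal sketch in plain Scala. It does not use Spark's classes: `NamedVar`, `ArgumentMaps`, `byName`, and `byExprId` are hypothetical names chosen for illustration. It only shows what would happen to a name-keyed map if two lambda variables ever shared a name, which, per the comment above, is not expected after #32424.

```scala
// Hypothetical stand-in for a lambda variable; not Spark's NamedLambdaVariable.
case class NamedVar(name: String, exprId: Long)

object ArgumentMaps {
  // Key the positional index by variable name.
  // Duplicate names collapse into one entry (the last one wins in toMap).
  def byName(vars: Seq[NamedVar]): Map[String, Int] =
    vars.zipWithIndex.map { case (v, i) => v.name -> (i + 1) }.toMap

  // Key the positional index by expression id.
  // Distinct ids always produce distinct entries, even if names collide.
  def byExprId(vars: Seq[NamedVar]): Map[Long, Int] =
    vars.zipWithIndex.map { case (v, i) => v.exprId -> (i + 1) }.toMap
}
```

With two variables that share the name `x` but have different ids, `byName` yields a single entry while `byExprId` yields two, which is why keying by id is the more defensive choice.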
Kubernetes integration test starting
Kubernetes integration test status success
Test build #139263 has finished for PR 32735 at commit
Thanks! Merging to master.
+1, late LGTM. Thank you, @viirya and @cloud-fan.
Thank you @dongjoon-hyun!
LGTM2
…tion
### What changes were proposed in this pull request?
This patch implements the `canonicalized` method for `HigherOrderFunction`. It canonicalizes the name and `ExprId` of every `NamedLambdaVariable`. The name and `ExprId` of a `NamedLambdaVariable` are unique, so two otherwise identical `HigherOrderFunction`s cannot be compared for semantic equality until these are canonicalized.
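The idea can be sketched with a toy expression model. This is not the patch's code and does not use Catalyst classes: every type and method below (`Expr`, `LambdaVar`, `Transform`, `LambdaCanonicalizer`, and so on) is a hypothetical stand-in, and the sketch only handles flat lambda bodies. It shows how replacing each lambda variable's unique name and id with a positional one makes two structurally identical functions compare equal.

```scala
// Toy expression tree; hypothetical types, not Spark's Catalyst expressions.
sealed trait Expr
case class Column(name: String) extends Expr
case class Literal(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class LambdaVar(name: String, id: Long) extends Expr
case class Lambda(body: Expr, args: Seq[LambdaVar]) extends Expr
case class Transform(input: Expr, function: Lambda) extends Expr

object LambdaCanonicalizer {
  // Replace a variable by its positional form if it is one of the lambda's arguments.
  private def renameVar(v: LambdaVar, ids: Map[Long, Long]): LambdaVar =
    ids.get(v.id).map(n => LambdaVar(s"arg_$n", n)).getOrElse(v)

  // Rewrite variables inside a (flat) lambda body; other nodes are left as they are.
  private def rename(e: Expr, ids: Map[Long, Long]): Expr = e match {
    case v: LambdaVar => renameVar(v, ids)
    case Add(l, r)    => Add(rename(l, ids), rename(r, ids))
    case other        => other
  }

  // Map each argument's original id to its position, then rewrite body and arguments.
  def canonicalized(t: Transform): Transform = {
    val ids = t.function.args.zipWithIndex.map { case (v, i) => v.id -> i.toLong }.toMap
    Transform(t.input, Lambda(rename(t.function.body, ids), t.function.args.map(renameVar(_, ids))))
  }

  // Two transforms are semantically equal when their canonical forms are identical.
  def semanticEquals(a: Transform, b: Transform): Boolean =
    canonicalized(a) == canonicalized(b)
}
```

Under this model, two `Transform` values built from `x => x + 1` with variables `x_20#19041` and `x_21#19042` differ under plain `==` but are equal under `semanticEquals`, mirroring the two `transform` calls in the manual check.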
### Why are the changes needed?
The default `canonicalized` method does not work for `HigherOrderFunction`, which prevents subexpression elimination from working for higher-order functions.
Manual check of the generated code for:
```scala
val df = Seq(Seq(1, 2, 3)).toDF("a")
df.select(transform($"a", x => x + 1), transform($"a", x => x + 1)).collect()
```
The code was generated by `GenerateUnsafeProjection` for the expressions `transform(input[0, array<int>, true], lambdafunction((lambda x_20#19041 + 1), lambda x_20#19041, false)),transform(input[0, array<int>, true], lambdafunction((lambda x_21#19042 + 1), lambda x_21#19042, false))`.
Before:
```java
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 028 */ public UnsafeRow apply(InternalRow i) {
...
/* 034 */ Object obj_0 = ((Expression) references[0]).eval(i);
...
/* 062 */ Object obj_1 = ((Expression) references[1]).eval(i);
...
/* 093 */ }
```
After:
```java
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 031 */ public UnsafeRow apply(InternalRow i) {
...
/* 033 */ subExpr_0(i);
...
/* 086 */ private void subExpr_0(InternalRow i) {
/* 087 */ Object obj_0 = ((Expression) references[0]).eval(i);
/* 088 */ boolean isNull_0 = obj_0 == null;
/* 089 */ ArrayData value_0 = null;
/* 090 */ if (!isNull_0) {
/* 091 */ value_0 = (ArrayData) obj_0;
/* 092 */ }
/* 093 */ subExprIsNull_0 = isNull_0;
/* 094 */ mutableStateArray_0[0] = value_0;
/* 095 */ }
/* 096 */
/* 097 */ }
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests and a manual check of the generated code.
Closes apache#32735 from viirya/higher-func-canonicalize.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>