[SPARK-35580][SQL] Implement canonicalized method for HigherOrderFunction #32735

viirya · 2021-06-01T20:25:19Z

What changes were proposed in this pull request?

This patch implements the canonicalized method for HigherOrderFunction. It canonicalizes the name and ExprId of every NamedLambdaVariable. Both are unique per variable, so two otherwise identical higher-order functions never compare as equal by default. Canonicalizing them lets us check semantic equality between HigherOrderFunctions.
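To illustrate the idea, here is a minimal sketch on a toy expression tree. It does not use Spark's classes: Expr, LambdaVar, Transform, and Canonicalize are hypothetical names invented for this example, and the renumbering scheme (ids assigned by first appearance) is an assumption, not necessarily what the patch does.

```scala
// Toy model only. None of these types are Spark's; all names are hypothetical.
sealed trait Expr
final case class Lit(value: Int) extends Expr
final case class Col(name: String) extends Expr
final case class LambdaVar(name: String, exprId: Long) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class Lambda(body: Expr, args: Seq[LambdaVar]) extends Expr
final case class Transform(input: Expr, fn: Lambda) extends Expr

object Canonicalize {
  // Collect lambda variables depth-first, arguments before the body.
  private def lambdaVars(e: Expr): Seq[LambdaVar] = e match {
    case v: LambdaVar       => Seq(v)
    case Add(l, r)          => lambdaVars(l) ++ lambdaVars(r)
    case Lambda(body, args) => args ++ lambdaVars(body)
    case Transform(in, fn)  => lambdaVars(in) ++ lambdaVars(fn)
    case _                  => Seq.empty
  }

  def apply(e: Expr): Expr = {
    // Map each original exprId to a position-based id (assumed scheme).
    val ids: Map[Long, Long] = lambdaVars(e).map(_.exprId).distinct
      .zipWithIndex.map { case (id, i) => id -> i.toLong }.toMap

    // Replace both the name and the exprId with canonical values.
    def canonVar(v: LambdaVar): LambdaVar =
      LambdaVar(s"lambda_${ids(v.exprId)}", ids(v.exprId))

    def rewrite(x: Expr): Expr = x match {
      case v: LambdaVar       => canonVar(v)
      case Add(l, r)          => Add(rewrite(l), rewrite(r))
      case Lambda(body, args) => Lambda(rewrite(body), args.map(canonVar))
      case Transform(in, fn)  =>
        Transform(rewrite(in), Lambda(rewrite(fn.body), fn.args.map(canonVar)))
      case other              => other
    }
    rewrite(e)
  }

  // Two expressions are semantically equal if their canonical forms match.
  def semanticEquals(a: Expr, b: Expr): Boolean = apply(a) == apply(b)
}
```

With this sketch, two Transform trees that differ only in the lambda variable's name and id (for example x_20 with id 19041 versus x_21 with id 19042) are unequal as plain case classes but compare as equal through Canonicalize.semanticEquals.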

Why are the changes needed?

The default canonicalized method does not work for HigherOrderFunction, which prevents subexpression elimination from working for higher-order functions.

I manually checked the generated code for:

val df = Seq(Seq(1, 2, 3)).toDF("a")
df.select(transform($"a", x => x + 1), transform($"a", x => x + 1)).collect()

Below is the code generated by GenerateUnsafeProjection for transform(input[0, array<int>, true], lambdafunction((lambda x_20#19041 + 1), lambda x_20#19041, false)),transform(input[0, array<int>, true], lambdafunction((lambda x_21#19042 + 1), lambda x_21#19042, false)).

Before:

/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 028 */   public UnsafeRow apply(InternalRow i) {                                   
...
/* 034 */     Object obj_0 = ((Expression) references[0]).eval(i);                    
...
/* 062 */     Object obj_1 = ((Expression) references[1]).eval(i);                  
...
/* 093 */ }

After:

/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 031 */   public UnsafeRow apply(InternalRow i) {
...
/* 033 */     subExpr_0(i);                                                           
...
/* 086 */   private void subExpr_0(InternalRow i) {         
/* 087 */     Object obj_0 = ((Expression) references[0]).eval(i);           
/* 088 */     boolean isNull_0 = obj_0 == null;                                       
/* 089 */     ArrayData value_0 = null;
/* 090 */     if (!isNull_0) {
/* 091 */       value_0 = (ArrayData) obj_0;
/* 092 */     }
/* 093 */     subExprIsNull_0 = isNull_0;
/* 094 */     mutableStateArray_0[0] = value_0;
/* 095 */   }
/* 096 */
/* 097 */ }

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests and a manual check of the generated code.

SparkQA · 2021-06-01T21:35:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43702/

viirya · 2021-06-01T21:38:38Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala

+    assert(!sort1_1.semanticEquals(sort1_3))
+  }
+
+  test("semanticEquals between MapFilter") {


Note that I didn't add a semanticEquals test for every higher-order function, only the few you can see here. Adding all of them might be too verbose, but let me know if you'd prefer that.

SparkQA · 2021-06-01T22:07:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43702/

viirya · 2021-06-02T07:26:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

+        HashedRelation(rows, key, numRows.toInt, isNullAware = isNullAware)
      case None =>
-        HashedRelation(rows, canonicalized.key, isNullAware = isNullAware)
+        HashedRelation(rows, key, isNullAware = isNullAware)


I'm not sure why we use the canonicalized key here. We don't compare with it; we use the key to project key rows later.

SparkQA · 2021-06-02T09:16:26Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43729/

SparkQA · 2021-06-02T12:07:18Z

Test build #139206 has finished for PR 32735 at commit 4bc9d7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-06-02T16:26:08Z

cc @cloud-fan @maropu @dongjoon-hyun

cloud-fan · 2021-06-03T02:16:28Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    val argumentMap = functions.flatMap(_.collect {
+      case l: NamedLambdaVariable =>
+        currExprId += 1
+        l.name -> currExprId


Are you sure the name is unique? It looks safer to use l.exprId as the key.

After #32424, I think it should be unique, but I agree that l.exprId is safer.
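The concern about the map key can be sketched in isolation. The snippet below is a toy illustration, not the patch's code: ToyVar and KeyChoice are hypothetical names, and the duplicate-name input is a made-up case (per the comment above, names should be unique after #32424).

```scala
// Hypothetical stand-in for a lambda variable; not Spark's NamedLambdaVariable.
final case class ToyVar(name: String, exprId: Long)

object KeyChoice {
  // Keyed by name: two distinct variables sharing a name collapse into one entry.
  def idsByName(vars: Seq[ToyVar]): Map[String, Int] =
    vars.zipWithIndex.map { case (v, i) => v.name -> i }.toMap

  // Keyed by exprId: distinct variables stay distinct even if names repeat.
  def idsByExprId(vars: Seq[ToyVar]): Map[Long, Int] =
    vars.zipWithIndex.map { case (v, i) => v.exprId -> i }.toMap
}
```

If two variables ever did share a name, the name-keyed map would silently merge them, while the exprId-keyed map would not. That is the sense in which exprId is the safer key.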

SparkQA · 2021-06-03T04:47:12Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43787/

SparkQA · 2021-06-03T05:18:18Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43787/

SparkQA · 2021-06-03T07:22:52Z

Test build #139263 has finished for PR 32735 at commit a4e13b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-06-03T16:16:11Z

Thanks! Merging to master.

dongjoon-hyun · 2021-06-03T21:05:32Z

+1, late LGTM. Thank you, @viirya and @cloud-fan.

viirya · 2021-06-04T00:31:12Z

Thank you @dongjoon-hyun!

HyukjinKwon · 2021-06-04T03:31:15Z

LGTM2

Closes apache#32735 from viirya/higher-func-canonicalize.
Authored-by: Liang-Chi Hsieh
Signed-off-by: Liang-Chi Hsieh

Implement canonicalized for HigherOrderFunction.

bf97ef5

github-actions bot added the SQL label Jun 1, 2021

Do not use canonicalized to generate key.

4bc9d7c

cloud-fan reviewed Jun 3, 2021

Use exprId as key instead of name.

a4e13b2

cloud-fan approved these changes Jun 3, 2021

viirya closed this in 0342dcb Jun 3, 2021

viirya deleted the higher-func-canonicalize branch June 3, 2021 16:17
