Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Jun 1, 2021

What changes were proposed in this pull request?

This patch implements canonicalized method for HigherOrderFunction. Basically it canonicalizes the name of all NamedLambdaVariables and their ExprId. The name and ExprId of NamedLambdaVariable are unque. But to compare semantic equality between HigherOrderFunction, we can canonicalize them.

Why are the changes needed?

The default canonicalized method does not work for HigherOrderFunction. It makes subexpression elimination not work for higher functions.

Manual check gen-ed code for:

val df = Seq(Seq(1, 2, 3)).toDF("a")
df.select(transform($"a", x => x + 1), transform($"a", x => x + 1)).collect()

The code for transform(input[0, array<int>, true], lambdafunction((lambda x_20#19041 + 1), lambda x_20#19041, false)),transform(input[0, array<int>, true], lambdafunction((lambda x_21#19042 + 1), lambda x_21#19042, false)), generated by GenerateUnsafeProjection.

Before:

/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 028 */   public UnsafeRow apply(InternalRow i) {                                   
...
/* 034 */     Object obj_0 = ((Expression) references[0]).eval(i);                    
...
/* 062 */     Object obj_1 = ((Expression) references[1]).eval(i);                  
...
/* 093 */ }  

After:

/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 031 */   public UnsafeRow apply(InternalRow i) {
...
/* 033 */     subExpr_0(i);                                                           
...
/* 086 */   private void subExpr_0(InternalRow i) {         
/* 087 */     Object obj_0 = ((Expression) references[0]).eval(i);           
/* 088 */     boolean isNull_0 = obj_0 == null;                                       
/* 089 */     ArrayData value_0 = null;
/* 090 */     if (!isNull_0) {
/* 091 */       value_0 = (ArrayData) obj_0;
/* 092 */     }
/* 093 */     subExprIsNull_0 = isNull_0;
/* 094 */     mutableStateArray_0[0] = value_0;
/* 095 */   }
/* 096 */
/* 097 */ }

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual check gen-ed code.

@github-actions github-actions bot added the SQL label Jun 1, 2021
@SparkQA
Copy link

SparkQA commented Jun 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43702/

assert(!sort1_1.semanticEquals(sort1_3))
}

test("semanticEquals between MapFilter") {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I don't add semanticEquals test for all higher functions, but just a few ones you can see. It might be too verbose for adding all, but let me know if you prefer to.

@SparkQA
Copy link

SparkQA commented Jun 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43702/

@SparkQA

This comment has been minimized.

Comment on lines +1137 to +1139
HashedRelation(rows, key, numRows.toInt, isNullAware = isNullAware)
case None =>
HashedRelation(rows, canonicalized.key, isNullAware = isNullAware)
HashedRelation(rows, key, isNullAware = isNullAware)
Copy link
Member Author

@viirya viirya Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we use canonicalized key here. We don't do comparison but use the key to project key rows later.

@SparkQA
Copy link

SparkQA commented Jun 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43729/

@SparkQA
Copy link

SparkQA commented Jun 2, 2021

Test build #139206 has finished for PR 32735 at commit 4bc9d7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 2, 2021

cc @cloud-fan @maropu @dongjoon-hyun

val argumentMap = functions.flatMap(_.collect {
case l: NamedLambdaVariable =>
currExprId += 1
l.name -> currExprId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure the name is unique? It looks safer to use l.exprId as the key.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #32424, I think it should be unique, but I agree that l.exprId is safer.

@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43787/

@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43787/

@SparkQA
Copy link

SparkQA commented Jun 3, 2021

Test build #139263 has finished for PR 32735 at commit a4e13b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 3, 2021

Thanks! Merging to master.

@viirya viirya closed this in 0342dcb Jun 3, 2021
@viirya viirya deleted the higher-func-canonicalize branch June 3, 2021 16:17
@dongjoon-hyun
Copy link
Member

+1, late LGTM. Thank you, @viirya and @cloud-fan .

@viirya
Copy link
Member Author

viirya commented Jun 4, 2021

Thank you @dongjoon-hyun!

@HyukjinKwon
Copy link
Member

LGTM2

Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Feb 22, 2022
…tion

### What changes were proposed in this pull request?

This patch implements `canonicalized` method for `HigherOrderFunction`. Basically it canonicalizes the name of all `NamedLambdaVariable`s and their `ExprId`. The name and `ExprId` of `NamedLambdaVariable` are unque. But to compare semantic equality between `HigherOrderFunction`, we can canonicalize them.

### Why are the changes needed?

The default `canonicalized` method does not work for `HigherOrderFunction`. It makes subexpression elimination not work for higher functions.

Manual check gen-ed code for:
```scala
val df = Seq(Seq(1, 2, 3)).toDF("a")
df.select(transform($"a", x => x + 1), transform($"a", x => x + 1)).collect()
```

The code for `transform(input[0, array<int>, true], lambdafunction((lambda x_20#19041 + 1), lambda x_20#19041, false)),transform(input[0, array<int>, true], lambdafunction((lambda x_21#19042 + 1), lambda x_21#19042, false))`, generated by `GenerateUnsafeProjection`.

Before:

```java
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 028 */   public UnsafeRow apply(InternalRow i) {
...
/* 034 */     Object obj_0 = ((Expression) references[0]).eval(i);
...
/* 062 */     Object obj_1 = ((Expression) references[1]).eval(i);
...
/* 093 */ }
```

After:
```java
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
...
/* 031 */   public UnsafeRow apply(InternalRow i) {
...
/* 033 */     subExpr_0(i);
...
/* 086 */   private void subExpr_0(InternalRow i) {
/* 087 */     Object obj_0 = ((Expression) references[0]).eval(i);
/* 088 */     boolean isNull_0 = obj_0 == null;
/* 089 */     ArrayData value_0 = null;
/* 090 */     if (!isNull_0) {
/* 091 */       value_0 = (ArrayData) obj_0;
/* 092 */     }
/* 093 */     subExprIsNull_0 = isNull_0;
/* 094 */     mutableStateArray_0[0] = value_0;
/* 095 */   }
/* 096 */
/* 097 */ }
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test and manual check gen-ed code.

Closes apache#32735 from viirya/higher-func-canonicalize.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants