@dmsolow dmsolow commented Mar 19, 2021

What changes were proposed in this pull request?

Increment a global counter and use the current value to name LambdaVariables created by higher order functions.

Why are the changes needed?

This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail: the inner lambda variable shadows the outer one.

For this query:

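// Assumes a spark-shell session, where `spark.implicits._` (for toDF and $"...") is already imported.
import org.apache.spark.sql.Column
import org.apache.spark.sql.{functions => f}

val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")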

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)

This is the current (incorrect) output:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+

And this is the correct output after fix:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
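
The shadowing described above can be reproduced without Spark. The following is only a toy model, not Spark's implementation: `Var`, `Pair`, `MapOver`, and `ShadowingDemo` are hypothetical names invented for this sketch, and `MapOver` merely stands in for `transform`. It assumes lambda variables are looked up by name in an environment, so two lambdas that share a name cannot both be visible in the inner body.

```scala
// Toy model only: these types are hypothetical and are not Spark classes.
sealed trait Expr
case class Var(name: String) extends Expr                 // reference to a lambda variable by name
case class Pair(left: Expr, right: Expr) extends Expr     // stands in for struct(a, b)
case class MapOver(items: Seq[Any], param: String, body: Expr) extends Expr // stands in for transform

object ShadowingDemo {
  // Evaluate an expression; `env` maps lambda variable names to their current values.
  def eval(e: Expr, env: Map[String, Any]): Any = e match {
    case Var(n)               => env(n)
    case Pair(l, r)           => (eval(l, env), eval(r, env))
    case MapOver(items, p, b) => items.map(item => eval(b, env + (p -> item)))
  }

  val numbers: Seq[Any] = Seq(1, 2, 3)
  val letters: Seq[Any] = Seq("a", "b", "c")

  // Both lambdas use the same hard-coded name, so the inner binding replaces the outer one.
  val shadowed: Expr =
    MapOver(numbers, "x", MapOver(letters, "x", Pair(Var("x"), Var("x"))))

  // Distinct names keep the outer and inner bindings apart.
  val unique: Expr =
    MapOver(numbers, "x_0", MapOver(letters, "x_1", Pair(Var("x_0"), Var("x_1"))))

  // Flatten the nested result, like flatten(...) in the query above.
  def flatten(result: Any): Seq[Any] = result.asInstanceOf[Seq[Seq[Any]]].flatten
}
```

Evaluating `shadowed` in this model yields pairs of letters only, mirroring the incorrect output, while `unique` yields every number paired with every letter.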

Does this PR introduce any user-facing change?

No

How was this patch tested?

I'm not sure how to test this, because the current tests don't call functions.transform directly.

@nvander1 @RussellSpitzer

@nvander1
Contributor

Maybe you can make a test with an example of the shadowing to demonstrate how it works correctly now.

@RussellSpitzer
Member

I think your example from Slack would be perfect as a test, especially because of the divergent behavior between Spark SQL and the programmatic API.

@github-actions github-actions bot added the SQL label Mar 19, 2021
@dmsolow
Author

dmsolow commented Mar 19, 2021

@nvander1 test added

Member

@RussellSpitzer RussellSpitzer left a comment

small nit, looks good to me other than that

@dmsolow dmsolow changed the title [WIP][SPARK-34794][SQL] fix nested transform issue [SPARK-34794][SQL] fix nested transform issue Mar 22, 2021
@tanelk
Contributor

tanelk commented Apr 5, 2021

Pinging @maropu, @HyukjinKwon.
This correctness-related PR has gone unnoticed.

@maropu
Member

maropu commented Apr 6, 2021

ok to test

@maropu
Member

maropu commented Apr 6, 2021

Does branch-3.0 have the same issue? Also, could you describe the change in more detail in the PR description? It should be self-contained for commit logs, I think (it's okay to just copy and paste the example query from the JIRA).

Member

@maropu maropu left a comment

Nice catch, @dmsolow !

assert(ex3.getMessage.contains("cannot resolve 'a'"))
}

test("nested transform (DSL)") {
Member

nit: nested transform (DSL) -> SPARK-34794: nested transform

}

// counter to ensure lambda variable names are unique
private val lambdaVarNameCounter = new AtomicInteger(0)
Member

How about defining a companion object for UnresolvedNamedLambdaVariable like this?


object UnresolvedNamedLambdaVariable {

  // counter to ensure lambda variable names are unique
  private val lambdaVarNameCounter = new AtomicInteger(0)

  def apply(args: Seq[String]): UnresolvedNamedLambdaVariable = {
    // add a counter number in the suffix
  }
}
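
One way the suggestion above might be filled in is sketched below. It is a minimal, self-contained illustration under assumptions: `LambdaVar`, `freshVarName`, and `fresh` are hypothetical names chosen for this sketch and are not taken from the Spark code base, and the `_<n>` suffix format is only one possible choice.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the lambda variable class; only the naming logic is illustrated.
case class LambdaVar(nameParts: Seq[String])

object LambdaVar {
  // Process-wide counter shared by all callers, so no two generated names collide.
  private val nextVarNameId = new AtomicInteger(0)

  // Append the next counter value to the requested base name.
  def freshVarName(name: String): String =
    s"${name}_${nextVarNameId.getAndIncrement()}"

  // Build a variable whose single name part is guaranteed to be unique.
  def fresh(name: String): LambdaVar = LambdaVar(Seq(freshVarName(name)))
}
```

Using `AtomicInteger` keeps the counter safe when plans are built from several threads. Because the suffix depends on how many variables were created earlier in the same JVM, tests written against such a scheme should compare results rather than hard-code generated names.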

flatten(
transform(
$"numbers",
(number: Column) => transform(
Member

I think we need exhaustive tests for this issue, e.g., two/three argument cases and their combination cases.

@maropu maropu changed the title [SPARK-34794][SQL] fix nested transform issue [SPARK-34794][SQL] Fix nested transform issue Apr 6, 2021
@SparkQA

SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41501/

@SparkQA

SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41501/

@SparkQA

SparkQA commented Apr 6, 2021

Test build #136924 has finished for PR 31887 at commit 8c2ad61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 26, 2021

Kindly ping.

@dmsolow
Author

dmsolow commented Apr 27, 2021

@maropu I won't be able to work on this until June 1, other duties are getting in the way.

@HyukjinKwon
Member

cc @ueshin too FYI

@maropu
Member

maropu commented Apr 28, 2021

> @maropu I won't be able to work on this until June 1, other duties are getting in the way.

Ah, I see. Is it okay that I take this over?

@dmsolow
Author

dmsolow commented Apr 28, 2021

> > @maropu I won't be able to work on this until June 1, other duties are getting in the way.
>
> Ah, I see. Is it okay that I take this over?

Go ahead.

@maropu maropu closed this in f550e03 May 5, 2021
maropu added a commit that referenced this pull request May 5, 2021
…e functions

### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is a rework of #31887. Closes #31887.

### Why are the changes needed?

This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail: the inner lambda variable shadows the outer one.

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: dmsolow <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro <[email protected]>
maropu added a commit that referenced this pull request May 5, 2021
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
(cherry picked from commit 6df4ec0)
Signed-off-by: Dongjoon Hyun <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
7 participants