[SPARK-34794][SQL] Fix nested transform issue #31887

dmsolow · 2021-03-19T00:07:46Z

What changes were proposed in this pull request?

Increment a global counter and use the current value to name LambdaVariables created by higher order functions.

Why are the changes needed?

This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail: the inner lambda variable shadows the outer one, so references to the outer variable resolve to the inner value.

For this query:

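// Assumes a spark-shell session (active SparkSession `spark`, spark.implicits._ in scope).
import org.apache.spark.sql.{functions => f, Column}

val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")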

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)

This is the current (incorrect) output:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+

And this is the correct output after the fix:

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
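To make the shadowing mechanism concrete, here is a minimal sketch in plain Scala. It is a toy model and not Spark code: `ShadowingDemo`, `Expr`, `Var`, `Pair`, `Transform`, `transformFixed`, and `transformFresh` are all hypothetical names invented for this illustration, and `Transform` here both maps and flattens. The only assumption carried over from the description above is that lambda variables are looked up by name, so two nested lambdas that both use a hard-coded name collide, while a counter-based suffix keeps them apart.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Toy model (not Spark internals): variables are resolved by name in an environment.
object ShadowingDemo {
  sealed trait Expr
  final case class Var(name: String) extends Expr
  final case class Pair(left: Expr, right: Expr) extends Expr
  // Binds `varName` to each item in turn, evaluates `body`, and flattens the results.
  final case class Transform(items: Seq[String], varName: String, body: Expr) extends Expr

  def eval(expr: Expr, env: Map[String, String] = Map.empty): Seq[String] = expr match {
    case Var(name) => Seq(env(name))
    case Pair(l, r) => for (a <- eval(l, env); b <- eval(r, env)) yield s"{$a, $b}"
    case Transform(items, varName, body) =>
      // An inner binding with the same name overwrites the outer one in `env`.
      items.flatMap(item => eval(body, env + (varName -> item)))
  }

  // Hard-coded name: every lambda variable is called "x".
  def transformFixed(items: Seq[String])(f: Expr => Expr): Expr =
    Transform(items, "x", f(Var("x")))

  // Counter-based name: every lambda variable gets a unique suffix.
  private val counter = new AtomicInteger(0)
  def transformFresh(items: Seq[String])(f: Expr => Expr): Expr = {
    val name = s"x_${counter.incrementAndGet()}"
    Transform(items, name, f(Var(name)))
  }

  val numbers: Seq[String] = Seq("1", "2", "3")
  val letters: Seq[String] = Seq("a", "b", "c")

  val buggy: Expr =
    transformFixed(numbers)(number => transformFixed(letters)(letter => Pair(number, letter)))
  val fixed: Expr =
    transformFresh(numbers)(number => transformFresh(letters)(letter => Pair(number, letter)))

  def main(args: Array[String]): Unit = {
    println(eval(buggy).mkString(", ")) // {a, a}, {b, b}, {c, c}, repeated three times
    println(eval(fixed).mkString(", ")) // {1, a}, {1, b}, {1, c}, {2, a}, ...
  }
}
```

In the toy model the Scala closures capture distinct `Var` objects, but because both carry the name "x" the evaluator cannot tell them apart. That is the same shape as the incorrect output shown above.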

Does this PR introduce any user-facing change?

No

How was this patch tested?

I'm not sure how to test this, because the current tests don't call functions.transform directly.

@nvander1 @RussellSpitzer

nvander1 · 2021-03-19T01:49:13Z

Maybe you can make a test with an example of the shadowing to demonstrate how it works correctly now.

RussellSpitzer · 2021-03-19T01:57:09Z

I think your example from Slack would be perfect as a test, especially because of the divergent behavior between Spark SQL and the programmatic API.

dmsolow · 2021-03-19T03:25:11Z

@nvander1 test added

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

RussellSpitzer

small nit, looks good to me other than that

tanelk · 2021-04-05T13:15:02Z

pinging @maropu, @HyukjinKwon
This correctness related PR has gone unnoticed

maropu · 2021-04-06T01:00:53Z

ok to test

maropu · 2021-04-06T01:05:08Z

Does branch-3.0 have the same issue? Also, could you add more detail to the PR description? It should be self-contained for the commit log, I think. (It's okay to just copy and paste the example query from the JIRA ticket.)

maropu

Nice catch, @dmsolow !

maropu · 2021-04-06T01:05:59Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

    assert(ex3.getMessage.contains("cannot resolve 'a'"))
  }

+  test("nested transform (DSL)") {


nit: nested transform (DSL) -> SPARK-34794: nested transform

maropu · 2021-04-06T01:08:16Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

  }

+  // counter to ensure lambda variable names are unique
+  private val lambdaVarNameCounter = new AtomicInteger(0)


How about defining a companion object for UnresolvedNamedLambdaVariable like this?

object UnresolvedNamedLambdaVariable {
  // counter to ensure lambda variable names are unique
  private val lambdaVarNameCounter = new AtomicInteger(0)

  def apply(args: Seq[String]): UnresolvedNamedLambdaVariable = {
    // add a counter number in the suffix
  }
}
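A rough, self-contained sketch of that idea follows. It does not use Spark classes: `LambdaVar` is a hypothetical stand-in for `UnresolvedNamedLambdaVariable`, and `fresh` is an invented helper name. The sketch uses a separately named factory method rather than overriding `apply`, purely to keep the stand-in case class simple; that is a choice made for this example, not something the review settled on.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a named lambda variable; only the naming logic is modelled.
final case class LambdaVar(nameParts: Seq[String])

object LambdaVar {
  // Counter shared by all callers so that generated names never repeat.
  private val nextId = new AtomicInteger(0)

  // Appends a unique numeric suffix to the requested base name.
  def fresh(base: String): LambdaVar =
    LambdaVar(Seq(s"${base}_${nextId.getAndIncrement()}"))
}

object FreshNameDemo {
  // Generates names from several threads and returns how many distinct names were seen.
  def distinctNames(threads: Int, perThread: Int): Int = {
    val seen = ConcurrentHashMap.newKeySet[String]()
    val workers = (1 to threads).map { _ =>
      new Thread(() => (1 to perThread).foreach(_ => seen.add(LambdaVar.fresh("x").nameParts.head)))
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    seen.size
  }

  def main(args: Array[String]): Unit = {
    // Every generated name should be unique, even under concurrent use.
    println(distinctNames(threads = 4, perThread = 1000)) // 4000
  }
}
```

Keeping the counter in the companion object means every call site that builds a lambda variable shares one source of suffixes, and `AtomicInteger` keeps the suffixes unique when several threads build queries at once.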

maropu · 2021-04-06T01:10:25Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

+          flatten(
+            transform(
+              $"numbers",
+              (number: Column) => transform(


I think we need exhaustive tests for this issue, e.g., two/three argument cases and their combination cases.
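As an illustration of what such combination cases might look like, here is a hedged sketch that nests a one-argument lambda around two-argument lambdas. It assumes Spark 3.x with `spark-sql` on the classpath and uses the existing `transform` and `aggregate` functions from `org.apache.spark.sql.functions`; the object name `NestedLambdaCheck`, the method `run`, and the data are made up for this example and are not the tests that were added to `DataFrameFunctionsSuite`. The expected values are what correct scoping should produce.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{aggregate, lit, transform}

object NestedLambdaCheck {
  // Returns the results of two nested higher-order function calls on [1, 2, 3].
  def run(spark: SparkSession): (Seq[Int], Seq[Seq[Int]]) = {
    import spark.implicits._
    val df = Seq(Seq(1, 2, 3)).toDF("numbers")

    // One-argument outer lambda, two-argument inner lambda (aggregate's merge function).
    // For each n, computes the sum over m of m * n.
    val sums = df.select(
      transform(
        $"numbers",
        (n: Column) => aggregate($"numbers", lit(0), (acc: Column, m: Column) => acc + m * n)
      )
    ).head().getSeq[Int](0)

    // One-argument outer lambda, (element, index) inner lambda.
    // For each n, computes n * 10 + j for every inner index j.
    val grid = df.select(
      transform(
        $"numbers",
        (n: Column) => transform($"numbers", (m: Column, j: Column) => n * 10 + j)
      )
    ).head().getSeq[Seq[Int]](0)

    (sums, grid)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("nested-lambda-check").getOrCreate()
    try {
      val (sums, grid) = run(spark)
      assert(sums == Seq(6, 12, 18), s"unexpected sums: $sums")
      assert(grid == Seq(Seq(10, 11, 12), Seq(20, 21, 22), Seq(30, 31, 32)), s"unexpected grid: $grid")
    } finally {
      spark.stop()
    }
  }
}
```

Both cases reference the outer variable `n` from inside an inner lambda, which is the situation where a name collision between lambda variables would change the result.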

SparkQA · 2021-04-06T01:57:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41501/

SparkQA · 2021-04-06T01:57:26Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41501/

SparkQA · 2021-04-06T05:40:23Z

Test build #136924 has finished for PR 31887 at commit 8c2ad61.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-04-26T08:05:59Z

Kindly ping.

dmsolow · 2021-04-27T14:15:04Z

@maropu I won't be able to work on this until June 1, other duties are getting in the way.

HyukjinKwon · 2021-04-28T03:09:43Z

cc @ueshin too FYI

maropu · 2021-04-28T03:28:42Z

> @maropu I won't be able to work on this until June 1, other duties are getting in the way.

Ah, I see. Is it okay that I take this over?

dmsolow · 2021-04-28T12:14:59Z

> > @maropu I won't be able to work on this until June 1, other duties are getting in the way.
>
> Ah, I see. Is it okay that I take this over?

Go ahead.

…e functions

### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of #31887. Closes #31887.

### Why are the changes needed?

This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)

For this query:

```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```

This is the current (incorrect) output:

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```

And this is the correct output after fix:

```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: dmsolow <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit f550e03)
Signed-off-by: Takeshi Yamamuro <[email protected]>

…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of apache#31887. Closes apache#31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes apache#32424 from maropu/pr31887. 
Lead-authored-by: dsolow <[email protected]> Co-authored-by: Takeshi Yamamuro <[email protected]> Co-authored-by: dmsolow <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit 6df4ec0) Signed-off-by: Dongjoon Hyun <[email protected]>

Add AtomicInteger to make var names unique

a0a1cc8

github-actions bot added the SQL label Mar 19, 2021

added test to verify nested transform

c13d776

RussellSpitzer reviewed Mar 19, 2021

RussellSpitzer approved these changes Mar 19, 2021

fixed typo

8c2ad61

dmsolow changed the title ~~[WIP][SPARK-34794][SQL] fix nested transform issue~~ [SPARK-34794][SQL] fix nested transform issue Mar 22, 2021

maropu reviewed Apr 6, 2021

maropu changed the title ~~[SPARK-34794][SQL] fix nested transform issue~~ [SPARK-34794][SQL] Fix nested transform issue Apr 6, 2021

maropu mentioned this pull request May 3, 2021

[SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions #32424

Closed

maropu closed this in f550e03 May 5, 2021
