-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-34794][SQL] Fix nested transform issue #31887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Maybe you can make a test with an example of the shadowing to demonstrate how it works correctly now. |
|
I think your example from slack would be prefect as a test. Especially because of the divergent behavior between Spark SQL and the programmatic api |
|
@nvander1 test added |
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit, looks good to me other than that
|
pinging @maropu, @HyukjinKwon |
|
ok to test |
|
|
maropu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, @dmsolow !
| assert(ex3.getMessage.contains("cannot resolve 'a'")) | ||
| } | ||
|
|
||
| test("nested transform (DSL)") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: nested transform (DSL) -> SPARK-34794: nested transform
| } | ||
|
|
||
| // counter to ensure lambda variable names are unique | ||
| private val lambdaVarNameCounter = new AtomicInteger(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about defining a companion object for UnresolvedNamedLambdaVariable like this?
object UnresolvedNamedLambdaVariable {
// counter to ensure lambda variable names are unique
private val lambdaVarNameCounter = new AtomicInteger(0)
def apply(args: Seq[String]): UnresolvedNamedLambdaVariable = {
// add a counter number in the suffix
}
}
| flatten( | ||
| transform( | ||
| $"numbers", | ||
| (number: Column) => transform( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need exhaustive tests for this issue, e.g., two/three argument cases and their combination cases.
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #136924 has finished for PR 31887 at commit
|
|
Kindly ping. |
|
@maropu I won't be able to work on this until June 1, other duties are getting in the way. |
|
cc @ueshin too FYI |
Ah, I see. Is it okay that I take this over? |
Go ahead. |
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of #31887. Closes #31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes #32424 from maropu/pr31887. Lead-authored-by: dsolow <[email protected]> Co-authored-by: Takeshi Yamamuro <[email protected]> Co-authored-by: dmsolow <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <[email protected]>
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of #31887. Closes #31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes #32424 from maropu/pr31887. Lead-authored-by: dsolow <[email protected]> Co-authored-by: Takeshi Yamamuro <[email protected]> Co-authored-by: dmsolow <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <[email protected]>
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of apache#31887. Closes apache#31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes apache#32424 from maropu/pr31887. Lead-authored-by: dsolow <[email protected]> Co-authored-by: Takeshi Yamamuro <[email protected]> Co-authored-by: dmsolow <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit 6df4ec0) Signed-off-by: Dongjoon Hyun <[email protected]>
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of apache#31887. Closes apache#31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes apache#32424 from maropu/pr31887. Lead-authored-by: dsolow <[email protected]> Co-authored-by: Takeshi Yamamuro <[email protected]> Co-authored-by: dmsolow <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <[email protected]>
What changes were proposed in this pull request?
Increment a global counter and use the current value to name LambdaVariables created by higher order functions.
Why are the changes needed?
This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)
For this query:
This is the current (incorrect) output:
And this is the correct output after fix:
Does this PR introduce any user-facing change?
No
How was this patch tested?
I'm not sure how to test this because the current tests don't run the
functions.transformdirectly@nvander1 @RussellSpitzer