[SPARK-32796][SQL] Make withField API support nested struct in array #29645
    case q: LogicalPlan =>
      q.transformExpressions {
        case expr if !expr.childrenResolved => expr
        case e: UnresolvedWithFields => WithFields(e.col, e.fieldName, e.expr)
This could be moved into a more appropriate existing rule. I'm not sure which one fits best, so I've made it a standalone rule for now.

Test build #128278 has finished for PR 29645 at commit

retest this please

    test("withField should add field to struct of array") {
      checkAnswerAndSchema(
        arrayLevel1.withColumn("a", 'a.withField("d", lit(4))),
I personally prefer an explicit call such as

I agree with @HyukjinKwon that it's better to be more explicit.

Test build #128288 has finished for PR 29645 at commit

@cloud-fan @HyukjinKwon Thanks for the comments. Let me get more ideas from you. This is syntactic sugar to more easily express complex nested

Force-pushed from 27f37dc to 39edb0a.

An example looks like:

    private lazy val arrayType = ArrayType(
      StructType(Seq(
        StructField("a", IntegerType, nullable = false),
        StructField("b", IntegerType, nullable = true),
        StructField("c", IntegerType, nullable = false))),
      containsNull = true)

    private lazy val arrayStructArrayLevel1: DataFrame = spark.createDataFrame(
      sparkContext.parallelize(Row(Array(Row(Array(Row(1, null, 3)), null, 3))) :: Nil),
      StructType(
        Seq(StructField("a", ArrayType(
          StructType(Seq(
            StructField("a", arrayType, nullable = false),
            StructField("b", IntegerType, nullable = true),
            StructField("c", IntegerType, nullable = false))),
          containsNull = false)))))

The data looks like:

In order to replace deeply nested

Currently by using

    arrayStructArrayLevel1.withColumn("a",
      transform($"a", _.withField("a",
        flatten(transform($"a.a", transform(_, _.withField("b", lit(2))))))))

Using modified

    arrayStructArrayLevel1.withColumn("a", $"a".withField("a.b", lit(2)))

It could significantly simplify how we add/replace deeply nested fields.
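
To make the semantics under discussion concrete, here is a minimal sketch in plain Scala with no Spark dependency. Everything in it is hypothetical: the `Value` types, `WithFieldModel`, and `WithFieldDemo` are illustration-only names, not Spark classes, and the rules it encodes (arrays are mapped element-wise, a dotted path descends through structs, a null target stays null) are an assumption about what the modified API would do, not the PR's actual implementation.

```scala
// Toy model of nested values. These types are hypothetical, not Spark classes.
sealed trait Value
final case class IntV(value: Int) extends Value
case object NullV extends Value
final case class StructV(fields: Vector[(String, Value)]) extends Value
final case class ArrayV(elements: Vector[Value]) extends Value

object WithFieldModel {
  // Add or replace one top-level field of a struct, preserving field order.
  private def setField(s: StructV, name: String, v: Value): StructV = {
    val idx = s.fields.indexWhere(_._1 == name)
    if (idx >= 0) StructV(s.fields.updated(idx, name -> v))
    else StructV(s.fields :+ (name -> v))
  }

  // Assumed semantics: arrays are mapped element-wise, a path descends
  // through struct fields, and a null target is returned unchanged.
  def withField(target: Value, path: List[String], v: Value): Value =
    (target, path) match {
      case (ArrayV(es), _) => ArrayV(es.map(withField(_, path, v)))
      case (s: StructV, name :: Nil) => setField(s, name, v)
      case (s: StructV, name :: rest) =>
        s.fields.find(_._1 == name) match {
          case Some((_, child)) => setField(s, name, withField(child, rest, v))
          case None => throw new IllegalArgumentException(s"No such field: $name")
        }
      case (NullV, _) => NullV
      case (other, _) =>
        throw new IllegalArgumentException(s"Cannot set a field on $other")
    }
}

object WithFieldDemo {
  // Mirrors the single row in the example: Array(Row(Array(Row(1, null, 3)), null, 3)).
  val inner: StructV = StructV(Vector("a" -> IntV(1), "b" -> NullV, "c" -> IntV(3)))
  val columnA: Value = ArrayV(Vector(
    StructV(Vector("a" -> ArrayV(Vector(inner)), "b" -> NullV, "c" -> IntV(3)))))

  // Counterpart of the path "a.b" in the modified call: set inner field b to 2.
  val updated: Value = WithFieldModel.withField(columnA, List("a", "b"), IntV(2))
}
```

In this model the path `a.b` rewrites only the innermost `b`; the outer struct's own `b` field is left as it was. The recursion also shows why the semantics get harder to state once arrays can nest arbitrarily, which is the trade-off raised in the review.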

Test build #128313 has finished for PR 29645 at commit

Test build #128314 has finished for PR 29645 at commit

We can save more code by supporting array of array of struct. It's a trade-off between "clear and simple semantics" and "flexibility to support various input types".

@cloud-fan Sorry if I misread your comment. Do you mean we should support array of array of struct?

I mean we should prefer "clear and simple semantics"; otherwise people can always ask for more flexibility to save more code, such as supporting array of array of struct.

Okay, I see. That also makes sense to me. This is a hard trade-off between simplicity and flexibility. I will close this now. If we need this flexibility in the future, we can revisit it.

Could you revisit this?

Does

Unfortunately it doesn't. I'm using Spark through the pyspark library

@HyukjinKwon shall we add the

There is a

Do you mean the

Yes. Never mind then, I retract my request.

What changes were proposed in this pull request?
This patch adds nested struct support to the Column.withField API.

Why are the changes needed?
Currently Column.withField only supports StructType; it does not support structs nested inside an ArrayType. Supporting nested structs in arrays makes the API more general and useful.

Does this PR introduce any user-facing change?
Yes. It adds nested struct support to the Column.withField API.

How was this patch tested?
Unit tests.