Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Sep 4, 2020

What changes were proposed in this pull request?

This patch adds nested struct support to Column.withField API.

Why are the changes needed?

Currently Column.withField only supports StructType. For nested struct in ArrayType, it doesn't support. We can support nested struct in array to make the API more general and useful.

Does this PR introduce any user-facing change?

Yes. Adding nested struct support to Column.withField API.

How was this patch tested?

Unit tests.

case q: LogicalPlan =>
q.transformExpressions {
case expr if !expr.childrenResolved => expr
case e: UnresolvedWithFields => WithFields(e.col, e.fieldName, e.expr)
Copy link
Member Author

@viirya viirya Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved to other proper rule. Just not sure which one is good, so put as an individual rule first.

@SparkQA
Copy link

SparkQA commented Sep 4, 2020

Test build #128278 has finished for PR 29645 at commit 643728d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedWithFields(

@HyukjinKwon
Copy link
Member

retest this please


test("withField should add field to struct of array") {
checkAnswerAndSchema(
arrayLevel1.withColumn("a", 'a.withField("d", lit(4))),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya, BTW, is it same as:

transform($"struct_col", _.withField("new_col", col))

?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 4, 2020

I personally more prefer an explicit call such as transform($"struct_col", _.withField("new_col", col)) as it's more explicit but I got that this can be sort of similar with the nested column access in SQL like a.b supports the arrays of structs. No strong opinion on this. WDYT @cloud-fan.

@cloud-fan
Copy link
Contributor

I agree with @HyukjinKwon that it's better to be more explicit. a.b is a bad example, as it arbitrarily decides to support only one level, array of array of struct is not supported. I prefer to not make the same mistake in withField.

@SparkQA
Copy link

SparkQA commented Sep 4, 2020

Test build #128288 has finished for PR 29645 at commit 643728d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedWithFields(

@viirya
Copy link
Member Author

viirya commented Sep 4, 2020

@cloud-fan @HyukjinKwon Thanks for comment. So let me get more ideas from you. This is like a syntax sugar to more easily express complex nested withField chain. I think we can explicitly expression it by using transform + withField, but as it goes deeply more, the syntax becomes hard to express and read. This doesn't create new syntax other than a.b used to access nested field in array, but just leverages existing one.

@viirya
Copy link
Member Author

viirya commented Sep 5, 2020

An example looks like:

private lazy val arrayType = ArrayType(
    StructType(Seq(
      StructField("a", IntegerType, nullable = false),
      StructField("b", IntegerType, nullable = true),
      StructField("c", IntegerType, nullable = false))),
    containsNull = true)

private lazy val arrayStructArrayLevel1: DataFrame = spark.createDataFrame(
    sparkContext.parallelize(Row(Array(Row(Array(Row(1, null, 3)), null, 3))) :: Nil),
    StructType(
      Seq(StructField("a", ArrayType(
        StructType(Seq(
          StructField("a", arrayType, nullable = false),
          StructField("b", IntegerType, nullable = true),
          StructField("c", IntegerType, nullable = false))),
        containsNull = false)))))

The data looks like:

+---------------------------+
|a                          |
+---------------------------+
|[{[{1, null, 3}], null, 3}]|
+---------------------------+

In order to replace deeply nested b column, like:

+------------------------+
|a                       |
+------------------------+
|[{[{1, 2, 3}], null, 3}]|
+------------------------+

Currently by using transform + withField, we probably need the following expressions. It looks complicated and takes a while to write it.

arrayStructArrayLevel1.withColumn("a",
      transform($"a", _.withField("a",
        flatten(transform($"a.a", transform(_, _.withField("b", lit(2))))))))

Using modified withField, we can do it like:

arrayStructArrayLevel1.withColumn("a", $"a".withField("a.b", lit(2)))

It could significantly simplify how we add/replace deeply nested fields.

@SparkQA
Copy link

SparkQA commented Sep 5, 2020

Test build #128313 has finished for PR 29645 at commit 27f37dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 5, 2020

Test build #128314 has finished for PR 29645 at commit 39edb0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

We can save more code by supporting array of array of struct. It's a trade-off between "clear and simple semantic" vs "flexiblity of supporting various input types".

@viirya
Copy link
Member Author

viirya commented Sep 7, 2020

@cloud-fan Sorry if I mis-read your comment. Do you mean we should support array of array of struct?

@cloud-fan
Copy link
Contributor

I mean we should prefer "clear and simple semantic", otherwise people can always ask to be more flexible and save more code, like supporting array of array of struct.

@viirya
Copy link
Member Author

viirya commented Sep 8, 2020

Okay, I see. It also makes sense to me. This is a hard trade-off between simplicity and flexibility. I will close this now. If we need this flexibility in the future, we can revisit it.

@encarvlucas
Copy link

Could you revisit this?

@cloud-fan
Copy link
Contributor

Does transform($"struct_col", _.withField("new_col", col)) work for you?

@encarvlucas
Copy link

Unfortunately it doesn't. I'm using spark through the pyspark library

@cloud-fan
Copy link
Contributor

@HyukjinKwon shall we add the transform API in pyspark?

@encarvlucas
Copy link

There is a transform implementation in pyspark, but it does not work with python UDFs. Might be related to this issue.
Also, there is no withField function in Spark SQL.

@cloud-fan
Copy link
Contributor

Do you mean the col in transform($"struct_col", _.withField("new_col", col)) is a python udf? I'm not sure if this PR can help due to the limitation of python udf execution...

@encarvlucas
Copy link

Yes. Nevermind then, I retract my request.

@viirya viirya deleted the with-field-array branch December 27, 2023 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants