-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-45088][PYTHON][CONNECT] Make getitem work with duplicated columns
#42828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
48558ce to
a0b97c6
Compare
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise
a0b97c6 to
ac4642d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also cc @cloud-fan for the usage of GetColumnByOrdinal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a question. Is this a relevant change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not related, it is a not-used import. since we are touching this file, what about also removing it btw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we delete this? Is this comment required still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, will remove it.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM (except two minor questions)
ac4642d to
368fafa
Compare
368fafa to
fa78a77
Compare
|
merged to master, thanks @dongjoon-hyun and @HyukjinKwon for review |
|
|
||
| # accepted type and values | ||
| for index in [False, True, 0, 1, 2, -1, -2, -3]: | ||
| df[index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really a bad API. df.col can be ambiguous as people may use the column reference far away from the dataframe, e.g. df1.join(df2).select...filter...select(df1.col). We recommend users use qualified unresolved column instead, like col("t1.col"). Now df[index] is even worse as it only makes sense to use it immediately in current df's transformation.
Why do we add such an API? To support order by ordinal, we can just order by integer literals. The SQL parser also parses ORDER BY 1, 2 as ordering by integer literal 1 and 2, and analyzer will properly resolve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If df[index] is already in pyspark for a while, I think it's fine to treat it as a shortcut of df.i_th_col. We shouldn't use GetColumnByOrdinal in this case, as it was added for Dataset Tuple encoding and it's guaranteed that we want to get the column from the direct child of the current plan node. But here, we can't guarantee this, as people can do df1.select..filter...groupBy...select(df1[index])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df[index] has been supported since spark 2.0.0.
To support df.groupBy(1, 2, 3) and df.orderBy(1, 2, 3), right now GetColumnByOrdinal is only used in the direct child internally.
The SQL parser also parses ORDER BY 1, 2 as ordering by integer literal 1 and 2, and analyzer will properly resolve it.
Do you mean use should directly SortOrder(UnresolvedOrdinal(index)) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have offline discussion with wenchen, will fix it by switching to SortOrder(Literal(index)). Will fix it next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scala> val df = Seq((2, 1), (1, 2)).toDF("a", "b")
val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.show()
+---+---+
| a| b|
+---+---+
| 2| 1|
| 1| 2|
+---+---+
scala> df.orderBy(lit(1)).show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 1|
+---+---+
scala> df.groupBy(lit(1)).agg(first(col("a")), max(col("b"))).show()
+---+--------+------+
| 1|first(a)|max(b)|
+---+--------+------+
| 1| 2| 2|
+---+--------+------+
it seems orderBy(lit(1)) directly works, while groupBy(lit(1)) needs some investigation.
Let me revert this PR first
…cated column" ### What changes were proposed in this pull request? This reverts commit 73d3c49. ### Why are the changes needed? to address #42828 (comment) and #43115 (comment), should not use `GetColumnByOrdinal` in this case. Need to find another approach, but let's revert it first. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #43211 from zhengruifeng/revert_SPARK_45088. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
…cated column" ### What changes were proposed in this pull request? This reverts commit 73d3c49. ### Why are the changes needed? to address apache#42828 (comment) and apache#43115 (comment), should not use `GetColumnByOrdinal` in this case. Need to find another approach, but let's revert it first. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#43211 from zhengruifeng/revert_SPARK_45088. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
What changes were proposed in this pull request?
getitemwork with duplicated columnsDisallow bool type indexDisallow negative indexWhy are the changes needed?
1, SQL feature OrderBy ordinal works with duplicated columns
To support it in DataFame APIs, we need to make
getitemwork with duplicated columns2 & 3: should be unintentionalDoes this PR introduce any user-facing change?
YES
1, Make
getitemwork with duplicated columnsbefore
after
How was this patch tested?
added UTs
Was this patch authored or co-authored using generative AI tooling?
NO