Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Sep 6, 2023

What changes were proposed in this pull request?

  • Make `__getitem__` work with duplicated columns
  • Disallow `bool` type indexes
  • Disallow negative indexes
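The index rules above can be sketched in plain Python (a hypothetical helper, not the actual PySpark internals). Note that `bool` must be checked before `int`, because `bool` is a subclass of `int` in Python, so `True` would otherwise pass as the index `1`:

```python
def validate_column_index(index, num_columns):
    """Validate an integer column index per the rules proposed here (sketch).

    Rejects bools (a subclass of int in Python) and negative indexes.
    """
    # bool is a subclass of int, so it must be rejected first
    if isinstance(index, bool) or not isinstance(index, int):
        raise TypeError(f"index must be an int, got {type(index).__name__}")
    if index < 0:
        raise IndexError(f"negative index is not supported: {index}")
    if index >= num_columns:
        raise IndexError(f"index out of range: {index}")
    return index
```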

Why are the changes needed?

1. The SQL feature ORDER BY ordinal works with duplicated columns:

In [4]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")

In [5]: df.createOrReplaceTempView("v")

In [6]: spark.sql("SELECT * FROM v ORDER BY 1, 2").show()
+---+---+---+
|  a|  a|  a|
+---+---+---+
|  1|1.1|  a|
|  2|2.2|  b|
|  4|4.4|  c|
+---+---+---+

To support it in the DataFrame APIs, we need to make `__getitem__` work with duplicated columns.

2 & 3: accepting `bool` and negative indexes appears to be unintentional.

Does this PR introduce any user-facing change?

YES

1. Make `__getitem__` work with duplicated columns
Before:

In [1]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")

In [2]: df[0]
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
Cell In[2], line 1
----> 1 df[0]
...

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].

In [3]: df[1]
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
...

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].

In [4]: df.orderBy(1, 2).show()
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
Cell In[7], line 1
----> 1 df.orderBy(1, 2).show()

...

AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].

After:

In [1]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")

In [2]: df[0]
Out[2]: Column<'a'>

In [3]: df[1]
Out[3]: Column<'a'>

In [4]: df.orderBy(1, 2).show()
+---+---+---+
|  a|  a|  a|
+---+---+---+
|  1|1.1|  a|
|  2|2.2|  b|
|  4|4.4|  c|
+---+---+---+
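The resolved-by-position behavior above can be mimicked in plain Python (a sketch of the semantics, not of PySpark's implementation; `order_by_ordinals` is a hypothetical helper): a positional key never consults column names, so duplicated names cannot become ambiguous.

```python
# Rows from the example table TAB(a, a, a); the names are duplicated on purpose.
columns = ["a", "a", "a"]
rows = [(2, 2.2, "b"), (4, 4.4, "c"), (1, 1.1, "a")]

def order_by_ordinals(rows, *ordinals):
    """Sort rows by 0-based column positions (sketch of ORDER BY ordinal)."""
    return sorted(rows, key=lambda row: tuple(row[i] for i in ordinals))

# Sorting by positions 0 and 1 works even though every column is named "a".
ordered = order_by_ordinals(rows, 0, 1)
```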

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

NO

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM otherwise

Contributor Author

Also cc @cloud-fan for the usage of `GetColumnByOrdinal`.

Member

Just a question. Is this a relevant change?

Contributor Author

It is not related; it is an unused import. Since we are touching this file anyway, what about also removing it?

Member

Why don't we delete this? Is this comment required still?

Contributor Author

nice, will remove it.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (except two minor questions)

@zhengruifeng
Contributor Author

Merged to master. Thanks @dongjoon-hyun and @HyukjinKwon for the review.


# accepted type and values
for index in [False, True, 0, 1, 2, -1, -2, -3]:
    df[index]
Contributor

@cloud-fan cloud-fan Sep 27, 2023

This is really a bad API. `df.col` can be ambiguous, as people may use the column reference far away from the DataFrame, e.g. `df1.join(df2).select...filter...select(df1.col)`. We recommend that users use a qualified unresolved column instead, like `col("t1.col")`. Now `df[index]` is even worse, as it only makes sense when used immediately in the current `df`'s transformation.

Why do we add such an API? To support order by ordinal, we can just order by integer literals. The SQL parser also parses `ORDER BY 1, 2` as ordering by the integer literals 1 and 2, and the analyzer will properly resolve them.

cc @HyukjinKwon @zhengruifeng

Contributor

If `df[index]` has already been in PySpark for a while, I think it's fine to treat it as a shortcut for the i-th column of `df`. We shouldn't use `GetColumnByOrdinal` in this case, though: it was added for Dataset Tuple encoding, where it is guaranteed that we want the column from the direct child of the current plan node. Here we can't guarantee that, as people can do `df1.select..filter...groupBy...select(df1[index])`.

Contributor Author

6183b5e

`df[index]` has been supported since Spark 2.0.0.

To support `df.groupBy(1, 2, 3)` and `df.orderBy(1, 2, 3)`, `GetColumnByOrdinal` is right now only applied to the direct child internally.

> The SQL parser also parses ORDER BY 1, 2 as ordering by integer literal 1 and 2, and analyzer will properly resolve it.

Do you mean we should directly use `SortOrder(UnresolvedOrdinal(index))`?

Contributor Author

Had an offline discussion with Wenchen; will fix it by switching to `SortOrder(Literal(index))`. Will fix it next week.

Contributor Author

scala> val df = Seq((2, 1), (1, 2)).toDF("a", "b")
val df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.show()
+---+---+
|  a|  b|
+---+---+
|  2|  1|
|  1|  2|
+---+---+


scala> df.orderBy(lit(1)).show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  2|  1|
+---+---+


scala> df.groupBy(lit(1)).agg(first(col("a")), max(col("b"))).show()
+---+--------+------+
|  1|first(a)|max(b)|
+---+--------+------+
|  1|       2|     2|
+---+--------+------+

It seems `orderBy(lit(1))` works directly, while `groupBy(lit(1))` needs some investigation.

Let me revert this PR first.
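The gap between the two results above can be illustrated in plain Python (a sketch of the observed semantics, not of Catalyst's resolution rules; `group_rows` is a hypothetical helper): grouping by a constant key collapses every row into a single group, which matches the `groupBy(lit(1))` output whose grouping column is the literal `1`, whereas ordinal semantics would group by the first column.

```python
from collections import defaultdict

rows = [(2, 1), (1, 2)]  # the (a, b) rows from the Scala session above

def group_rows(rows, key_fn):
    """Group rows under the key produced by key_fn (sketch)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)

# A constant key puts every row in one group, like groupBy(lit(1)) above.
by_constant = group_rows(rows, lambda row: 1)

# Ordinal semantics (group by the first column, a) would keep rows apart.
by_first_column = group_rows(rows, lambda row: row[0])
```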

zhengruifeng added a commit that referenced this pull request Oct 4, 2023
…cated column"

### What changes were proposed in this pull request?
This reverts commit 73d3c49.

### Why are the changes needed?
To address #42828 (comment) and #43115 (comment): we should not use `GetColumnByOrdinal` in this case.

Need to find another approach, but let's revert it first.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43211 from zhengruifeng/revert_SPARK_45088.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Oct 7, 2023
…cated column"

