[SPARK-41812][SPARK-41823][CONNECT][SQL][PYTHON] Resolve ambiguous columns issue in Join
#39925
Conversation
Force-pushed 77ca731 to a6fde92.
cc @cloud-fan
connector/connect/common/src/main/protobuf/spark/connect/relations.proto (outdated; resolved)
...connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala (outdated; resolved)
connector/connect/common/src/main/protobuf/spark/connect/relations.proto (outdated; resolved)
python/pyspark/sql/connect/plan.py (outdated)
why this refactoring in this pr?
here and below?
Another question is why plan is a better var name compared to rel? Because it's a Relation not a Plan?
both plan and rel are fine to me, I just want to make the naming consistent.
Force-pushed 1f9e2ee to 57631b0.
shall we mention it's per-client global?
sure
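The "per-client global" counter discussed above might look like the following sketch. This is illustrative only: the class name `PlanIdGenerator` and its shape are hypothetical, not the actual PySpark Connect implementation.

```python
import itertools
import threading

class PlanIdGenerator:
    """Hypothetical per-client plan-id source: each client owns one
    monotonically increasing counter, so ids are unique within a client,
    while two different clients may hand out the same numbers."""

    def __init__(self) -> None:
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def next_id(self) -> int:
        # itertools.count is not documented as thread-safe, so guard it.
        with self._lock:
            return next(self._counter)

# Each client gets its own generator, hence "per-client global".
client_a, client_b = PlanIdGenerator(), PlanIdGenerator()
ids_a = [client_a.next_id() for _ in range(3)]  # [0, 1, 2]
ids_b = [client_b.next_id() for _ in range(2)]  # [0, 1] -- independent of client A
```

Uniqueness only has to hold within one client, because the server tags plans per session; two clients reusing id 0 is harmless.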
can this really happen?
I don't think so; this check was added just because of `self._plan`'s type: `self._plan: Optional[plan.LogicalPlan]`
why is the plan optional? this is not related to this PR and we can address it later.
don't know why. Let's fix it later
is this the only place returning ColumnReference ?
functions.col also returns a ColumnReference, but without a plan_id
will be great if we can reduce code duplication somehow
like having a private version of def col which takes an extra plan id.
nice, will update
let's add some comments to explain this special branch.
sure
if an attribute has a plan id, I think it should never be resolved to an outer reference
ok. I think we can narrow the case to UnresolvedAttribute -> AttributeReference
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala (outdated; resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (outdated; resolved)
Force-pushed 99fbd08 to e5f1d21.
we don't need to do this check. Even if the attribute reference is dangling, we should still use it and fail later. You can check the behavior of normal dataframe using df1.select(df2.col)
if we want to fail this case:
```
>>> df1 = spark.range(0, 10)
>>> df2 = spark.range(0, 10)
>>> df1
DataFrame[id: bigint]
>>> df1.select(df2.id)
DataFrame[id: bigint]
>>> df1.select(df2.id).show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
```
I guess we should make it fail if the matching node can not be found?
yea, what's more:
```
df1 = ....  # columns [a, b, c]
df2 = df1.select("a")
df2.select(df1.b)
```
this should fail as well, with a missing attribute error.
yes, add a new test for it.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (outdated; resolved)
Force-pushed b9afcae to 4d79c6a.
Force-pushed 23d076e to 9fa9d4b.
commonNaturalJoinProcessing resolves Join to Project(Join) and discards the plan id. Make this change to hold the plan id; otherwise the following case will fail because the subplan can not be found:
```
left = spark.createDataFrame([Row(a=1)])
right = spark.createDataFrame([Row(a=1)])
df = left.join(right, on="a", how="left_outer")
df.withColumn("b", udf(lambda x: "x")(df.a))  # <- can not resolve `df.a`
```
add tests for df1.select(df2.col) cc @cloud-fan
we need to fail here in order to fail invalid cases like df1.select(df2.col)
should we reset the plan id to the new join or the new project?
I'm not 100% sure about this.
it seems that setting the plan id to new Join also works
shall we update the code inside commonNaturalJoinProcessing and retain the id in Join? That seems more natural.
got it
this line can be removed. the next line covers it.
what does the match do? can we simply write plan.resolve(u.nameParts, conf.resolver)? It seems wrong to limit the result to be AttributeReference, which rejects nested cols.
got it, will update
Force-pushed 7893c70 to 68968c0.
…lumns issue in `Join`

### What changes were proposed in this pull request?

In the Python client:
- generate a `plan_id` for each proto plan (it's up to the client to guarantee uniqueness);
- attach the `plan_id` to the column created by `DataFrame[col_name]` or `DataFrame.col_name`;
- note that `F.col(col_name)` doesn't have a `plan_id`.

In the Connect planner:
- attach the `plan_id` to `UnresolvedAttribute`s and `LogicalPlan`s via `TreeNodeTag`.

In the analyzer:
- for an `UnresolvedAttribute` with a `plan_id`, search for the matching node in the plan, and resolve it with the found node if possible.

**Out of scope:**
- resolving `self-join`;
- adding a `DetectAmbiguousSelfJoin`-like rule for detection.

### Why are the changes needed?

Fix a bug; before this PR:
```
df1.join(df2, df1["value"] == df2["value"])  <- fails: can not resolve `value`
df1.join(df2, df1["value"] == df2["value"]).select(df1.value)  <- fails: can not resolve `value`
df1.select(df2.value)  <- should fail, but runs as `df1.select(df1.value)` and returns incorrect results
```

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

added tests, enabled tests

Closes #39925 from zhengruifeng/connect_plan_id.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
(cherry picked from commit 167bbca)
Signed-off-by: Ruifeng Zheng <[email protected]>
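The analyzer-side resolution described above — find the subplan whose tag matches the attribute's `plan_id`, then resolve the column against that subplan, failing on a dangling id or a missing attribute — can be illustrated with a toy tree search. This is a pure-Python sketch under simplified assumptions (`PlanNode`, `resolve`, and string outputs are all invented here), not the Catalyst code.

```python
from typing import Optional

class PlanNode:
    """Toy logical-plan node carrying an optional plan_id tag."""
    def __init__(self, name, children=(), plan_id=None, output=()):
        self.name = name
        self.children = list(children)
        self.plan_id = plan_id          # stands in for the TreeNodeTag
        self.output = set(output)       # column names this node can resolve

def find_by_plan_id(root: PlanNode, plan_id: int) -> Optional[PlanNode]:
    # Depth-first search for the node tagged with the requested plan_id.
    if root.plan_id == plan_id:
        return root
    for child in root.children:
        found = find_by_plan_id(child, plan_id)
        if found is not None:
            return found
    return None

def resolve(root: PlanNode, column: str, plan_id: int) -> str:
    node = find_by_plan_id(root, plan_id)
    if node is None:
        # dangling plan_id, e.g. df1.select(df2.col) with unrelated plans
        raise ValueError(f"can not find the plan with plan_id={plan_id}")
    if column not in node.output:
        # matching node found, but it can't produce this column
        raise ValueError(f"missing attribute `{column}` in plan {node.name}")
    return f"{node.name}.{column}"

# df1 JOIN df2: the ambiguous `value` is disambiguated by plan_id.
df1 = PlanNode("df1", plan_id=1, output={"value"})
df2 = PlanNode("df2", plan_id=2, output={"value"})
join = PlanNode("join", children=[df1, df2])
assert resolve(join, "value", plan_id=2) == "df2.value"
```

This mirrors why `df1.join(df2, df1["value"] == df2["value"])` becomes resolvable — each side of the condition points at a distinct tagged subplan — while the invalid cases raise instead of silently picking the wrong column.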
merged into master/3.4, thank you all!
### What changes were proposed in this pull request?

This is the scala version of #39925. We introduce a plan_id that is used both for each plan created by the scala client, and by the columns created when calling `Dataframe.col(..)` and `Dataframe.apply(..)`. This way we can later properly resolve the columns created for a specific Dataframe.

### Why are the changes needed?

Joining columns created using `Dataframe.apply(...)` does not work when the column names are ambiguous. We should be able to figure out where a column comes from when it is created like this.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated golden files. Added a test case to ClientE2ETestSuite.

Closes #40156 from hvanhovell/SPARK-41823.

Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
…in the `PLAN_ID_TAG`

### What changes were proposed in this pull request?

Make the rule `ExtractWindowExpressions` retain the `PLAN_ID_TAG`.

### Why are the changes needed?

In #39925, we introduced a new mechanism to resolve an expression against a specified plan. However, sometimes the plan ID might be discarded by some analyzer rules, and then some expressions can not be correctly resolved; this issue is the main blocker of Pandas API on Spark (PS) on Connect.

### Does this PR introduce _any_ user-facing change?

yes, a lot of Pandas APIs enabled

### How was this patch tested?

Enabled UTs

Closes #42086 from zhengruifeng/ps_connect_analyze_window.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>