Conversation

@zhengruifeng (Contributor) commented Feb 7, 2023

### What changes were proposed in this pull request?

In Python Client:

- generate `plan_id` for each proto plan (it's up to the Client to guarantee the uniqueness); a minimal sketch follows this list
- attach `plan_id` to the column created by `DataFrame[col_name]` or `DataFrame.col_name`
- note that `F.col(col_name)` doesn't have a `plan_id`
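A minimal sketch of the per-client id generation, with illustrative names (`LogicalPlan` here is a stand-in, not the actual `pyspark.sql.connect` class):

```
# Sketch only: per-client plan_id generation. LogicalPlan is a stand-in,
# not the real pyspark.sql.connect class.
import itertools

class LogicalPlan:
    # One counter per client process: ids only need to be unique within
    # a single client, and the client alone guarantees that.
    _id_counter = itertools.count()

    def __init__(self) -> None:
        self._plan_id = next(LogicalPlan._id_counter)

p1, p2 = LogicalPlan(), LogicalPlan()
assert p1._plan_id != p2._plan_id  # every new plan gets a fresh id
```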

In Connect Planner:

- attach `plan_id` to `UnresolvedAttribute`s and `LogicalPlan`s via `TreeNodeTag`

In Analyzer:

- for an `UnresolvedAttribute` with `plan_id`, search for the matching node in the plan, and resolve it against the found node if possible (a toy model of the lookup follows)
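A toy model of that lookup, using illustrative names only (the real rule lives in the Scala Analyzer; `Node` and its fields are stand-ins for Catalyst plans):

```
# Toy model of plan_id-based resolution; Node is an illustrative
# stand-in for a Catalyst plan, not a real Spark class.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    plan_id: Optional[int]
    outputs: List[str]
    children: List["Node"] = field(default_factory=list)

def find_plan(node: Node, plan_id: int) -> Optional[Node]:
    """Depth-first search for the subplan tagged with plan_id."""
    if node.plan_id == plan_id:
        return node
    for child in node.children:
        found = find_plan(child, plan_id)
        if found is not None:
            return found
    return None

def resolve(name: str, plan_id: int, root: Node) -> str:
    target = find_plan(root, plan_id)
    if target is None:
        # Fail loudly instead of silently falling back (see the discussion below).
        raise ValueError(f"cannot find the plan with id {plan_id}")
    if name not in target.outputs:
        raise ValueError(f"missing attribute `{name}` in plan {plan_id}")
    return f"{name} resolved against plan {plan_id}"

# Example mirroring df1.select(df2.value): df2's plan is not under df1's plan.
df1_plan = Node(plan_id=1, outputs=["value"])
try:
    resolve("value", 2, df1_plan)
except ValueError as e:
    print(e)  # cannot find the plan with id 2
```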

**Out of scope:**

- resolve `self-join`
- add a `DetectAmbiguousSelfJoin`-like rule for detection

### Why are the changes needed?

Fixes a bug. Before this PR:

df1.join(df2, df1["value"] == df2["value"])  <- fails: cannot resolve `value`
df1.join(df2, df1["value"] == df2["value"]).select(df1.value)  <- fails: cannot resolve `value`
df1.select(df2.value)    <- should fail, but runs as `df1.select(df1.value)` and returns incorrect results
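For reference, a hedged setup that makes the snippets above runnable; the `sc://` address is an assumption:

```
# Assumed setup for the reproductions above; the remote address is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df1 = spark.createDataFrame([(1,), (2,)], ["value"])
df2 = spark.createDataFrame([(1,), (3,)], ["value"])
```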

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Added tests and enabled previously disabled tests.

@zhengruifeng (Contributor Author)

cc @cloud-fan

Comment on lines 719 to 717
Contributor

Why this refactoring in this PR?

Contributor

here and below?

Contributor

Another question: why is `plan` a better variable name than `rel`? Because it's a `Relation`, not a `Plan`?

Contributor Author

Both `plan` and `rel` are fine with me; I just want to make the naming consistent.

@zhengruifeng force-pushed the connect_plan_id branch 2 times, most recently from 1f9e2ee to 57631b0 (February 10, 2023 02:36)
Contributor

Shall we mention it's per-client global?

Contributor Author

sure

Contributor

Can this really happen?

Contributor Author

I don't think so.

I added this check just because of `self._plan`'s type:

`self._plan: Optional[plan.LogicalPlan]`

Contributor

Why is the plan optional? This is not related to this PR; we can address it later.

Contributor Author

I don't know why. Let's fix it later.

Contributor

Is this the only place returning `ColumnReference`?


Contributor

It would be great if we can reduce code duplication somehow.

Contributor

E.g. having a private version of `def col` which takes an extra plan id; a rough sketch follows.
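A rough sketch of that suggestion (illustrative names, not the actual `pyspark.sql.connect.functions` internals):

```
# Sketch only: a private col variant that carries a plan id.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnresolvedAttribute:  # stand-in for the proto expression node
    name: str
    plan_id: Optional[int] = None

def _col(name: str, plan_id: Optional[int] = None) -> UnresolvedAttribute:
    # Callers like DataFrame.__getattr__ would pass their own plan id.
    return UnresolvedAttribute(name, plan_id)

def col(name: str) -> UnresolvedAttribute:
    # The public F.col deliberately attaches no plan id.
    return _col(name)
```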

Contributor Author

nice, will update

Contributor

Let's add some comments to explain this special branch.

Contributor Author

sure

Contributor

If an attribute has a plan id, I think it should never be resolved to an outer reference.

Contributor Author

OK. I think we can narrow the case to `UnresolvedAttribute` -> `AttributeReference`.

@zhengruifeng force-pushed the connect_plan_id branch 2 times, most recently from 99fbd08 to e5f1d21 (February 10, 2023 07:16)
Contributor

We don't need to do this check. Even if the attribute reference is dangling, we should still use it and fail later. You can check the behavior of a normal DataFrame using `df1.select(df2.col)`.

Contributor Author

If we want to fail this case:

>>> df1 = spark.range(0, 10)
>>> df2 = spark.range(0, 10)
>>> df1
DataFrame[id: bigint]
>>> df1.select(df2.id)
DataFrame[id: bigint]
>>> df1.select(df2.id).show()
+---+                                                                           
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

I guess we should make it fail if the matching node can not be found?

Contributor

Yeah, and what's more:

df1 = .... columns [a, b, c]
df2 = df1.select("a")
df2.select(df1.b)

This should fail as well, with a missing-attribute error.
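A runnable version of that case, assuming an existing Connect session named `spark`:

```
# Assumes an existing Spark Connect session `spark`.
df1 = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])
df2 = df1.select("a")
df2.select(df1.b)  # should raise a missing-attribute analysis error after this PR
```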

Contributor Author

Yes, added a new test for it.

Contributor Author

https://github.com/apache/spark/blob/9fa9d4b93176dcdf5f1e3d7c883956dc3f554508/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3492-L3499

`commonNaturalJoinProcessing` resolves `Join` to `Project(Join)` and discards the plan id. I made this change to retain the plan id; otherwise the following case fails because the subplan can not be found:

left = spark.createDataFrame([Row(a=1)])
right = spark.createDataFrame([Row(a=1)])
df = left.join(right, on="a", how="left_outer")
df.withColumn("b", udf(lambda x: "x")(df.a))    <- can not resolve `df.a`
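The same reproduction in runnable form; the `sc://` address is an assumption:

```
# Runnable reproduction; the remote address is illustrative.
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

left = spark.createDataFrame([Row(a=1)])
right = spark.createDataFrame([Row(a=1)])
df = left.join(right, on="a", how="left_outer")

# Without retaining the plan id through the using-join rewrite,
# df.a could not be resolved here.
df.withColumn("b", udf(lambda x: "x")(df.a)).show()
```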

Contributor Author

Added tests for `df1.select(df2.col)`. cc @cloud-fan

Contributor Author

We need to fail here in order to fail invalid cases like `df1.select(df2.col)`.

Contributor

Should we reset the plan id to the new `Join` or the new `Project`?

Contributor Author

I'm not 100% sure about this.

It seems that setting the plan id to the new `Join` also works.

Contributor

Shall we update the code inside `commonNaturalJoinProcessing` and retain the id in `Join`? That seems more natural.

Contributor Author

got it

Contributor

This line can be removed; the next line covers it.

Contributor

What does the match do? Can we simply write `plan.resolve(u.nameParts, conf.resolver)`? It seems wrong to limit the result to `AttributeReference`, which rejects nested columns.

Contributor Author

got it, will update

zhengruifeng added a commit that referenced this pull request Feb 13, 2023
…lumns issue in `Join`

Closes #39925 from zhengruifeng/connect_plan_id.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
(cherry picked from commit 167bbca)
Signed-off-by: Ruifeng Zheng <[email protected]>
@zhengruifeng (Contributor Author)

Merged into master/3.4.

thank you all!

@zhengruifeng deleted the connect_plan_id branch February 13, 2023 08:26
hvanhovell added a commit that referenced this pull request Feb 24, 2023
### What changes were proposed in this pull request?
This is the scala version of #39925.

We introduce a plan_id that is used both for each plan created by the Scala client and for the columns created when calling `Dataframe.col(..)` and `Dataframe.apply(..)`. This way we can later properly resolve the columns created for a specific DataFrame.

### Why are the changes needed?
Joining columns created using `Dataframe.apply(...)` does not work when the column names are ambiguous. We should be able to figure out where a column comes from when it is created like this.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated golden files. Added test case to ClientE2ETestSuite.

Closes #40156 from hvanhovell/SPARK-41823.

Authored-by: Herman van Hovell <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
hvanhovell added a commit that referenced this pull request Feb 24, 2023 (cherry picked from commit 6a24330)
a0x8o added a commit to a0x8o/spark that referenced this pull request Feb 24, 2023
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023 (cherry picked from commit 167bbca)

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023 (cherry picked from commit 6a24330)
zhengruifeng added a commit that referenced this pull request Jul 27, 2023
…in the `PLAN_ID_TAG`

### What changes were proposed in this pull request?
Make rule `ExtractWindowExpressions` retain the `PLAN_ID_TAG`

### Why are the changes needed?
In #39925, we introduced a new mechanism to resolve an expression against a specified plan.

However, the plan ID might sometimes be discarded by analyzer rules, and then some expressions cannot be correctly resolved; this issue is the main blocker for Pandas API on Spark (PS) on Connect.

### Does this PR introduce _any_ user-facing change?
Yes, a lot of Pandas APIs are enabled.

### How was this patch tested?
Enabled UTs.

Closes #42086 from zhengruifeng/ps_connect_analyze_window.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
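For context, a hedged illustration of the affected pattern (the query and the remote address are illustrative, not taken from the PR):

```
# Illustrative query of the affected class: a DataFrame-qualified column used
# alongside a window expression; ExtractWindowExpressions rewrites the plan
# and must retain the plan id for df.v to resolve.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.range(10).withColumn("v", F.col("id") % 3)
w = Window.partitionBy("v").orderBy("id")
df.select(df.v, F.row_number().over(w)).show()
```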
itholic pushed a commit to itholic/spark that referenced this pull request Aug 1, 2023