-
Notifications
You must be signed in to change notification settings - Fork 29k
[WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join #39734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently, both a == a and a <=> a are taken into account
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we must use DataFrame.join here to make sure the DataFrame ID will not be changed.
spark/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
Lines 202 to 203 in faedcd9
| // A globally unique id of this Dataset. | |
| private val id = Dataset.curId.getAndIncrement() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after this PR, Join with JoinCondition will be eagerly analyzed.
33a3636 to
0d56751
Compare
|
sorry, this PR can't fix chained operators like |
|
unless we introduce expr id in connect, I don't think we can solve this problem. |
|
close this in favor of #39925 |
What changes were proposed in this pull request?
PySpark's
DataFrame.__getattr__andDataFrame.__getitem__invokesjc = self._jdf.apply(name)in JVM, which resolve the column name and attach the dataframe id viaaddDataFrameIdToColto handle ambiguous columns , seespark/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
Lines 1472 to 1481 in faedcd9
But in Connect, the output of
DataFrame.__getattr__andDataFrame.__getitem__is not bound to the inputDataFrame, it is just anUnresolvedAttribute.This PR aims to fix this issue by switching to the DataFrame API based implementation.
Why are the changes needed?
for parity
Does this PR introduce any user-facing change?
yes
How was this patch tested?
enabled doctests and added UT