@zhengruifeng
Contributor

What changes were proposed in this pull request?

PySpark's DataFrame.__getattr__ and DataFrame.__getitem__ invoke jc = self._jdf.apply(name) in the JVM, which resolves the column name and attaches the DataFrame ID via addDataFrameIdToCol to handle ambiguous columns; see:

def col(colName: String): Column = colName match {
  case "*" =>
    Column(ResolvedStar(queryExecution.analyzed.output))
  case _ =>
    if (sqlContext.conf.supportQuotedRegexColumnName) {
      colRegex(colName)
    } else {
      Column(addDataFrameIdToCol(resolve(colName)))
    }
}

In Connect, however, the output of DataFrame.__getattr__ and DataFrame.__getitem__ is not bound to the input DataFrame; it is just an UnresolvedAttribute.
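To make the difference concrete, here is a minimal pure-Python sketch, not Spark code. ToyDataFrame, ColumnRef, bound_col, unbound_col, and resolve are hypothetical names invented for illustration; they only model the idea that a reference carrying a DataFrame ID can be disambiguated in a join, while a bare name cannot.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_next_id = count()  # hypothetical stand-in for a global DataFrame ID counter


@dataclass(frozen=True)
class ColumnRef:
    name: str
    df_id: Optional[int] = None  # None models a bare UnresolvedAttribute


class ToyDataFrame:
    def __init__(self, columns):
        self.columns = list(columns)
        self.id = next(_next_id)

    def bound_col(self, name):
        # Models the classic path: the reference remembers its DataFrame.
        return ColumnRef(name, self.id)

    def unbound_col(self, name):
        # Models the Connect behavior described above: only the name survives.
        return ColumnRef(name)


def resolve(ref, left, right):
    """Return the IDs of the join sides that `ref` could refer to."""
    candidates = [df for df in (left, right) if ref.name in df.columns]
    if ref.df_id is not None:
        candidates = [df for df in candidates if df.id == ref.df_id]
    return [df.id for df in candidates]


left = ToyDataFrame(["name", "age"])
right = ToyDataFrame(["name", "height"])

bound = resolve(left.bound_col("name"), left, right)      # one candidate
unbound = resolve(left.unbound_col("name"), left, right)  # two candidates
```

In this toy model the bound reference resolves to exactly one side of the join, while the unbound one matches both sides, which is the ambiguity this PR is concerned with.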

This PR aims to fix this issue by switching to an implementation based on the DataFrame API.

Why are the changes needed?

For parity with regular PySpark.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Enabled doctests and added unit tests.

Contributor Author

@zhengruifeng zhengruifeng Jan 25, 2023

Currently, both a == a and a <=> a are taken into account.
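For readers unfamiliar with the two operators: in Spark SQL, <=> is null-safe equality, which PySpark exposes as Column.eqNullSafe. The code under review is not shown in this thread, so the following is only a pure-Python sketch of the two semantics, with None standing in for SQL NULL; sql_eq and sql_eq_null_safe are hypothetical helper names.

```python
def sql_eq(a, b):
    """SQL `=` semantics: the result is NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_eq_null_safe(a, b):
    """SQL `<=>` semantics: never NULL; two NULLs compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b
```

The two forms differ only when at least one side is NULL, which is why a self-comparison can be written either way.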

Contributor Author

@zhengruifeng zhengruifeng Jan 25, 2023

We must use DataFrame.join here to make sure the DataFrame ID does not change.

// A globally unique id of this Dataset.
private val id = Dataset.curId.getAndIncrement()
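A rough pure-Python analogue of that counter, as a sketch only: ToyDataset is a hypothetical class, and which Spark code paths actually construct a new Dataset is not shown in this thread. It illustrates why an ID assigned at construction time cannot survive re-wrapping the same plan in a new object.

```python
from itertools import count


class ToyDataset:
    _cur_id = count()  # models Dataset.curId

    def __init__(self, plan):
        self.plan = plan
        # Models curId.getAndIncrement(): every new instance gets a fresh ID.
        self.id = next(ToyDataset._cur_id)


df = ToyDataset("range(10)")
same_plan = ToyDataset("range(10)")  # identical plan, but a new object
```

Here df and same_plan describe the same plan yet carry different IDs, so a column reference tagged with df.id would no longer match same_plan.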

@zhengruifeng zhengruifeng changed the title [SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns in Join [SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 25, 2023
@zhengruifeng zhengruifeng changed the title [SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join [WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 25, 2023
@zhengruifeng zhengruifeng changed the title [WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join [SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 25, 2023
Contributor Author

After this PR, a Join with a JoinCondition will be eagerly analyzed.

@zhengruifeng zhengruifeng force-pushed the connect_ambiguous_cols branch from 33a3636 to 0d56751 January 25, 2023 10:52
@zhengruifeng
Contributor Author

Sorry, this PR can't fix chained operators like df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height); it needs more investigation.

@zhengruifeng zhengruifeng marked this pull request as draft January 27, 2023 06:46
@zhengruifeng zhengruifeng changed the title [SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join [WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 27, 2023
@cloud-fan
Contributor

Unless we introduce expression IDs in Connect, I don't think we can solve this problem.

@zhengruifeng
Contributor Author

Closing this in favor of #39925.

@zhengruifeng zhengruifeng deleted the connect_ambiguous_cols branch February 7, 2023 08:25