[WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join #39734

zhengruifeng · 2023-01-25T06:32:13Z

What changes were proposed in this pull request?

PySpark's DataFrame.__getattr__ and DataFrame.__getitem__ invoke jc = self._jdf.apply(name) in the JVM, which resolves the column name and attaches the DataFrame ID via addDataFrameIdToCol to handle ambiguous columns; see

spark/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Lines 1472 to 1481 in faedcd9

    
    def col(colName: String): Column = colName match {
      case "*" =>
        Column(ResolvedStar(queryExecution.analyzed.output))
      case _ =>
        if (sqlContext.conf.supportQuotedRegexColumnName) {
          colRegex(colName)
        } else {
          Column(addDataFrameIdToCol(resolve(colName)))
        }
    }

But in Connect, the output of DataFrame.__getattr__ and DataFrame.__getitem__ is not bound to the input DataFrame; it is just an UnresolvedAttribute.

This PR aims to fix this issue by switching to the DataFrame API based implementation.
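
To illustrate the difference described above, here is a toy Python model. It is not Spark code: ColumnRef, ToyFrame, col_tagged, col_untagged and resolve are hypothetical names, and the resolution rule is a deliberate simplification of what the analyzer does. It only shows why a reference that carries the id of its source frame can be bound in a join where both sides share a column name, while a name-only reference cannot.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnRef:
    """A column reference; frame_id=None models a name-only (unresolved) reference."""
    name: str
    frame_id: Optional[int] = None


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: an id plus its column names."""
    frame_id: int
    columns: Tuple[str, ...]

    def col_tagged(self, name: str) -> ColumnRef:
        # Models the classic path: check the name, then attach this frame's id.
        if name not in self.columns:
            raise KeyError(name)
        return ColumnRef(name, self.frame_id)

    def col_untagged(self, name: str) -> ColumnRef:
        # Models the Connect path described above: only the name is recorded.
        return ColumnRef(name)


def resolve(ref: ColumnRef, left: ToyFrame, right: ToyFrame) -> ToyFrame:
    """Return the single join side that `ref` binds to, or raise if it is ambiguous."""
    candidates = [f for f in (left, right) if ref.name in f.columns]
    if ref.frame_id is not None:
        candidates = [f for f in candidates if f.frame_id == ref.frame_id]
    if len(candidates) != 1:
        raise ValueError(f"cannot resolve {ref.name!r}: {len(candidates)} candidates")
    return candidates[0]


df = ToyFrame(1, ("name", "age"))
df2 = ToyFrame(2, ("name", "height"))

# The tagged reference binds to the frame it came from.
print(resolve(df.col_tagged("name"), df, df2).frame_id)  # prints 1

# The name-only reference matches both sides and is rejected.
try:
    resolve(df.col_untagged("name"), df, df2)
except ValueError as exc:
    print(exc)  # prints: cannot resolve 'name': 2 candidates
```

In this model the only thing that disambiguates df.name from df2.name is the id carried by the reference, which is the role the description attributes to addDataFrameIdToCol.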

Why are the changes needed?

For parity with the existing PySpark behavior.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Enabled doctests and added unit tests.

zhengruifeng · 2023-01-25T06:33:29Z

...connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala

Currently, both a == a and a <=> a are taken into account.

zhengruifeng · 2023-01-25T06:37:34Z

...connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala

We must use DataFrame.join here to make sure the DataFrame ID is not changed.

spark/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Lines 202 to 203 in faedcd9

    // A globally unique id of this Dataset.
    private val id = Dataset.curId.getAndIncrement()
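
A toy Python sketch of why this matters, under the assumption that every newly constructed Dataset draws a fresh value from the shared counter, as the snippet above suggests. ToyDataset, col and rebuilt are hypothetical names, and itertools.count stands in for Dataset.curId. A reference tagged with the id of one dataset no longer matches an equivalent dataset that was rebuilt later.

```python
import itertools


class ToyDataset:
    """Hypothetical stand-in for Dataset; only the id bookkeeping is modelled."""

    # Plays the role of Dataset.curId: a process-wide counter.
    _cur_id = itertools.count()

    def __init__(self, columns):
        self.id = next(ToyDataset._cur_id)
        self.columns = tuple(columns)

    def col(self, name):
        # Tag the reference with the id of the dataset it was taken from.
        return (self.id, name)

    def rebuilt(self):
        # Rebuilding an equivalent dataset still allocates a fresh id.
        return ToyDataset(self.columns)


df = ToyDataset(["name", "age"])
ref = df.col("name")
copy = df.rebuilt()

print(ref[0] == df.id)    # True: the tag matches the original dataset
print(ref[0] == copy.id)  # False: the rebuilt dataset has a different id
```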

zhengruifeng · 2023-01-25T07:03:26Z

...ct/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala

After this PR, a Join with a JoinCondition will be eagerly analyzed.

zhengruifeng · 2023-01-27T04:39:43Z

@HyukjinKwon @cloud-fan @grundprinzip

zhengruifeng · 2023-01-27T06:46:30Z

Sorry, this PR can't fix chained operators like df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height); it needs more investigation.

cloud-fan · 2023-01-29T14:52:34Z

Unless we introduce expr IDs in Connect, I don't think we can solve this problem.

zhengruifeng · 2023-02-07T08:25:48Z

Closing this in favor of #39925.

github-actions bot added CONNECT CORE PYTHON SQL labels Jan 25, 2023

zhengruifeng changed the title ~~[SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns in Join~~ [SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 25, 2023

zhengruifeng changed the title ~~[SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join~~ [WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 25, 2023

zhengruifeng changed the title ~~[WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join~~ [SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 25, 2023

zhengruifeng added 2 commits January 25, 2023 18:52

init

a87b6f0

fix scala test

0d56751

zhengruifeng force-pushed the connect_ambiguous_cols branch from 33a3636 to 0d56751 Compare January 25, 2023 10:52

zhengruifeng marked this pull request as draft January 27, 2023 06:46

zhengruifeng changed the title ~~[SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join~~ [WIP][SPARK-41812][SPARK-41823][CONNECT][PYTHON] Fix ambiguous columns issue in Join Jan 27, 2023

zhengruifeng closed this Feb 7, 2023

zhengruifeng deleted the connect_ambiguous_cols branch February 7, 2023 08:25
