[SPARK-38800][DOCS][PYTHON] Explicitly document the supported pandas version. #36095

Closed

itholic
Contributor

@itholic itholic commented Apr 7, 2022

What changes were proposed in this pull request?

This PR proposes to document the supported pandas version for pandas API on Spark.

Why are the changes needed?

Since pandas behaves differently from version to version, it would be better to explicitly document the supported pandas version so that users are not confused.

Does this PR introduce any user-facing change?

Yes, the supported pandas version is now mentioned on the PySpark API reference page, as shown below:

[Screenshot of the updated PySpark API reference page, taken 2022-04-08]

How was this patch tested?

The existing doc build should cover this.

@github-actions github-actions bot added the PYTHON label Apr 7, 2022
@@ -47,6 +47,8 @@ With this package, you can:
* Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
* Switch to pandas API and PySpark API contexts easily without any overhead.

Note that pandas has different behavior per its version, and pandas API on Spark tries to match the behavior of pandas 1.4.
Member

Let's add this to the https://spark.apache.org/docs/latest/api/python/reference/index.html page. Also, change "pandas 1.4" to "latest pandas release".

We should specify an explicit version only in the other branches.

Contributor Author

@itholic itholic Apr 7, 2022

Sure. Can I then create separate PRs for branch-3.2 and branch-3.3?

I believe that branch-3.2 supports pandas 1.3 and branch-3.3 supports pandas 1.4.

Member

Yikun#86

FYI, some tests fail with pandas 1.4.x. I haven't had enough time to find the cause of every failure yet; some of them are pandas bugs in 1.4.x (such as pandas-dev/pandas#46589). We have only tested 1.3.x in the GitHub Actions CI.
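One common way to keep a test suite green while upstream bugs like these are investigated is to skip the affected cases on the pandas releases that trigger them. The following is a minimal sketch of that idea, not the approach this project takes: `version_tuple`, `INSTALLED_PANDAS`, and the test class are hypothetical names, and the version string is a hard-coded placeholder so the example runs without pandas installed.

```python
import unittest
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, int]:
    """Parse a version string such as '1.4.2' into (major, minor)."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


# Placeholder for illustration; a real suite would read pandas.__version__.
INSTALLED_PANDAS = "1.4.2"


class SeriesBehaviorTest(unittest.TestCase):  # hypothetical test case
    @unittest.skipIf(
        version_tuple(INSTALLED_PANDAS) >= (1, 4),
        "affected by an upstream pandas 1.4.x bug",
    )
    def test_affected_by_upstream_bug(self):
        # Only reached when the installed pandas is older than 1.4.
        self.assertLess(version_tuple(INSTALLED_PANDAS), (1, 4))

    def test_unaffected(self):
        # Runs on every pandas version.
        self.assertEqual(version_tuple("1.3.5"), (1, 3))
```

Keeping the skip reason specific makes it easier to find and remove these guards once the upstream fix is released.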

Member

I created SPARK-38819 to track the issue of running Pandas on Spark with Pandas 1.4.x.

Contributor Author

@Yikun Thanks for the testing!! Let me help take a look.

Member

@Yikun Yikun left a comment

+1. On master, the docs should state the supported version as the latest pandas release.

@bjornjorgensen
Contributor

I use Gmail to check English. If you start a new email in Gmail and you paste the sentence in, you will see that a "the" is missing.

Note that pandas has different behavior per its version, and pandas API on Spark tries to match the behavior of the latest pandas release.
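A note like this can also be checked at runtime by comparing the installed pandas release with the series the documentation names. The following is a minimal sketch under stated assumptions: `major_minor` and `matches_target` are hypothetical helpers, not part of pandas or PySpark, and the target series is supplied by the caller.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '1.3.5' or '1.4'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def matches_target(installed: str, target: str) -> bool:
    """True when `installed` belongs to the same major.minor series as `target`."""
    return major_minor(installed) == major_minor(target)


# Illustrative values only; in practice `installed` would be pandas.__version__.
print(matches_target("1.3.5", "1.3"))  # True
print(matches_target("1.4.2", "1.3"))  # False
```

Comparing only the major and minor components keeps patch releases in the same series from triggering a mismatch.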

@itholic
Contributor Author

itholic commented Apr 7, 2022

Thanks, @bjornjorgensen, for the tip 👍 Just corrected.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

@itholic mind creating a PR for branch-3.3?

@itholic
Contributor Author

itholic commented Apr 8, 2022

> @itholic mind creating a PR for branch-3.3?

Just created #36114

dongjoon-hyun pushed a commit that referenced this pull request Apr 11, 2022
…ndas version

### What changes were proposed in this pull request?

This PR proposes to document the supported pandas version for pandas API on Spark.

#36095 is the corresponding PR for the master branch.

### Why are the changes needed?

Since pandas behaves differently from version to version, it would be better to explicitly document the supported pandas version so that users are not confused.

pandas API on Spark aims to match the behavior of pandas 1.3.

### Does this PR introduce _any_ user-facing change?

Yes, the supported pandas version is now mentioned on the PySpark API reference page.

### How was this patch tested?

The existing doc build should cover this.

Closes #36114 from itholic/SPARK-38800-3.3.

Authored-by: itholic <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@itholic itholic deleted the SPARK-38800/supported_pandas_version branch April 22, 2023 05:42