[SPARK-44120][PYTHON] Support Python 3.12 #43184
Conversation
Force-pushed from 96ad725 to f95f3bf.
Could you review this, @LuciferYang?
It seems that there are some side-effects on the installation. Let me check this first. I converted this PR to the …
Force-pushed from 42ab0c3 to 38436a4.
Hi, @HyukjinKwon. Could you review this PR?
Oh, actually we can't just add a required dependency easily. It works with pip install, but it wouldn't work when users download Spark from the official website (as they would have to manually install the packaging dependency on their nodes). In the case of Py4J, we already include it in our release, so it works.
We should either port packaging into our release, or find a workaround to avoid adding the required dependency.
I can take a look around (mid) next week too.
Yes, currently it should be installed manually, like NumPy.
However, it's independent of supporting Python 3.12 itself.
Can we do that porting after this?
I can proceed to port packaging separately tomorrow because the USA doesn't have Chuseok, @HyukjinKwon. :)
Could you review this PR, @viirya?
For the purpose of Python 3.12 support, this change looks okay.
Regarding the packaging issue: I'm wondering how we treat other dependencies like pyarrow, pandas, etc. They are required for certain functions, but I think they are not included in the distribution either. So users who download Spark from the official website are required to install these dependencies by themselves, right?
From this change, it looks like the usage of packaging is limited to the pandas and sql (connect- and pandas-related) modules. Can't we also list it as a required dependency for certain modules, just like how we treat pyarrow, pandas, etc.? Is it a generally required dependency like py4j? It seems not, judging from the pyspark changes here.
I suppose you can run PySpark from the downloaded distribution without packaging if you don't touch the connect and pandas functions. If so, it doesn't look like a special dependency like py4j, but more like pandas and pyarrow, which we list as conditionally required dependencies.
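A minimal sketch of the conditionally-required-dependency pattern described above, in the spirit of how optional packages like pandas and pyarrow are typically guarded; the helper name and error message are illustrative, not PySpark's actual API.
```python
import importlib


def require_optional_package(name: str, feature: str) -> None:
    # Illustrative only: fail with a clear message when an optional
    # dependency is missing, instead of failing deep inside an import chain.
    try:
        importlib.import_module(name)
    except ImportError as e:
        raise ImportError(
            f"{feature} requires the optional '{name}' package; "
            f"install it with 'pip install {name}'."
        ) from e


# Example: enforce the dependency only when the pandas-related path is used.
# require_optional_package("pandas", "pandas API on Spark")
```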
Thank you for the review, @viirya. You are right; we consider them optional.
Actually, Py4J is embedded in our release (https://github.com/apache/spark/blob/master/python/lib/py4j-0.10.9.7-src.zip). However, embedding is not a recommendation from the official Python community, so I didn't do that in this PR. I'll handle that usability issue as an independent JIRA after this PR.
From the error shown in the description, this is only triggered if pandas/pyarrow is installed/enabled. As you can see, it is under …
You are right; it should work in that way. However, the technical problem is that the Apache PySpark code doesn't have a proper conditional check in SparkSession. Here is the error message when neither the pandas nor the packaging package is installed.
Let me follow your advice, @viirya. I'm trying to add a conditional import on …
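A hedged sketch of the conditional-import direction mentioned above; the flag and function names are hypothetical and not the code actually committed.
```python
# Illustrative only: import 'packaging' lazily so that plain PySpark usage
# (no pandas/connect features) does not require it to be installed.
try:
    from packaging.version import Version as _Version

    _HAS_PACKAGING = True
except ImportError:
    _Version = None
    _HAS_PACKAGING = False


def _check_minimum_version(installed: str, required: str) -> bool:
    # Raise only when the version check is actually needed.
    if not _HAS_PACKAGING:
        raise ImportError(
            "The 'packaging' package is required for this feature; "
            "install it with 'pip install packaging'."
        )
    return _Version(installed) >= _Version(required)
```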
I'm on my phone, so please ignore this comment if it's wrong. If we use distutils only for LooseVersion, we could just have our own LooseVersion in PySpark (instead of embedding the package).
It also sounds good if LooseVersion is easy to port into Spark.
It's not difficult to copy or to reimplement, @HyukjinKwon and @viirya.
However, it's under the PSF License, which is different from Py4J or cloudpickle (under the BSD 3-Clause license):
Lines 460 to 462 in 58c24a5:
```
python/lib/py4j-*-src.zip
python/pyspark/cloudpickle.py
python/pyspark/join.py
```
I believe it's compatible, but we need to take a look once more before doing that, because our Apache Spark (up to 3.5.0) binary distribution doesn't include any Python Software Foundation licensed code yet.
We could just write our own instead of copying. It's just a version string check in the end. I have not seen their code yet, so I can write one on my own too :-).
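A minimal sketch of the "just a version string check" idea; this is only an illustration, not the class that was eventually added in #43192.
```python
import re


def _version_key(version: str):
    # Split into numeric and alphabetic chunks, e.g. "2.1.1rc1" ->
    # ["2", "1", "1", "rc", "1"], comparing numbers numerically.
    parts = re.findall(r"\d+|[a-zA-Z]+", version)
    return tuple((0, int(p)) if p.isdigit() else (1, p) for p in parts)


def loose_version_lt(a: str, b: str) -> bool:
    return _version_key(a) < _version_key(b)


assert loose_version_lt("1.26", "2.1.1")
assert not loose_version_lt("3.12.0", "3.9")
```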
According to the review comments, I made a separate PR, @HyukjinKwon and @viirya.
### What changes were proposed in this pull request?
This PR aims to remove `distutils` usage from Spark codebase.
**BEFORE**
```
$ git grep distutils | wc -l
38
```
**AFTER**
```
$ git grep distutils | wc -l
0
```
### Why are the changes needed?
Currently, Apache Spark ignores the warnings, but the module itself is removed in Python 3.12 via [PEP-632](https://peps.python.org/pep-0632) in favor of the `packaging` package.
https://github.com/apache/spark/blob/58c24a5719b8717ea37347c668c9df8a3714ae3c/python/pyspark/__init__.py#L54-L56
Initially, #43184 proposed to follow the Python community guideline by using the `packaging` package, but this PR instead embeds a `LooseVersion` Python class to avoid adding a new package requirement.
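A hedged before/after sketch of the import change this implies; the module path of the embedded class is an assumption and should be checked against the merged diff.
```python
# Assumed location of the embedded helper in Spark 4.0+; verify in the diff.
try:
    from pyspark.loose_version import LooseVersion
except ImportError:
    # Works on Python < 3.12 only; removed by PEP 632.
    from distutils.version import LooseVersion

# Usage is unchanged either way:
print(LooseVersion("2.1.1") >= LooseVersion("2.0.0"))
```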
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43192 from dongjoon-hyun/remove_distutils.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Force-pushed from 38436a4 to 6b51647.
viirya
left a comment
Do you want to run Python 3.12 test in CI?
Not yet. (1) Python 3.12 is not released yet; it will be released tomorrow. We can add Python 3.12 to …
viirya
left a comment
The work to make PySpark work with Python 3.12 is actually done in #43192. Looks good to me as this adds Python 3.12 to the file. We can deal with CI later after it is released.
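A sketch of what "adds Python 3.12 to the file" typically looks like in a setup.py classifiers list; which exact file this PR touches is not shown here, so treat the surrounding entries as illustrative.
```python
# Illustrative only: declaring Python 3.12 support in setup.py classifiers.
classifiers = [
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",  # newly added
    "Programming Language :: Python :: Implementation :: CPython",
]
```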
Thank you, @viirya and all.
HyukjinKwon
left a comment
LGTM!
… Python 3.12
### What changes were proposed in this pull request?
This PR aims to use the `17-jammy` tag instead of `17` to prevent Python 3.12.
### Why are the changes needed?
Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.
```
$ docker run -it --rm eclipse-temurin:17 cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```
Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.
- #43184
- #43192
### Does this PR introduce _any_ user-facing change?
No. This aims to recover to the same OS for consistent behavior.
### How was this patch tested?
Pass the CIs with K8s IT. Currently, it's broken at the Python image building phase.
- https://github.com/apache/spark/actions/workflows/build_branch35.yml
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47488 from dongjoon-hyun/SPARK-49005.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… prevent Python 3.12
### What changes were proposed in this pull request?
This PR aims to use the `17-jammy` tag instead of `17-jre` to prevent Python 3.12.
### Why are the changes needed?
Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.
```
$ docker run -it --rm eclipse-temurin:17-jre cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```
Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.
- #43184
- #43192
### Does this PR introduce _any_ user-facing change?
No. This aims to recover to the same OS for consistent behavior.
### How was this patch tested?
Pass the CIs with K8s IT. Currently, it's broken at the Python image building phase.
- https://github.com/apache/spark/actions/workflows/build_branch34.yml
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #47489 from dongjoon-hyun/SPARK-49005-3.4.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

What changes were proposed in this pull request?
This PR aims to support Python 3.12 in Apache Spark 4.0.0.
Note that this is tested with NumPy 1.26 and Pandas 2.1.1, but without PyArrow because it doesn't support Python 3.12 yet. PyArrow is still an optional component for Spark SQL, and I believe it will support Python 3.12 soon.
Why are the changes needed?
The Python 3.12 release will happen in a few days, on 2023-10-02.
Does this PR introduce any user-facing change?
No. This is a new addition to the PySpark supported environments.
How was this patch tested?
Pass the CIs for older Python versions and run manual tests with Python 3.12.
Was this patch authored or co-authored using generative AI tooling?
No.