Conversation

@dongjoon-hyun (Member) commented Sep 29, 2023

What changes were proposed in this pull request?

This PR aims to support Python 3.12 in Apache Spark 4.0.0.

Note that this is tested with NumPy 1.26 and Pandas 2.1.1, but without PyArrow because it doesn't support Python 3.12 yet.

PyArrow is currently compatible with Python 3.8, 3.9, 3.10 and 3.11.

PyArrow is still an optional component for Spark SQL and I believe it will support Python 3.12 soon.

Why are the changes needed?

The Python 3.12 release will happen in a few days, on 2023-10-02.

Does this PR introduce any user-facing change?

No. This is a new addition to the PySpark supported environments.

How was this patch tested?

Pass the CIs for the existing Python versions and run a manual test with Python 3.12.

$ python/run-tests.py --python-executables python3.12
Running PySpark tests. Output is in /Users/dongjoon/PRS/SPARK-44120/python/unit-tests.log
Will test against the following Python executables: ['python3.12']
Will test the following Python modules: ['pyspark-connect', 'pyspark-core', 'pyspark-errors', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-connect-part0', 'pyspark-pandas-connect-part1', 'pyspark-pandas-connect-part2', 'pyspark-pandas-connect-part3', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming', 'pyspark-testing']
python3.12 python_implementation is CPython
python3.12 version is: Python 3.12.0rc2
...
Finished test(python3.12): pyspark.sql.functions (70s)
Finished test(python3.12): pyspark.sql.streaming.readwriter (77s)
Tests passed in 398 seconds

Skipped tests in pyspark.ml.tests.connect.test_parity_torch_data_loader with python3.12:
    test_data_loader (pyspark.ml.tests.connect.test_parity_torch_data_loader.TorchDistributorBaselineUnitTestsOnConnect.test_data_loader) ... skipped 'torch is required'
    test_data_loader (pyspark.ml.torch.tests.test_data_loader.TorchDistributorDataLoaderUnitTests.test_data_loader) ... skipped 'torch is required'

Skipped tests in pyspark.ml.torch.tests.test_data_loader with python3.12:
    test_data_loader (pyspark.ml.torch.tests.test_data_loader.TorchDistributorDataLoaderUnitTests.test_data_loader) ... skipped 'torch is required'

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member Author)

All Python tests passed.

[Screenshot: all PySpark test jobs passing, 2023-09-29]

@dongjoon-hyun (Member Author)

Could you review this, @LuciferYang ?

@dongjoon-hyun dongjoon-hyun marked this pull request as draft September 30, 2023 04:18
@dongjoon-hyun (Member Author)

It seems there are some side effects from the installation.

[info] org.apache.spark.sql.SQLQueryTestSuite *** ABORTED *** (6 milliseconds)
[info]   java.lang.RuntimeException: Python executable [python3] and/or pyspark are unavailable.

Let me check this first. I converted this PR to a draft for that reason.

@github-actions github-actions bot added the INFRA label Sep 30, 2023
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review September 30, 2023 07:04
@dongjoon-hyun (Member Author)

Hi, @HyukjinKwon . Could you review this PR?

Member

Oh, actually we can't just add a required dependency easily. It works with pip install, but it wouldn't work when users download Spark from the official website (as they would have to manually install the packaging dependency on their nodes). In the case of Py4J, we already include it in our release, so it works.

We should either port packaging into our release, or find a workaround to avoid adding the required dependency.
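
For context, the new dependency comes from migrating version checks off the removed distutils module. A representative before/after sketch (the thresholds and variable names below are illustrative, not the exact Spark diff):

```python
# Before (Python <= 3.11): distutils ships with CPython, so the import below
# "just works"; distutils is removed in Python 3.12 (PEP 632), so it fails there.
#   from distutils.version import LooseVersion
#   if LooseVersion(installed) >= LooseVersion(minimum): ...

# After: 'packaging' provides the comparison, but it is a third-party package
# that users of the downloadable Spark distribution must install themselves.
from packaging.version import Version

minimum = "1.0.5"    # illustrative minimum, not Spark's actual requirement
installed = "2.1.1"  # e.g. pandas.__version__ at runtime

if Version(installed) >= Version(minimum):
    print("installed version satisfies the minimum requirement")
```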

Member

I can take a look around (mid) next week too.

Member Author

Yes, currently it has to be installed manually, like NumPy.

However, that is independent of supporting Python 3.12 itself.

Can we do that porting after this?

@dongjoon-hyun (Member Author) commented Sep 30, 2023

I can also proceed to port packaging separately tomorrow, because the USA doesn't have the Chuseok holiday, @HyukjinKwon. :)

@dongjoon-hyun (Member Author)

Could you review this PR, @viirya ?

@viirya (Member) left a comment

For the purpose of Python 3.12 support, this change looks okay.

Regarding the packaging issue: I'm wondering how we treat other dependencies like pyarrow, pandas, etc. They are required for certain functions, but I think they are not included in the distribution either. So users who download Spark from the official website are required to install these dependencies themselves, right?

From this change, it looks like the usage of packaging is limited to the pandas and sql (connect- and pandas-related) modules. Can't we also list it as a required dependency for certain modules, just like how we treat pyarrow, pandas, etc.? Is it a generally required dependency like py4j? It seems not, judging from the pyspark changes here.

I suppose you can run PySpark from the downloaded distribution without packaging as long as you don't touch the connect and pandas functions. If so, it doesn't look like a special case like py4j, but more like pandas and pyarrow, which we list as conditionally required dependencies.

@dongjoon-hyun (Member Author) commented Sep 30, 2023

Thank you for the review, @viirya.

You are right. We consider them optional.

> So users who download Spark from the official website are required to install these dependencies themselves, right?

Actually, pyspark shell fails to start. So, we need to embed packaging like Py4J.

> I suppose you can run PySpark from the downloaded distribution without packaging as long as you don't touch the connect and pandas functions.

https://github.com/apache/spark/blob/master/python/lib/py4j-0.10.9.7-src.zip

However, embedding is not recommended by the official Python community, so I didn't do that in this PR. I'll handle that usability issue as an independent JIRA after this PR.

@viirya (Member) commented Sep 30, 2023

> Actually, pyspark shell fails to start. So, we need to embed packaging like Py4J.

From the distutils error shown in the description, I guess you mean the pyspark shell fails to start at the same location if packaging is not installed?

    from pyspark.sql.pandas.conversion import PandasConversionMixin
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/sql/pandas/conversion.py", line 29, in <module>
    from distutils.version import LooseVersion
ModuleNotFoundError: No module named 'distutils'

This is only triggered if pandas/pyarrow is installed/enabled. As you can see, it is under sql/pandas; if you don't have pandas installed, I think it won't run into that path.

@dongjoon-hyun (Member Author)

You are right that it should work that way. However, the technical problem is that the Apache PySpark code doesn't have a proper conditional check in SparkSession. Here is the error message when neither the pandas nor the packaging package is installed.

$ bin/pyspark
Python 3.12.0rc2 (main, Sep 21 2023, 21:22:29) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/shell.py", line 31, in <module>
    import pyspark
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/__init__.py", line 148, in <module>
    from pyspark.sql import SQLContext, HiveContext, Row  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/__init__.py", line 43, in <module>
    from pyspark.sql.context import SQLContext, HiveContext, UDFRegistration, UDTFRegistration
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/context.py", line 39, in <module>
    from pyspark.sql.session import _monkey_patch_RDD, SparkSession
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/session.py", line 47, in <module>
    from pyspark.sql.dataframe import DataFrame
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/dataframe.py", line 64, in <module>
    from pyspark.sql.pandas.conversion import PandasConversionMixin
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/pandas/conversion.py", line 29, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'

@dongjoon-hyun (Member Author)

Let me follow your advice, @viirya. I'm trying to add a conditional import in ./python/pyspark/sql/pandas/conversion.py.
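
A minimal sketch of what such a guard could look like, assuming the goal is simply to let `import pyspark` succeed and to raise a clear error only when a pandas-related feature is actually used (the helper name is hypothetical, not the code that actually landed):

```python
# Hypothetical guard: tolerate a missing optional 'packaging' dependency at
# import time, and fail with a clear message only at the point of use.
try:
    from packaging.version import Version

    _has_packaging = True
except ImportError:
    Version = None  # type: ignore[assignment]
    _has_packaging = False


def _require_packaging() -> None:
    """Raise a helpful error if 'packaging' is needed but not installed."""
    if not _has_packaging:
        raise ImportError(
            "The 'packaging' package is required for pandas-related features. "
            "Install it with `pip install packaging`."
        )
```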

Member

I'm on my phone, so please ignore this comment if it's wrong. If we use distutils only for LooseVersion, we could just have our own LooseVersion in PySpark too (instead of embedding the package).

Member

That also sounds good, if LooseVersion is easy to port into Spark.

@dongjoon-hyun (Member Author) commented Oct 1, 2023

It's not difficult to copy or reimplement, @HyukjinKwon and @viirya.

However, it's under the PSF License, which is different from Py4J or cloudpickle (under the BSD 3-Clause license):

spark/LICENSE-binary, lines 460 to 462 (at commit 58c24a5):

python/lib/py4j-*-src.zip
python/pyspark/cloudpickle.py
python/pyspark/join.py

I believe it's compatible, but we need to take one more look before doing that, because the Apache Spark binary distribution (up to 3.5.0) doesn't include any PSF-licensed code yet.

@HyukjinKwon (Member) commented Oct 1, 2023

We could just write our own instead of copying; it's just a version string check in the end. I have not seen their code yet, so I can write one on my own too :-).
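
For illustration, a minimal sketch of such a home-grown version check; the class name and parsing rules are simplifying assumptions, not the implementation that later landed in #43192:

```python
import re
from functools import total_ordering


@total_ordering
class SimpleLooseVersion:
    """Tiny LooseVersion-like helper: split a version string into numeric and
    alphabetic components and compare them component by component."""

    _component_re = re.compile(r"\d+|[a-zA-Z]+")

    def __init__(self, vstring: str) -> None:
        self.vstring = vstring
        self.parts = [
            int(p) if p.isdigit() else p
            for p in self._component_re.findall(vstring)
        ]

    def _key(self):
        # Keep ints and strings in separate "lanes" so that mixed comparisons
        # never raise TypeError (a simplification of the real rules).
        return [(0, p) if isinstance(p, int) else (1, p) for p in self.parts]

    def __str__(self) -> str:
        return self.vstring

    def __eq__(self, other) -> bool:
        return self._key() == SimpleLooseVersion(str(other))._key()

    def __lt__(self, other) -> bool:
        return self._key() < SimpleLooseVersion(str(other))._key()


# The kind of check that used to rely on distutils.version.LooseVersion:
assert SimpleLooseVersion("2.1.1") >= SimpleLooseVersion("1.0.5")
assert SimpleLooseVersion("3.12.0rc2") < SimpleLooseVersion("3.12.0rc10")
```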

@dongjoon-hyun (Member Author)

Following the review comments, I made a separate PR, @HyukjinKwon and @viirya.

dongjoon-hyun added a commit that referenced this pull request Oct 1, 2023
### What changes were proposed in this pull request?

This PR aims to remove `distutils` usage from Spark codebase.

**BEFORE**
```
$ git grep distutils | wc -l
      38
```

**AFTER**
```
$ git grep distutils | wc -l
       0
```

### Why are the changes needed?

Currently, Apache Spark ignores the deprecation warnings, but the module itself is removed in Python 3.12 via [PEP-632](https://peps.python.org/pep-0632) in favor of the `packaging` package.

https://github.com/apache/spark/blob/58c24a5719b8717ea37347c668c9df8a3714ae3c/python/pyspark/__init__.py#L54-L56

Initially, #43184 proposed to follow the Python community guideline by using the `packaging` package, but this PR embeds a `LooseVersion` Python class instead to avoid adding a new package requirement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43192 from dongjoon-hyun/remove_distutils.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya (Member) left a comment

Do you want to run Python 3.12 test in CI?

@dongjoon-hyun (Member Author) commented Oct 1, 2023

> Do you want to run Python 3.12 test in CI?

Not yet. (1) Python 3.12 is not released yet; it will be released tomorrow. We can add Python 3.12 to the actions/setup-python GitHub Action CI later. (2) We may need to add a separate daily pipeline instead of using the main PR builder, because the as-is PySpark (PySpark + Pandas + Connect) tests require too many resources.

@viirya (Member) left a comment

The work to make PySpark work with Python 3.12 was actually done in #43192. This looks good to me, as it adds Python 3.12 to the file. We can deal with CI later, after it is released.

@dongjoon-hyun (Member Author)

Thank you, @viirya and all.
Merged to master for Apache Spark 4.0.0.

@HyukjinKwon (Member) left a comment

LGTM!

dongjoon-hyun added a commit that referenced this pull request Jul 25, 2024
… Python 3.12

### What changes were proposed in this pull request?

This PR aims to use the `17-jammy` tag instead of `17` to prevent Python 3.12.

### Why are the changes needed?

Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.

```
$ docker run -it --rm eclipse-temurin:17 cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```

Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.

- #43184
- #43192

### Does this PR introduce _any_ user-facing change?

No. This aims to restore the same OS for consistent behavior.

### How was this patch tested?

Pass the CIs with the K8s IT. Currently, it's broken at the Python image building phase.

- https://github.com/apache/spark/actions/workflows/build_branch35.yml

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47488 from dongjoon-hyun/SPARK-49005.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Jul 25, 2024
…vent Python 3.12

### What changes were proposed in this pull request?

This PR aims to use the `17-jammy` tag instead of `17-jre` to prevent Python 3.12.

### Why are the changes needed?

Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.

```
$ docker run -it --rm eclipse-temurin:17-jre cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```

Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.

- #43184
- #43192

### Does this PR introduce _any_ user-facing change?

No. This aims to restore the same OS for consistent behavior.

### How was this patch tested?

Pass the CIs with the K8s IT. Currently, it's broken at the Python image building phase.

- https://github.com/apache/spark/actions/workflows/build_branch34.yml

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47489 from dongjoon-hyun/SPARK-49005-3.4.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-44120 branch November 16, 2025 16:47