Conversation

@dongjoon-hyun (Member) commented Sep 29, 2023

What changes were proposed in this pull request?

This PR aims to support Python 3.12 in Apache Spark 4.0.0.

Note that this is tested with NumPy 1.26 and Pandas 2.1.1, but without PyArrow because it doesn't support Python 3.12 yet.

PyArrow is currently compatible with Python 3.8, 3.9, 3.10 and 3.11.

PyArrow is still an optional component for Spark SQL and I believe it will support Python 3.12 soon.

Why are the changes needed?

The Python 3.12 release will happen in a few days, on 2023-10-02.

Does this PR introduce any user-facing change?

No. This is a new addition to the PySpark supported environments.

How was this patch tested?

Pass the CIs for the existing Python versions and run a manual test with Python 3.12.

$ python/run-tests.py --python-executables python3.12
Running PySpark tests. Output is in /Users/dongjoon/PRS/SPARK-44120/python/unit-tests.log
Will test against the following Python executables: ['python3.12']
Will test the following Python modules: ['pyspark-connect', 'pyspark-core', 'pyspark-errors', 'pyspark-ml', 'pyspark-ml-connect', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-connect-part0', 'pyspark-pandas-connect-part1', 'pyspark-pandas-connect-part2', 'pyspark-pandas-connect-part3', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming', 'pyspark-testing']
python3.12 python_implementation is CPython
python3.12 version is: Python 3.12.0rc2
...
Finished test(python3.12): pyspark.sql.functions (70s)
Finished test(python3.12): pyspark.sql.streaming.readwriter (77s)
Tests passed in 398 seconds

Skipped tests in pyspark.ml.tests.connect.test_parity_torch_data_loader with python3.12:
    test_data_loader (pyspark.ml.tests.connect.test_parity_torch_data_loader.TorchDistributorBaselineUnitTestsOnConnect.test_data_loader) ... skipped 'torch is required'
    test_data_loader (pyspark.ml.torch.tests.test_data_loader.TorchDistributorDataLoaderUnitTests.test_data_loader) ... skipped 'torch is required'

Skipped tests in pyspark.ml.torch.tests.test_data_loader with python3.12:
    test_data_loader (pyspark.ml.torch.tests.test_data_loader.TorchDistributorDataLoaderUnitTests.test_data_loader) ... skipped 'torch is required'

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member Author)

All Python tests passed.

[Screenshot: all PySpark test jobs passing, 2023-09-29]

@dongjoon-hyun (Member Author)

Could you review this, @LuciferYang ?

@dongjoon-hyun dongjoon-hyun marked this pull request as draft September 30, 2023 04:18
@dongjoon-hyun (Member Author)

It seems there are some side effects from the installation.

[info] org.apache.spark.sql.SQLQueryTestSuite *** ABORTED *** (6 milliseconds)
[info]   java.lang.RuntimeException: Python executable [python3] and/or pyspark are unavailable.

Let me check this first. I converted this PR to a draft for that reason.

@github-actions github-actions bot added the INFRA label Sep 30, 2023
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review September 30, 2023 07:04
@dongjoon-hyun (Member Author)

Hi, @HyukjinKwon . Could you review this PR?

Member

Oh, actually we can't just add a required dependency easily. It works with pip install, but it wouldn't work when users download Spark from the official website (as they would have to manually install the packaging dependency on their nodes). In the case of Py4J, we already include it in our release, so it works.

We should either port packaging into our release, or find a workaround to avoid adding the required dependency.
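
For context, the new dependency comes from migrating version checks off the removed distutils module. A representative before/after sketch (the thresholds and variable names below are illustrative, not the exact Spark diff):

```python
# Before (Python <= 3.11): distutils ships with CPython, so the import below
# "just works"; distutils is removed in Python 3.12 (PEP 632), so it fails there.
#   from distutils.version import LooseVersion
#   if LooseVersion(installed) >= LooseVersion(minimum): ...

# After: 'packaging' provides the comparison, but it is a third-party package
# that users of the downloadable Spark distribution must install themselves.
from packaging.version import Version

minimum = "1.0.5"    # illustrative minimum, not Spark's actual requirement
installed = "2.1.1"  # e.g. pandas.__version__ at runtime

if Version(installed) >= Version(minimum):
    print("installed version satisfies the minimum requirement")
```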

Member

I can take a look around (mid) next week too.

Member Author

Yes, currently it has to be installed manually, like NumPy.

However, that is independent of supporting Python 3.12 itself.

Can we do that porting after this?

@dongjoon-hyun (Member Author) commented Sep 30, 2023

I can also proceed to port packaging separately tomorrow, because the USA doesn't have the Chuseok holiday, @HyukjinKwon. :)

@dongjoon-hyun (Member Author)

Could you review this PR, @viirya ?

@viirya (Member) left a comment

For the purpose of Python 3.12 support, this change looks okay.

Regarding the packaging issue: I'm wondering how we treat other dependencies like pyarrow, pandas, etc. They are required for certain functions, but I think they are not included in the distribution either. So users who download Spark from the official website are required to install these dependencies themselves, right?

From this change, it looks like the usage of packaging is limited to the pandas and sql (connect- and pandas-related) modules. Can't we also list it as a required dependency for certain modules, just like how we treat pyarrow, pandas, etc.? Is it a generally required dependency like py4j? It seems not, judging from the pyspark changes here.

I suppose you can run PySpark from the downloaded distribution without packaging as long as you don't touch the connect and pandas functions. If so, it doesn't look like a special case like py4j, but more like pandas and pyarrow, which we list as conditionally required dependencies.

@dongjoon-hyun (Member Author) commented Sep 30, 2023

Thank you for the review, @viirya.

You are right. We consider them optional.

> So users who download Spark from the official website are required to install these dependencies themselves, right?

Actually, pyspark shell fails to start. So, we need to embed packaging like Py4J.

> I suppose you can run PySpark from the downloaded distribution without packaging as long as you don't touch the connect and pandas functions.

https://github.com/apache/spark/blob/master/python/lib/py4j-0.10.9.7-src.zip

However, embedding is not recommended by the official Python community, so I didn't do that in this PR. I'll handle that usability issue as an independent JIRA after this PR.

@viirya (Member) commented Sep 30, 2023

> Actually, pyspark shell fails to start. So, we need to embed packaging like Py4J.

From the distutils error shown in the description, I guess you mean the pyspark shell fails to start at the same location if packaging is not installed?

    from pyspark.sql.pandas.conversion import PandasConversionMixin
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/sql/pandas/conversion.py", line 29, in <module>
    from distutils.version import LooseVersion
ModuleNotFoundError: No module named 'distutils'

This is only triggered if pandas/pyarrow is installed/enabled. As you can see, it is under sql/pandas; if you don't have pandas installed, I think it won't run into that path.

@dongjoon-hyun (Member Author)

You are right that it should work that way. However, the technical problem is that the Apache PySpark code doesn't have a proper conditional check in SparkSession. Here is the error message when neither the pandas nor the packaging package is installed.

$ bin/pyspark
Python 3.12.0rc2 (main, Sep 21 2023, 21:22:29) [Clang 14.0.0 (clang-1400.0.28.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/shell.py", line 31, in <module>
    import pyspark
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/__init__.py", line 148, in <module>
    from pyspark.sql import SQLContext, HiveContext, Row  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/__init__.py", line 43, in <module>
    from pyspark.sql.context import SQLContext, HiveContext, UDFRegistration, UDTFRegistration
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/context.py", line 39, in <module>
    from pyspark.sql.session import _monkey_patch_RDD, SparkSession
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/session.py", line 47, in <module>
    from pyspark.sql.dataframe import DataFrame
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/dataframe.py", line 64, in <module>
    from pyspark.sql.pandas.conversion import PandasConversionMixin
  File "/Users/dongjoon/PRS/SPARK-44120/python/pyspark/sql/pandas/conversion.py", line 29, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'

@dongjoon-hyun (Member Author)

Let me follow your advice, @viirya. I'm trying to add a conditional import in ./python/pyspark/sql/pandas/conversion.py.
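
A minimal sketch of what such a guard could look like, assuming the goal is simply to let `import pyspark` succeed and to raise a clear error only when a pandas-related feature is actually used (the helper name is hypothetical, not the code that actually landed):

```python
# Hypothetical guard: tolerate a missing optional 'packaging' dependency at
# import time, and fail with a clear message only at the point of use.
try:
    from packaging.version import Version

    _has_packaging = True
except ImportError:
    Version = None  # type: ignore[assignment]
    _has_packaging = False


def _require_packaging() -> None:
    """Raise a helpful error if 'packaging' is needed but not installed."""
    if not _has_packaging:
        raise ImportError(
            "The 'packaging' package is required for pandas-related features. "
            "Install it with `pip install packaging`."
        )
```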

Member

I'm on my phone, so please ignore this comment if it's wrong. If we use distutils only for LooseVersion, we could just have our own LooseVersion in PySpark too (instead of embedding the package).

Member

That also sounds good, if LooseVersion is easy to port into Spark.

@dongjoon-hyun (Member Author) commented Oct 1, 2023

It's not difficult to copy or reimplement, @HyukjinKwon and @viirya.

However, it's under the PSF License, which is different from Py4J or cloudpickle (under the BSD 3-Clause license):

spark/LICENSE-binary, lines 460 to 462 (at commit 58c24a5):

python/lib/py4j-*-src.zip
python/pyspark/cloudpickle.py
python/pyspark/join.py

I believe it's compatible, but we need to take one more look before doing that, because the Apache Spark binary distribution (up to 3.5.0) doesn't include any PSF-licensed code yet.

@HyukjinKwon (Member) commented Oct 1, 2023

We could just write our own instead of copying; it's just a version string check in the end. I have not seen their code yet, so I can write one on my own too :-).
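
For illustration, a minimal sketch of such a home-grown version check; the class name and parsing rules are simplifying assumptions, not the implementation that later landed in #43192:

```python
import re
from functools import total_ordering


@total_ordering
class SimpleLooseVersion:
    """Tiny LooseVersion-like helper: split a version string into numeric and
    alphabetic components and compare them component by component."""

    _component_re = re.compile(r"\d+|[a-zA-Z]+")

    def __init__(self, vstring: str) -> None:
        self.vstring = vstring
        self.parts = [
            int(p) if p.isdigit() else p
            for p in self._component_re.findall(vstring)
        ]

    def _key(self):
        # Keep ints and strings in separate "lanes" so that mixed comparisons
        # never raise TypeError (a simplification of the real rules).
        return [(0, p) if isinstance(p, int) else (1, p) for p in self.parts]

    def __str__(self) -> str:
        return self.vstring

    def __eq__(self, other) -> bool:
        return self._key() == SimpleLooseVersion(str(other))._key()

    def __lt__(self, other) -> bool:
        return self._key() < SimpleLooseVersion(str(other))._key()


# The kind of check that used to rely on distutils.version.LooseVersion:
assert SimpleLooseVersion("2.1.1") >= SimpleLooseVersion("1.0.5")
assert SimpleLooseVersion("3.12.0rc2") < SimpleLooseVersion("3.12.0rc10")
```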

@dongjoon-hyun (Member Author)

Following the review comments, I made a separate PR, @HyukjinKwon and @viirya.

dongjoon-hyun added a commit that referenced this pull request Oct 1, 2023
### What changes were proposed in this pull request?

This PR aims to remove `distutils` usage from Spark codebase.

**BEFORE**
```
$ git grep distutils | wc -l
      38
```

**AFTER**
```
$ git grep distutils | wc -l
       0
```

### Why are the changes needed?

Currently, Apache Spark ignores the deprecation warnings, but the module itself is removed in Python 3.12 via [PEP-632](https://peps.python.org/pep-0632) in favor of the `packaging` package.

https://github.com/apache/spark/blob/58c24a5719b8717ea37347c668c9df8a3714ae3c/python/pyspark/__init__.py#L54-L56

Initially, #43184 proposed to follow the Python community guideline by using the `packaging` package, but this PR embeds a `LooseVersion` Python class instead to avoid adding a new package requirement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43192 from dongjoon-hyun/remove_distutils.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya (Member) left a comment

Do you want to run Python 3.12 test in CI?

@dongjoon-hyun (Member Author) commented Oct 1, 2023

> Do you want to run Python 3.12 test in CI?

Not yet. (1) Python 3.12 is not released yet; it will be released tomorrow. We can add Python 3.12 to the actions/setup-python GitHub Action CI later. (2) We may need to add a separate daily pipeline instead of using the main PR builder, because the as-is PySpark (PySpark + Pandas + Connect) tests require too many resources.

@viirya (Member) left a comment

The work to make PySpark work with Python 3.12 was actually done in #43192. This looks good to me, as it adds Python 3.12 to the file. We can deal with CI later, after it is released.

@dongjoon-hyun (Member Author)

Thank you, @viirya and all.
Merged to master for Apache Spark 4.0.0.

@HyukjinKwon (Member) left a comment

LGTM!

dongjoon-hyun added a commit that referenced this pull request Jul 25, 2024
… Python 3.12

### What changes were proposed in this pull request?

This PR aims to use the `17-jammy` tag instead of `17` to prevent Python 3.12.

### Why are the changes needed?

Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.

```
$ docker run -it --rm eclipse-temurin:17 cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```

Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.

- #43184
- #43192

### Does this PR introduce _any_ user-facing change?

No. This aims to restore the same OS for consistent behavior.

### How was this patch tested?

Pass the CIs with the K8s IT. Currently, it's broken at the Python image building phase.

- https://github.com/apache/spark/actions/workflows/build_branch35.yml

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47488 from dongjoon-hyun/SPARK-49005.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Jul 25, 2024
…vent Python 3.12

### What changes were proposed in this pull request?

This PR aims to use the `17-jammy` tag instead of `17-jre` to prevent Python 3.12.

### Why are the changes needed?

Two days ago, `eclipse-temurin:17` switched its baseline OS to `Ubuntu 24.04`, which brings `Python 3.12`.

```
$ docker run -it --rm eclipse-temurin:17-jre cat /etc/os-release | grep VERSION_ID
VERSION_ID="24.04"

$ docker run -it --rm eclipse-temurin:17-jammy cat /etc/os-release | grep VERSION_ID
VERSION_ID="22.04"
```

Since Python 3.12 support is added only in Apache Spark 4.0.0, we need to keep using the previous OS, `Ubuntu 22.04`.

- #43184
- #43192

### Does this PR introduce _any_ user-facing change?

No. This aims to restore the same OS for consistent behavior.

### How was this patch tested?

Pass the CIs with the K8s IT. Currently, it's broken at the Python image building phase.

- https://github.com/apache/spark/actions/workflows/build_branch34.yml

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47489 from dongjoon-hyun/SPARK-49005-3.4.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-44120 branch November 16, 2025 16:47