
Conversation


@rmannibucau rmannibucau commented Dec 5, 2025

What changes were proposed in this pull request?

Re-enable the assembly artifacts. Note that I'm not 100% sure of the release process, so maybe I missed something.
I'm also not sure how this relates to ./dev/make-distribution.sh.

Why are the changes needed?

Being able to download the Spark distribution (ideally any flavor, but the base one is the most important) makes it possible to provide a custom distro with prepackaged bundles (like Apache Iceberg, for example).
The --packages option can be tricky in some environments or with some dependencies.

Does this PR introduce any user-facing change?

Currently no zip is published, so nothing breaks.

How was this patch tested?

Not tested in release process.

Was this patch authored or co-authored using generative AI tooling?

Nope.

@github-actions github-actions bot added the BUILD label Dec 5, 2025
@rmannibucau rmannibucau changed the title Deploy the assembly as an artifact [WIP] Deploy the assembly as an artifact Dec 6, 2025
@rmannibucau
Author

@dongjoon-hyun hello, can you guide me on how you do the releases so I can attach the tgz/zip to the deployment properly, please? This would be a great enhancement to the release.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft December 10, 2025 18:48
Member

@dongjoon-hyun dongjoon-hyun left a comment


As you know, the Apache Spark community intentionally dropped the assembly approach in Apache Spark 2.0.0 due to many issues, including security, @rmannibucau.

@dongjoon-hyun
Member

Let me close this because it would be a significant regression from the Apache Spark community's perspective. We can continue our discussion on the closed PR.

This would be a great enhancement to the release

@dongjoon-hyun
Member

Personally, I don't recommend the assembly because I believe the assembly feature has not been maintained properly since 2.0.0.

@rmannibucau
Author

@dongjoon-hyun the assembly is what is proposed on the Apache Spark website download page, so Spark didn't drop anything; it only dropped the automation and the Central publication, which is negative from a user standpoint and leads to issues in downstream usages and automation since the download URLs are not stable (they would be if published on Central).

Also note that from a security standpoint it is no worse than any other Apache Spark distribution (from the tgz in the download area to the Docker image), by design.

So overall, from my point of view, I don't see why we shouldn't fix this convenience deliverable: it will help the community and not hurt Spark more than today, since the bundles are archived automatically anyway and must be "immutable" (in spirit, since nothing is ever truly immutable).

Can you please reconsider, since it doesn't impact the Spark project beyond having to push the binary(ies) to Nexus?

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 10, 2025

Could you give me the specific link of that part from Apache Spark website?

the assembly is what is proposed on apache spark website

I thought you were trying to build a fat jar like Apache Spark 1.6.x. Did I understand your question correctly?

@rmannibucau
Author

@dongjoon-hyun this is what I'm referring to: https://spark.apache.org/downloads.html (you know, the latest and previous releases do not use the same link, and archives.apache.org is not considered stable, so both cases are broken for consuming the zip/tgz).
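
For context, the instability looks like this: the latest release is served from the mirror/CDN network while older releases move to the archive, so the download URL pattern changes over a release's lifetime (versions below are placeholders):

# latest release, served via the Apache CDN
https://dlcdn.apache.org/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz
# older releases, moved to the archive
https://archive.apache.org/dist/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz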

I thought you were trying to build a fat jar like Apache Spark 1.6.x. Did I understand your question correctly?

No, I'm more trying to build a custom distro to use on the local machines of ops people to interact with a Spark cluster - but I need to add a bunch of jars and props.

@rmannibucau
Author

@dongjoon-hyun any hope we can work on this issue so the assembly is consumable with the maven-dependency-plugin "natively"? Happy to adjust the PR once I know how you release it (if it's manual, it is fine to keep it closed and just add it to the release steps for me, if you prefer).
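
To illustrate what "natively consumable" would mean, here is a sketch with the maven-dependency-plugin; the coordinates, type, and classifier are hypothetical since no such artifact is published today:

# hypothetical artifact coordinates -- nothing like this exists on Central yet
mvn dependency:copy \
  -Dartifact=org.apache.spark:spark-assembly_2.13:4.1.0:tgz:dist \
  -DoutputDirectory=target/spark-dist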

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 19, 2025

It doesn't make sense to me because Apache Spark Standalone Cluster has the full distribution already to launch Spark Master and Spark Worker JVM. What you need is to deploy only your application, not a custom Apache Spark again.

No, I'm more trying to do a custom distro to use on local machine of ops to interact with a Spark Cluster - but I need to add a bunch of jars and props.

FYI, Apache Spark provides an official Apache Spark Kubernetes Operator; please see the examples.

@rmannibucau
Author

@dongjoon-hyun

It doesn't make sense to me because Apache Spark Standalone Cluster has the full distribution already to launch Spark Master and Spark Worker JVM. What you need is to deploy only your application, not a custom Apache Spark again.

This is exactly the point of having a custom distribution: in a way, to include the application inside it.

Why do you have spark-shell in the Spark distribution? It is an application, so by your statement it must not be there; this is exactly the same case.

Ultimately I do not want to have to rely on the internet or a random network config (think enterprise) to download the application, so I want to bundle it upfront.

FYI, Apache Spark provides an official Apache Spark Kubernetes Operator

This is what I'm using for most application use cases, but I need the custom distro for human interaction (spark-sql) - once again, just to keep it simple.

--packages and friends relying on Ivy are not stable enough and are too poorly configurable (excludes are a pain, for example), and having to download Spark and then the libraries separately is pointless.

Look at the Apache Iceberg case: you just need the jars to use it with the Apache Spark SQL shell, so why can't we make a distribution with the application prebundled to make it easier? It would also allow customizing some library versions (Parquet) which conflict and force you to use userClassPathFirst, which has other side effects.

So yes, this is needed to cover all usages, even if most automated usages are, as you say, handled in other ways.

Side note: just out of curiosity, why do you push back so much on something trivial to do? Is there a blocker in the release process to deploying the zip/tar.gz? We do it in plenty of Apache projects and everyone is happy.

@dongjoon-hyun
Member

No, you don't need to do that for your applications, @rmannibucau.

This is exactly the point of having a custom distribution: in a way, to include the application inside it.

Please submit your application according to the Apache Spark community guideline.

For the following question: Spark Shell is part of the Apache Spark interactive environment, not an application.

Why do you have spark-shell in the Spark distribution? It is an application, so by your statement it must not be there; this is exactly the same case.

Let's talk about applications (including yours). Since an application should be built and deployed independently, you can see that it's only 1.5MB and lives in the spark-4.1.0-bin-hadoop3/examples/jars directory instead of spark-4.1.0-bin-hadoop3/jars, @rmannibucau. You need to build something like spark-examples_2.13-4.1.0.jar independently and provide it via submitting-applications.html.

$ ls -alh spark-4.1.0-bin-hadoop3/examples/jars
total 3296
drwxr-xr-x@ 4 dongjoon  staff   128B Dec 11 23:32 .
drwxr-xr-x@ 4 dongjoon  staff   128B Dec 11 23:32 ..
-rw-r--r--@ 1 dongjoon  staff    79K Dec 11 23:32 scopt_2.13-3.7.1.jar
-rw-r--r--@ 1 dongjoon  staff   1.5M Dec 11 23:32 spark-examples_2.13-4.1.0.jar
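
For illustration, submitting such an independently built jar looks roughly like the following; the master URL is a placeholder and SparkPi is just the stock example class:

$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://<master-host>:7077 \
    examples/jars/spark-examples_2.13-4.1.0.jar 100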

@rmannibucau
Author

rmannibucau commented Dec 22, 2025

@dongjoon-hyun ok, let's step back, because for me an application can be interactive or not, but you differentiate the two very strictly in your answer, so let's only focus on the interactive case please (the SQL shell).

How, by launching ./bin/spark-whatever, can I do a SELECT * FROM my_catalog.my_table out of the box? Submitting an application defeats the purpose of providing any distribution there.

If it helps, here is the current, not-at-all-convenient way to set up the connection (some props truncated):

# resolve the Spark distro directory from this script's location
base="$(dirname "$0")"

"$base/spark-sql" \
    --packages org.apache.iceberg:iceberg-spark-4.0_2.13:1.10.0,org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.iceberg:iceberg-aws:1.10.0 \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.iceberg.vectorization.enabled=true \
    --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg.type=rest \
    --conf spark.sql.catalog.iceberg.uri=http://iceberg \
    --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/ \
    --conf spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.iceberg.client.region="${AWS_REGION:-us-east-1}" \
    --conf spark.sql.session.timeZone=UTC \
    --conf spark.sql.iceberg.writer.target-file-size-bytes=268435456 \
    --conf spark.sql.parquet.compression.codec=zstd \
    --conf spark.sql.legacy.charVarcharAsString=true

side note: in reality there are way more properties and even a custom catalog.

one very unsatisfying solution is --packages, because it doesn't let you fully control the jars (the transitive ones, for example), but for the shell/SQL case you also do not want to tune your catalog a lot locally, so a custom distro was my compromise.
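
For illustration only, the custom-distro compromise described above would look roughly like this (paths and versions are hypothetical):

# unpack an official base distribution (hypothetical version/flavor)
tar xzf spark-4.1.0-bin-hadoop3.tgz
# drop the extra runtime jars next to Spark's own jars
cp iceberg-spark-runtime-4.0_2.13-1.10.0.jar spark-4.1.0-bin-hadoop3/jars/
# bake the catalog properties in so users don't have to pass --conf by hand
cat >> spark-4.1.0-bin-hadoop3/conf/spark-defaults.conf <<'EOF'
spark.sql.extensions            org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg       org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type  rest
spark.sql.catalog.iceberg.uri   http://iceberg
EOF
# repackage as the custom distro
tar czf my-spark-distro.tgz spark-4.1.0-bin-hadoop3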

if there is something easier I'm happy to use it instead; the idea was to quickly get feedback on the Iceberg data without having to write an app for that, using the SQL shell.
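
For reference, the closest knob today is spark-submit's --exclude-packages flag, which trims transitive dependencies resolved by --packages; a minimal sketch, with a purely hypothetical exclusion:

# the excluded coordinate is hypothetical, only to show the flag's shape
"$base/spark-sql" \
    --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0 \
    --exclude-packages org.apache.parquet:parquet-column \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions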
