
Conversation


@rmannibucau rmannibucau commented Dec 5, 2025

What changes were proposed in this pull request?

Re-enable the assembly artifacts. Note that I'm not 100% sure of the release process, so maybe I missed something.
I'm also not sure how this relates to ./dev/make-distribution.sh.

Why are the changes needed?

Being able to download the Spark distribution (ideally any flavor, but the base one is the most important) makes it possible to provide a custom distro with prepackaged bundles (like Apache Iceberg, for example).
The --packages option can be tricky in some environments or with some dependencies.

Does this PR introduce any user-facing change?

Currently no zip is published, so nothing breaks.

How was this patch tested?

Not tested in release process.

Was this patch authored or co-authored using generative AI tooling?

Nope.

@github-actions github-actions bot added the BUILD label Dec 5, 2025
@rmannibucau rmannibucau changed the title Deploy the assembly as an artifact [WIP] Deploy the assembly as an artifact Dec 6, 2025
@rmannibucau
Author

@dongjoon-hyun hello, can you guide me on how you do the releases so I can attach the tgz/zip to the deployment properly, please? This would be a great enhancement to the release.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft December 10, 2025 18:48
Member

@dongjoon-hyun dongjoon-hyun left a comment


As you know, the Apache Spark community intentionally dropped the assembly approach in Apache Spark 2.0.0 due to many issues, including security, @rmannibucau.

@dongjoon-hyun
Member

Let me close this because it would be a significant regression from the Apache Spark community's perspective. We can continue our discussion on the closed PR.

This would be a great enhancement to the release

@dongjoon-hyun
Member

Personally, I don't recommend the assembly because I believe the assembly feature has not been maintained properly since 2.0.0.

@rmannibucau
Author

@dongjoon-hyun the assembly is what is proposed on the Apache Spark website download page, so Spark didn't drop anything; it only dropped the automation and the Central publication, which is negative from a user standpoint and leads to issues in downstream usages and automation since the download URLs are not stable (they would be if published on Central).

Also note that from a security standpoint it is no worse than any other Apache Spark distribution (from the tgz in the download area to the Docker image), by design.

So overall, from my point of view, I don't see why we shouldn't fix this convenience deliverable: it will help the community and not hurt Spark more than today, since the bundles are archived automatically anyway and must be "immutable" (in spirit, since nothing is ever truly immutable).

Can you please reconsider, since it doesn't impact the Spark project beyond having to push the binary(ies) to Nexus?

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 10, 2025

Could you give me the specific link of that part from Apache Spark website?

the assembly is what is proposed on apache spark website

I thought you were trying to build a fat jar like Apache Spark 1.6.x. Did I understand your question correctly?

@rmannibucau
Author

@dongjoon-hyun this is what I'm referring to: https://spark.apache.org/downloads.html (you know, the latest and previous releases do not use the same link, and archives.apache.org is not considered stable, so both cases are broken for consuming the zip/tgz).
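
For context, the instability looks like this: the latest release is served from the mirror/CDN network while older releases move to the archive, so the download URL pattern changes over a release's lifetime (versions below are placeholders):

# latest release, served via the Apache CDN
https://dlcdn.apache.org/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz
# older releases, moved to the archive
https://archive.apache.org/dist/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz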

I thought you were trying to build a fat jar like Apache Spark 1.6.x. Did I understand your question correctly?

No, I'm more trying to build a custom distro to use on the local machines of ops people to interact with a Spark cluster - but I need to add a bunch of jars and props.

@rmannibucau
Author

@dongjoon-hyun any hope we can work on this issue so the assembly is consumable with the maven-dependency-plugin "natively"? Happy to adjust the PR once I know how you release it (if it's manual, it is fine to keep it closed and just add it to the release steps for me, if you prefer).
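
To illustrate what "natively consumable" would mean, here is a sketch with the maven-dependency-plugin; the coordinates, type, and classifier are hypothetical since no such artifact is published today:

# hypothetical artifact coordinates -- nothing like this exists on Central yet
mvn dependency:copy \
  -Dartifact=org.apache.spark:spark-assembly_2.13:4.1.0:tgz:dist \
  -DoutputDirectory=target/spark-dist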

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 19, 2025

It doesn't make sense to me because Apache Spark Standalone Cluster has the full distribution already to launch Spark Master and Spark Worker JVM. What you need is to deploy only your application, not a custom Apache Spark again.

No, I'm more trying to do a custom distro to use on local machine of ops to interact with a Spark Cluster - but I need to add a bunch of jars and props.

FYI, Apache Spark provides an official Apache Spark Kubernetes Operator; please see the examples.

@rmannibucau
Author

@dongjoon-hyun

It doesn't make sense to me because Apache Spark Standalone Cluster has the full distribution already to launch Spark Master and Spark Worker JVM. What you need is to deploy only your application, not a custom Apache Spark again.

This is exactly the point of having a custom distribution: in a way, to include the application inside it.

Why do you have spark-shell in the Spark distribution? It is an application, so by your statement it must not be there; this is exactly the same case.

Ultimately I do not want to have to rely on the internet or a random network config (think enterprise) to download the application, so I want to bundle it upfront.

FYI, Apache Spark provides an official Apache Spark Kubernetes Operator

This is what I'm using for most application use cases, but I need the custom distro for human interaction (spark-sql) - once again, just to keep it simple.

--packages and friends relying on Ivy are not stable enough and are too poorly configurable (excludes are a pain, for example), and having to download Spark and then the libraries separately is pointless.

Look at the Apache Iceberg case: you just need the jars to use it with the Apache Spark SQL shell, so why can't we make a distribution with the application prebundled to make it easier? It would also allow customizing some library versions (Parquet) which conflict and force you to use userClassPathFirst, which has other side effects.

So yes, this is needed to cover all usages, even if most automated usages are, as you say, handled in other ways.

Side note: just out of curiosity, why do you push back so much on something trivial to do? Is there a blocker in the release process to deploying the zip/tar.gz? We do it in plenty of Apache projects and everyone is happy.

@dongjoon-hyun
Member

No, you don't need to do that for your applications, @rmannibucau.

This is exactly the point of having a custom distribution: in a way, to include the application inside it.

Please submit your application according to the Apache Spark community guideline.

For the following question: Spark Shell is part of the Apache Spark interactive environment, not an application.

Why do you have spark-shell in the Spark distribution? It is an application, so by your statement it must not be there; this is exactly the same case.

Let's talk about applications (including yours). Since an application should be built and deployed independently, you can see that it's only 1.5MB and lives in the spark-4.1.0-bin-hadoop3/examples/jars directory instead of spark-4.1.0-bin-hadoop3/jars, @rmannibucau. You need to build something like spark-examples_2.13-4.1.0.jar independently and provide it via submitting-applications.html.

$ ls -alh spark-4.1.0-bin-hadoop3/examples/jars
total 3296
drwxr-xr-x@ 4 dongjoon  staff   128B Dec 11 23:32 .
drwxr-xr-x@ 4 dongjoon  staff   128B Dec 11 23:32 ..
-rw-r--r--@ 1 dongjoon  staff    79K Dec 11 23:32 scopt_2.13-3.7.1.jar
-rw-r--r--@ 1 dongjoon  staff   1.5M Dec 11 23:32 spark-examples_2.13-4.1.0.jar
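
For illustration, submitting such an independently built jar looks roughly like the following; the master URL is a placeholder and SparkPi is just the stock example class:

$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://<master-host>:7077 \
    examples/jars/spark-examples_2.13-4.1.0.jar 100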

@rmannibucau
Author

rmannibucau commented Dec 22, 2025

@dongjoon-hyun ok, let's step back, because for me an application can be interactive or not, but you differentiate the two very strictly in your answer, so let's only focus on the interactive case please (the SQL shell).

How, by launching ./bin/spark-whatever, can I do a SELECT * FROM my_catalog.my_table out of the box? Submitting an application defeats the purpose of providing any distribution there.

If it helps, here is the current, not-at-all-convenient way to set up the connection (some props truncated):

# resolve the Spark distro directory from this script's location
base="$(dirname "$0")"

"$base/spark-sql" \
    --packages org.apache.iceberg:iceberg-spark-4.0_2.13:1.10.0,org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0,org.apache.iceberg:iceberg-aws:1.10.0 \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.iceberg.vectorization.enabled=true \
    --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg.type=rest \
    --conf spark.sql.catalog.iceberg.uri=http://iceberg \
    --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse/ \
    --conf spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.iceberg.client.region="${AWS_REGION:-us-east-1}" \
    --conf spark.sql.session.timeZone=UTC \
    --conf spark.sql.iceberg.writer.target-file-size-bytes=268435456 \
    --conf spark.sql.parquet.compression.codec=zstd \
    --conf spark.sql.legacy.charVarcharAsString=true

side note: in reality there are way more properties and even a custom catalog.

one very unsatisfying solution is --packages, because it doesn't let you fully control the jars (the transitive ones, for example), but for the shell/SQL case you also do not want to tune your catalog a lot locally, so a custom distro was my compromise.
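
For illustration only, the custom-distro compromise described above would look roughly like this (paths and versions are hypothetical):

# unpack an official base distribution (hypothetical version/flavor)
tar xzf spark-4.1.0-bin-hadoop3.tgz
# drop the extra runtime jars next to Spark's own jars
cp iceberg-spark-runtime-4.0_2.13-1.10.0.jar spark-4.1.0-bin-hadoop3/jars/
# bake the catalog properties in so users don't have to pass --conf by hand
cat >> spark-4.1.0-bin-hadoop3/conf/spark-defaults.conf <<'EOF'
spark.sql.extensions            org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg       org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type  rest
spark.sql.catalog.iceberg.uri   http://iceberg
EOF
# repackage as the custom distro
tar czf my-spark-distro.tgz spark-4.1.0-bin-hadoop3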

if there is something easier I'm happy to use it instead; the idea was to quickly get feedback on the Iceberg data without having to write an app for that, using the SQL shell.
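
For reference, the closest knob today is spark-submit's --exclude-packages flag, which trims transitive dependencies resolved by --packages; a minimal sketch, with a purely hypothetical exclusion:

# the excluded coordinate is hypothetical, only to show the flag's shape
"$base/spark-sql" \
    --packages org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0 \
    --exclude-packages org.apache.parquet:parquet-column \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions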
