
Commit fcefdec

Moved submitting apps to separate doc
1 parent 61d72b4 commit fcefdec

File tree

1 file changed: +153 -0 lines changed

docs/submitting-applications.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
---
layout: global
title: Submitting Applications
---

The `spark-submit` script in Spark's `bin` directory is used to launch applications on a cluster.
It can use all of Spark's supported [cluster managers](cluster-overview.html#cluster-manager-types)
through a uniform interface so you don't have to configure your application specially for each one.

# Bundling Your Application's Dependencies
If your code depends on other projects, you will need to package them alongside
your application in order to distribute the code to a Spark cluster. To do this,
create an assembly jar (or "uber" jar) containing your code and its dependencies. Both
[sbt](https://github.com/sbt/sbt-assembly) and
[Maven](http://maven.apache.org/plugins/maven-shade-plugin/)
have assembly plugins. When creating assembly jars, list Spark and Hadoop
as `provided` dependencies; these need not be bundled since they are provided by
the cluster manager at runtime. Once you have an assembled jar you can call the `bin/spark-submit`
script as shown here while passing your jar.

For Python, you can use the `--py-files` argument of `spark-submit` to add `.py`, `.zip` or `.egg`
files to be distributed with your application. If you depend on multiple Python files we recommend
packaging them into a `.zip` or `.egg`.

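As a rough sketch of the jar workflow described above (the project name, main class, and output path
below are hypothetical and depend on your build configuration), you might build and submit an
assembly like this:

{% highlight bash %}
# Build an assembly jar with the sbt-assembly plugin (hypothetical project layout)
(cd /path/to/my-spark-app && sbt assembly)

# Submit the assembled jar from the Spark directory; the jar path is an example
# of where sbt typically writes the assembly for a Scala 2.10 build
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  /path/to/my-spark-app/target/scala-2.10/my-spark-app-assembly-1.0.jar
{% endhighlight %}
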
# Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the `bin/spark-submit` script.
This script takes care of setting up the classpath with Spark and its
dependencies, and can work with all of the cluster managers and deploy modes that Spark supports:

{% highlight bash %}
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]
{% endhighlight %}

Some of the commonly used options are:

* `--class`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
* `--master`: The [master URL](#master-urls) for the cluster (e.g. `spark://23.195.26.187:7077`)
* `--deploy-mode`: Whether to deploy your driver program within the cluster or run it locally as an external client (either `cluster` or `client`)
* `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
* `application-arguments`: Arguments passed to the main method of your main class, if any

For Python applications, simply pass a `.py` file in the place of `<application-jar>` instead of a JAR,
and add Python `.zip`, `.egg` or `.py` files to the search path with `--py-files`.

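For instance, a minimal sketch (the file and module names here are hypothetical) that packages helper
modules into a zip and ships them alongside the main script:

{% highlight bash %}
# Bundle helper modules (hypothetical names) so they are available on the executors
zip -r deps.zip mylib/ utils.py

./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  --py-files deps.zip \
  my_script.py \
  arg1 arg2
{% endhighlight %}
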
To enumerate all options available to `spark-submit`, run it with `--help`. Here are a few
examples of common options:

{% highlight bash %}
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
# (use `yarn-client` instead of `yarn-cluster` for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
{% endhighlight %}

# Master URLs

The master URL passed to Spark can be in one of the following formats:

<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine. </td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
The port must be whichever one your master is configured to use, which is 5050 by default.
Or, for a Mesos cluster using ZooKeeper, use <code>mesos://zk://...</code>.
</td></tr>
<tr><td> yarn-client </td><td> Connect to a <a href="running-on-yarn.html">YARN</a> cluster in
client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
</td></tr>
<tr><td> yarn-cluster </td><td> Connect to a <a href="running-on-yarn.html">YARN</a> cluster in
cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
</td></tr>
</table>

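As an illustration of a couple of formats from the table (the ZooKeeper hosts below are placeholders),
the same application can be pointed at different masters purely by changing this one flag:

{% highlight bash %}
# Local mode, using as many worker threads as logical cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[*] \
  /path/to/examples.jar \
  100

# Mesos cluster whose masters are coordinated through ZooKeeper (placeholder hosts)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  /path/to/examples.jar \
  100
{% endhighlight %}
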
# Loading Configuration from a File

The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
properties file and pass them on to your application. By default it will read options
from `conf/spark-defaults.conf` in the Spark directory. For more detail, see the section on
[loading default configurations](configuration.html#loading-default-configurations).

Loading default Spark configurations this way can obviate the need for certain flags to
`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
defaults file.

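As a minimal sketch of this behavior (the master URL and memory value are only examples), putting
`spark.master` in the defaults file lets you drop `--master` from the command line:

{% highlight bash %}
# Example defaults file; keys and values are whitespace-separated
cat > conf/spark-defaults.conf <<'EOF'
spark.master            spark://207.184.161.138:7077
spark.executor.memory   4g
EOF

# No --master flag needed; spark-submit reads it from conf/spark-defaults.conf
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  /path/to/examples.jar \
  1000
{% endhighlight %}
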
If you are ever unclear where configuration options are coming from, you can print out fine-grained
debugging information by running `spark-submit` with the `--verbose` option.

# Advanced Dependency Management
When using `spark-submit`, the application jar along with any jars included with the `--jars` option
will be automatically transferred to the cluster. Spark uses the following URL schemes to allow
different strategies for disseminating jars:

- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
  every executor pulls the file from the driver HTTP server.
- **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected.
- **local:** - a URI starting with `local:/` is expected to exist as a local file on each worker node. This
  means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
  or shared via NFS, GlusterFS, etc.

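For illustration, a hedged sketch of mixing these schemes with `--jars` (the extra jar paths and the
HDFS namenode address are hypothetical):

{% highlight bash %}
# One jar is shipped from the driver's file server, one is expected to already
# exist on every worker under /opt/libs, and one is fetched from HDFS
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --jars file:/path/to/extra-lib.jar,local:/opt/libs/shared-lib.jar,hdfs://namenode:8020/libs/other-lib.jar \
  /path/to/examples.jar \
  1000
{% endhighlight %}
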
Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.

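One possible way to enable this for standalone workers, sketched under the assumption that worker-side
properties are passed through `SPARK_WORKER_OPTS` in `conf/spark-env.sh` (the TTL value is only an example):

{% highlight bash %}
# In conf/spark-env.sh on each standalone worker (assumed mechanism; values are examples):
# keep application data for 7 days (604800 seconds) before it becomes eligible for cleanup
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"
{% endhighlight %}
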
For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries
to executors.

# More Information

Once you have deployed your application, the [cluster mode overview](cluster-overview.html) describes
the components involved in distributed execution, and how to monitor and debug applications.
