---
layout: global
title: Submitting Applications
---

The `spark-submit` script in Spark's `bin` directory is used to launch applications on a cluster.
It can use all of Spark's supported [cluster managers](cluster-overview.html#cluster-manager-types)
through a uniform interface so you don't have to configure your application specially for each one.

# Bundling Your Application's Dependencies
If your code depends on other projects, you will need to package them alongside
your application in order to distribute the code to a Spark cluster. To do this,
create an assembly jar (or "uber" jar) containing your code and its dependencies. Both
[sbt](https://github.com/sbt/sbt-assembly) and
[Maven](http://maven.apache.org/plugins/maven-shade-plugin/)
have assembly plugins. When creating assembly jars, list Spark and Hadoop
as `provided` dependencies; these need not be bundled since they are provided by
the cluster manager at runtime. Once you have an assembled jar you can call the `bin/spark-submit`
script as shown here while passing your jar.
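
For example, a minimal sketch of this workflow with sbt might look like the following; the main class name and the output jar path are placeholders that depend on your own build configuration.

{% highlight bash %}
# Build an assembly jar with the sbt-assembly plugin; Maven users would run
# `mvn package` with the shade plugin configured instead.
sbt assembly

# Submit the resulting jar (class name and jar path below are placeholders)
./bin/spark-submit \
  --class com.example.MyApp \
  --master local[4] \
  target/scala-2.10/my-app-assembly-1.0.jar
{% endhighlight %}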

For Python, you can use the `--py-files` argument of `spark-submit` to add `.py`, `.zip` or `.egg`
files to be distributed with your application. If you depend on multiple Python files we recommend
packaging them into a `.zip` or `.egg`.
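
As a rough sketch (the package and file names below are placeholders), packaging and shipping Python dependencies might look like this:

{% highlight bash %}
# Bundle local Python packages into a single zip (names are illustrative)
zip -r deps.zip mypackage/ utils.py

# Ship the bundle alongside the application with --py-files
./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_script.py
{% endhighlight %}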

# Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the `bin/spark-submit` script.
This script takes care of setting up the classpath with Spark and its
dependencies, and supports the different cluster managers and deploy modes that Spark offers:

{% highlight bash %}
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]
{% endhighlight %}

Some of the commonly used options are:

* `--class`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
* `--master`: The [master URL](#master-urls) for the cluster (e.g. `spark://23.195.26.187:7077`)
* `--deploy-mode`: Whether to deploy your driver program within the cluster or run it locally as an external client (either `cluster` or `client`)
* `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
* `application-arguments`: Arguments passed to the main method of your main class, if any

For Python applications, simply pass a `.py` file in place of `<application-jar>`,
and add Python `.zip`, `.egg` or `.py` files to the search path with `--py-files`.

To enumerate all options available to `spark-submit`, run it with `--help`. Here are a few
examples of common options:

{% highlight bash %}
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster in cluster mode
# (use --master yarn-client instead to run the driver in client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
{% endhighlight %}

# Master URLs

The master URL passed to Spark can be in one of the following formats:

<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
  cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
  The port must be whichever one your cluster is configured to use, which is 5050 by default.
  Or, for a Mesos cluster using ZooKeeper, use <code>mesos://zk://...</code>.
</td></tr>
<tr><td> yarn-client </td><td> Connect to a <a href="running-on-yarn.html">YARN</a> cluster in
client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
</td></tr>
<tr><td> yarn-cluster </td><td> Connect to a <a href="running-on-yarn.html">YARN</a> cluster in
cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
</td></tr>
</table>
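
As a brief illustration (the class name, jar path and ZooKeeper hosts below are placeholders), a master URL is passed directly to `--master`:

{% highlight bash %}
# Use as many worker threads as logical cores on the local machine
./bin/spark-submit --master local[*] --class com.example.MyApp /path/to/my-app.jar

# Connect to a Mesos cluster whose masters are coordinated through ZooKeeper
./bin/spark-submit --master mesos://zk://zk1:2181,zk2:2181/mesos --class com.example.MyApp /path/to/my-app.jar
{% endhighlight %}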

# Loading Configuration from a File

The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
properties file and pass them on to your application. By default it will read options
from `conf/spark-defaults.conf` in the Spark directory. For more detail, see the section on
[loading default configurations](configuration.html#loading-default-configurations).

Loading default Spark configurations this way can obviate the need for certain flags to
`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
defaults file.
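
For example (the property values below are illustrative), a `conf/spark-defaults.conf` like the following lets you drop the corresponding flags:

{% highlight bash %}
# conf/spark-defaults.conf (values are illustrative):
#   spark.master            spark://207.184.161.138:7077
#   spark.executor.memory   4g

# With spark.master set in the defaults file, --master can be omitted:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  /path/to/examples.jar \
  100
{% endhighlight %}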

If you are ever unclear where configuration options are coming from, you can print out fine-grained
debugging information by running `spark-submit` with the `--verbose` option.

# Advanced Dependency Management
When using `spark-submit`, the application jar along with any jars included with the `--jars` option
will be automatically transferred to the cluster. Spark uses the following URL schemes to allow
different strategies for disseminating jars, as illustrated in the sketch after this list:

- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
  every executor pulls the file from the driver HTTP server.
- **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected.
- **local:** - a URI starting with `local:/` is expected to exist as a local file on each worker node. This
  means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
  or shared via NFS, GlusterFS, etc.
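
A sketch of mixing these schemes (the class name, jar names, paths and the HDFS host below are placeholders):

{% highlight bash %}
# Jar names, paths and the HDFS host below are placeholders.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --jars local:/opt/libs/big-native-dep.jar,hdfs://namenode:8020/libs/shared-dep.jar \
  /path/to/my-app.jar
{% endhighlight %}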

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.
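
On standalone workers, one way to turn this on (the TTL value below is only an example, in seconds; see the standalone documentation for the `spark.worker.cleanup.*` properties) is through `SPARK_WORKER_OPTS` in `conf/spark-env.sh`:

{% highlight bash %}
# conf/spark-env.sh on each standalone worker; the TTL (in seconds) is illustrative
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"
{% endhighlight %}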

For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries
to executors.

# More Information

Once you have deployed your application, the [cluster mode overview](cluster-overview.html) describes
the components involved in distributed execution, and how to monitor and debug applications.