docs/configuration.md (39 additions, 26 deletions)
@@ -3,15 +3,8 @@ layout: global
 title: Spark Configuration
 ---
 
-Spark provides three locations to configure the system:
-
-* [Spark properties](#spark-properties) control most application parameters and can be set by
-  passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext,
-  or through the `conf/spark-defaults.conf` properties file.
-* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
-  the IP address, through the `conf/spark-env.sh` script on each node.
-* [Logging](#configuring-logging) can be configured through `log4j.properties`.
-
+* This will become a table of contents (this text will be scraped).
+{:toc}
 
 # Spark Properties
 
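As a side note on the bullet list removed above: the SparkConf route it describes can be illustrated with a minimal Scala sketch. The master URL, application name, and memory value below are placeholders, not values taken from this diff.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Properties set here take effect for this application only; the same
// key/value pairs could instead live in conf/spark-defaults.conf.
val conf = new SparkConf()
  .setMaster("local[2]")                // placeholder master URL
  .setAppName("ConfigurationExample")   // placeholder application name
  .set("spark.executor.memory", "1g")   // a JVM memory string, e.g. 512m or 2g

val sc = new SparkContext(conf)
```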
@@ -149,7 +142,8 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.executor.memory</code></td>
   <td>512m</td>
   <td>
-    Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>).
+    Amount of memory to use per executor process, in the same format as JVM memory strings
+    (e.g. <code>512m</code>, <code>2g</code>).
   </td>
 </tr>
 <tr>
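For illustration, a hedged sketch of overriding this default for a single application; the 2g figure is arbitrary, and the `getOption` call is only there to show that unset keys return nothing.

```scala
import org.apache.spark.SparkConf

// Override the 512m default shown in the table for one application.
// The equivalent conf/spark-defaults.conf line would be: spark.executor.memory 2g
val conf = new SparkConf().setAppName("MemoryExample")
conf.set("spark.executor.memory", "2g")

// getOption returns None for keys that were never set anywhere.
println(conf.getOption("spark.executor.memory"))   // prints Some(2g)
```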
@@ -422,7 +416,8 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.files.overwrite</code></td>
   <td>false</td>
   <td>
-    Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source.
+    Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not
+    match those of the source.
   </td>
 </tr>
 <tr>
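To make the `addFile()` behaviour the description refers to concrete, here is a small sketch; the file path is hypothetical and the property value is just an example.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("AddFileExample")
  .set("spark.files.overwrite", "true")   // example value; default is false

val sc = new SparkContext(conf)

// Ship a side file to every node; /tmp/lookup.txt is a hypothetical path.
sc.addFile("/tmp/lookup.txt")

// On executors, the distributed copy is located through SparkFiles.
val localPath = SparkFiles.get("lookup.txt")
```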
@@ -446,8 +441,9 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.tachyonStore.baseDir</code></td>
   <td>System.getProperty("java.io.tmpdir")</td>
   <td>
-    Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by <code>spark.tachyonStore.url</code>.
-    It can also be a comma-separated list of multiple directories on Tachyon file system.
+    Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by
+    <code>spark.tachyonStore.url</code>. It can also be a comma-separated list of multiple directories
+    on Tachyon file system.
   </td>
 </tr>
 <tr>
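As a hedged illustration of how these two properties are typically used together, assuming the `OFF_HEAP` storage level that the Tachyon store backs; the Tachyon URL and directory below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("TachyonStoreExample")
  .set("spark.tachyonStore.url", "tachyon://localhost:19998")   // placeholder URL
  .set("spark.tachyonStore.baseDir", "/tmp/spark_tachyon")      // placeholder directory

val sc = new SparkContext(conf)
val data = sc.parallelize(1 to 1000)

// RDD blocks persisted off-heap are written under spark.tachyonStore.baseDir
// on the Tachyon file system named by spark.tachyonStore.url.
data.persist(StorageLevel.OFF_HEAP)
println(data.count())
```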
@@ -504,21 +500,33 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.akka.heartbeat.pauses</code></td>
   <td>600</td>
   <td>
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold` if you need to.
+    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you
+    plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to
+    control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and
+    `spark.akka.failure-detector.threshold` if you need to.
 [...]
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to.
+    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you
+    plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`.
+    Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to.
 [...]
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those.
+    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you
+    plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a
+    smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination
+    of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use
+    case for using failure detector can be, a sensistive failure detector can help evict rogue executors really
+    quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster.
+    Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the
+    network with those.
   </td>
 </tr>
 </table>
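Since the three descriptions above keep cross-referencing each other, a sketch of setting them together may help; the threshold and interval values are placeholders, not recommended settings.

```scala
import org.apache.spark.SparkConf

// Tuning the related Akka settings together, as the descriptions suggest.
// All three numbers are placeholders chosen only for illustration.
val conf = new SparkConf()
  .set("spark.akka.heartbeat.pauses", "600")             // acceptable heartbeat pause, in seconds
  .set("spark.akka.failure-detector.threshold", "300.0") // maps to akka.remote.transport-failure-detector.threshold
  .set("spark.akka.heartbeat.interval", "1000")          // larger interval means less heartbeat traffic
```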
@@ -578,7 +586,8 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.speculation</code></td>
   <td>false</td>
   <td>
-    If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
+    If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a
+    stage, they will be re-launched.
   </td>
 </tr>
 <tr>
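A minimal sketch of turning this on for one application; nothing here is specific to the diff above beyond the property name.

```scala
import org.apache.spark.SparkConf

// With speculation enabled, stragglers in a stage may be re-launched
// on other executors, and the first copy to finish wins.
val conf = new SparkConf()
  .setAppName("SpeculationExample")
  .set("spark.speculation", "true")
```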
@@ -739,13 +748,13 @@ Apart from these, the following properties are also available, and may be useful
 
 # Environment Variables
 
-Certain Spark settings can be configured through environment variables, which are read from the `conf/spark-env.sh`
-script in the directory where Spark is installed (or `conf/spark-env.cmd` on Windows). In Standalone and Mesos modes,
-this file can give machine specific information such as hostnames. It is also sourced when running local
-Spark applications or submission scripts.
+Certain Spark settings can be configured through environment variables, which are read from the
+`conf/spark-env.sh` script in the directory where Spark is installed (or `conf/spark-env.cmd` on
+Windows). In Standalone and Mesos modes, this file can give machine specific information such as
+hostnames. It is also sourced when running local Spark applications or submission scripts.
 
-Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy
-`conf/spark-env.sh.template` to create it. Make sure you make the copy executable.
+Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can
+copy `conf/spark-env.sh.template` to create it. Make sure you make the copy executable.
 
 The following variables can be set in `spark-env.sh`:
 
@@ -770,12 +779,16 @@ The following variables can be set in `spark-env.sh`:
 </tr>
 </table>
 
-In addition to the above, there are also options for setting up the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each machine and maximum memory.
+In addition to the above, there are also options for setting up the Spark
+[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each
+machine and maximum memory.
 
 Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
 compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
 
 # Configuring Logging
 
-Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties`
-file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
+Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a
+`log4j.properties` file in the `conf` directory. One way to start is to copy the existing
 [...]
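The `SPARK_LOCAL_IP` remark above is about shell, but the same lookup can be sketched in Scala for readers who prefer it; the interface name `eth0` is an assumption.

```scala
import java.net.{Inet4Address, NetworkInterface}
import scala.collection.JavaConverters._

// Find the first IPv4 address of a given interface, mirroring the
// "look up the IP of a specific network interface" idea from the docs.
val ip = Option(NetworkInterface.getByName("eth0"))   // "eth0" is an assumed interface name
  .toSeq
  .flatMap(_.getInetAddresses.asScala)
  .collectFirst { case a: Inet4Address if !a.isLoopbackAddress => a.getHostAddress }

// In spark-env.sh the resulting value would be exported as SPARK_LOCAL_IP.
println(ip.getOrElse("no IPv4 address found"))
```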
docs/spark-standalone.md (88 additions, 0 deletions)
@@ -286,6 +286,94 @@ In addition, detailed log output for each job is also written to the work direct
 You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use a hdfs:// URL (typically `hdfs://<namenode>:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).
 
 
+# Configuring Ports for Network Security
+
+Spark makes heavy use of the network, and some environments have strict requirements for using tight
+firewall settings. Below are the primary ports that Spark uses for its communication and how to
 [...]
+    <td>Jetty-based. Each of these services starts on a random port that cannot be configured</td>
+  </tr>
+
+</table>
+
 # High Availability
 
 By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below.
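As a hedged companion to the new ports section, here is a sketch of pinning the driver-side ports that are configurable. The property names `spark.driver.port` and `spark.ui.port` are standard Spark settings but do not appear in the extract above, and the port numbers are arbitrary examples; the Jetty-based services mentioned in the table stay on random ports either way.

```scala
import org.apache.spark.SparkConf

// Pin the configurable driver-side ports so firewall rules can allow them.
// Port numbers are arbitrary examples; pick ones open in your environment.
val conf = new SparkConf()
  .setAppName("FirewalledApp")
  .set("spark.driver.port", "51000")   // port the driver listens on for executor traffic
  .set("spark.ui.port", "4040")        // web UI port (4040 is the usual default)
```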