
Conversation

@skonto
Contributor

@skonto skonto commented Oct 18, 2019

What changes were proposed in this pull request?

  • Adds a flag to make the driver exit in case of an OOM error in cluster mode (enabled by default); see the sketch after this list.
  • Adds integration tests for the K8s resource manager.
  • Adds verbose flag support within the driver's container.
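In effect, when the flag is on in cluster mode, SparkSubmit appends the OOM handler to the driver's extra Java options. A simplified, standalone sketch of that composition (the -Dfoo=bar value is only a hypothetical pre-existing spark.driver.extraJavaOptions setting):

// Stand-in for whatever the user already set in spark.driver.extraJavaOptions.
val existing: Option[String] = Some("-Dfoo=bar")
// With the new flag enabled in cluster mode, the OOM handler is appended:
val driverJavaOptions =
  existing.map(_ + " ").getOrElse("") + "-XX:OnOutOfMemoryError=\"kill -9 %p\""
// Resulting value: -Dfoo=bar -XX:OnOutOfMemoryError="kill -9 %p"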

Why are the changes needed?

See SPARK-27900 for details; this also follows the discussion in #24796. Without this change, pods on K8s keep running even though Spark has failed. The current behavior is also a problem for the Spark Operator and any other operator, since a failure cannot be detected at the K8s level.

How was this patch tested?

Manually, by launching SparkPi with a large argument (100000000), which leads to an OOM due to the large number of tasks allocated.

$ kubectl logs spark-pi-driver -n spark
19/10/18 09:47:02 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
19/10/18 09:47:02 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.6:33435 with 413.9 MiB RAM, BlockManagerId(2, 172.17.0.6, 33435, None)
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 14"...

There are also two integration tests for the K8s resource manager. Testing might be needed for the other resource managers.

@skonto
Contributor Author

skonto commented Oct 18, 2019

@holdenk @erikerlandson pls review.

@SparkQA

SparkQA commented Oct 18, 2019

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17254/

@SparkQA

SparkQA commented Oct 18, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17254/

@SparkQA

SparkQA commented Oct 18, 2019

Test build #112270 has finished for PR 26161 at commit 5505083.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto
Contributor Author

skonto commented Oct 21, 2019

@erikerlandson can I get a merge? Gentle ping :)

@skonto
Contributor Author

skonto commented Oct 23, 2019

@holdenk gentle ping.

@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27900][K8s] Add jvm oom flag in cluster mode [SPARK-27900][K8s] Add spark.driver.killOnOOMError flag in cluster mode Oct 24, 2019
// set oom error handling in cluster mode
if (sparkConf.get(KILL_ON_OOM_ERROR) && deployMode == CLUSTER) {
  val driverJavaOptions = sparkConf.getOption(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS)
    .map( _ + " ")
Member

nit. .map( _ + " ") -> .map(_ + " ")

if (sparkConf.get(KILL_ON_OOM_ERROR) && deployMode == CLUSTER) {
  val driverJavaOptions = sparkConf.getOption(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS)
    .map( _ + " ")
    .getOrElse("") + "-XX:OnOutOfMemoryError=\"kill -9 %p\""
Member

Since this is a general PR for SparkSubmit, does this work on Windows?
ExitOnOutOfMemoryError would be a better choice, @skonto.

Member

Maybe I lost some context, since this is the 3rd try at this.

Contributor

I actually think we should stick with OnOutOfMemoryError, because ExitOnOutOfMemoryError was not present until Java 8u92 (discussed earlier here: #24796 (comment)). I don't think we specify a minimum version within Java 8, so we might have to stick with this.

But yeah, we probably have to make sure it doesn't do anything too weird on Windows (does Spark actually run in anything other than local mode on Windows?).

Member

For Windows, I prefer to use the JVM option. Personally, I don't think Apache Spark 3.0.0 will be used on JDK 8u91 or older. Apache Spark 3.0.0 starts a new age of JDK 11. 😄
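For concreteness, the two alternatives discussed in this thread would show up in the driver's extra Java options as one of the following strings (a sketch only; which one the PR settles on is the open question above):

// Form used in this PR's diff; it spawns /bin/sh, so it will not work on Windows.
val onOomError = "-XX:OnOutOfMemoryError=\"kill -9 %p\""
// Built-in JVM exit flag, available since JDK 8u92; no shell command involved.
val exitOnOomError = "-XX:+ExitOnOutOfMemoryError"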


private[spark] val KILL_ON_OOM_ERROR = ConfigBuilder("spark.driver.killOnOOMError")
  .doc("Whether to kill the driver on an oom error in cluster mode.")
  .booleanConf.createWithDefault(true)
Member

Please split this into two lines like the other confs.
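For reference, the suggested split would look roughly like this inside the existing config object (a sketch based on the quoted snippet, not the final code):

private[spark] val KILL_ON_OOM_ERROR = ConfigBuilder("spark.driver.killOnOOMError")
  .doc("Whether to kill the driver on an OOM error in cluster mode.")
  .booleanConf
  .createWithDefault(true)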

Member

And this should be false by default to avoid the behavior change.

Contributor

Actually, I think we agreed to go ahead and change the default:
#25229 (comment)

3.0 is a good chance to do this.

And I agree about splitting it into two lines to match the style of the other confs.

Member

Thanks, got it. In that case, I'm +1 for true by default.

BTW, @squito, do you think we need to add a migration guide for this behavior change?

Also, cc @gatorsmile.

Contributor Author

Yes, we agreed to change the default; we want to fail by default.

  .doc("The amount of memory used per page in bytes")
  .bytesConf(ByteUnit.BYTE)
  .createOptional

Member

@dongjoon-hyun dongjoon-hyun Oct 24, 2019

Let's not touch the irrelevant place.

$VERBOSE_FLAG
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
"$@"
Member

@dongjoon-hyun dongjoon-hyun Oct 24, 2019

This file contains orthogonal changes. Please revert this file; you can file a new JIRA issue for this.

Contributor Author

@skonto skonto Oct 27, 2019

@dongjoon-hyun this is required by the tests, so it could just be a DEBUG mode, disabled by default (as it is now). It is not meant to be another feature and does no harm. I want the verbose output so the K8s tests can check the Java option values set in the driver. Is there another way to trigger this (beyond writing a main that prints them and adding it to the Spark examples package)?

runSparkRemoteCheckAndVerifyCompletion(appArgs = Array(REMOTE_PAGE_RANK_FILE_NAME))
}

test("Run SparkPi without the default exit on OOM error flag", k8sTestTag) {
Member

If you are proposing this PR as a bug fix, you need to add the SPARK-27900 prefix to the newly added test case names.
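That is, the new test declarations would read something like this (a sketch of the rename only):

test("SPARK-27900: Run SparkPi without the default exit on OOM error flag", k8sTestTag) {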

Contributor Author

Ok I can do that.

"https://storage.googleapis.com/spark-k8s-integration-tests/files/pagerank_data.txt"
val REMOTE_PAGE_RANK_FILE_NAME = "pagerank_data.txt"
}

Member

Let's not add this new line.

Contributor Author

Ok.

test("Run SparkPi without the default exit on OOM error flag", k8sTestTag) {
  sparkAppConf
    .set("spark.driver.extraJavaOptions", "-Dspark.test.foo=spark.test.bar")
    .set("spark.kubernetes.driverEnv.DRIVER_VERBOSE", "true")
Member

@dongjoon-hyun dongjoon-hyun Oct 24, 2019

Could you try SPARK_PRINT_LAUNCH_COMMAND instead of the new DRIVER_VERBOSE?
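A sketch of what that swap could look like in the quoted test setup, assuming the spark.kubernetes.driverEnv. prefix forwards the variable into the driver container just as it does for DRIVER_VERBOSE:

  .set("spark.kubernetes.driverEnv.SPARK_PRINT_LAUNCH_COMMAND", "true")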

Contributor Author

Sure.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @skonto.
I know this is the 3rd try for you; I have also had many experiences like this in the community. Anyway, this PR looks better because it is now narrowly and surgically aimed at the driver. I left a few comments. Could you update the PR?

@SparkQA

SparkQA commented Oct 24, 2019

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17589/

@SparkQA

SparkQA commented Oct 24, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17589/

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112624 has finished for PR 26161 at commit 5505083.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27900][K8s] Add spark.driver.killOnOOMError flag in cluster mode [SPARK-27900][CORE][K8s] Add spark.driver.killOnOOMError flag in cluster mode Oct 25, 2019
@skonto
Contributor Author

skonto commented Oct 27, 2019

@dongjoon-hyun no problem I will update it.

@skonto
Contributor Author

skonto commented Nov 8, 2019

Ok I will fix this over the weekend, sorry for the delay.

@skonto
Contributor Author

skonto commented Apr 13, 2020

I will update the PR.

@dongjoon-hyun
Member

Thank you, @skonto !

@skonto
Contributor Author

skonto commented Apr 14, 2020

I am back.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 24, 2020
@github-actions github-actions bot closed this Jul 25, 2020
@skonto
Contributor Author

skonto commented Nov 23, 2020

@dongjoon-hyun I will revive this unless it has been fixed already. I guess I need to create a new PR, correct?

@dimon222

Was this ever fixed?

@holdenk
Contributor

holdenk commented May 13, 2024

I don't think so. We manually set kill on OOM in our config.
