[SPARK-27900][CORE][K8s] Add spark.driver.killOnOOMError flag in cluster mode
#26161
Conversation
@holdenk @erikerlandson pls review.

Kubernetes integration test starting

Kubernetes integration test status success

Test build #112270 has finished for PR 26161 at commit

@erikerlandson can I get a merge? Gentle ping :)

@holdenk gentle ping.

Retest this please.
```scala
// set oom error handling in cluster mode
if (sparkConf.get(KILL_ON_OOM_ERROR) && deployMode == CLUSTER) {
  val driverJavaOptions = sparkConf.getOption(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS)
    .map( _ + " ")
```

nit. `.map( _ + " ")` -> `.map(_ + " ")`
```scala
if (sparkConf.get(KILL_ON_OOM_ERROR) && deployMode == CLUSTER) {
  val driverJavaOptions = sparkConf.getOption(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS)
    .map( _ + " ")
    .getOrElse("") + "-XX:OnOutOfMemoryError=\"kill -9 %p\""
```
Since this is a general PR for SparkSubmit, does this work on Windows?
ExitOnOutOfMemoryError will be a better choice, @skonto .
Maybe, I lost some context since this is the 3rd try for this.
I actually think we should stick w/ OnOutOfMemoryError, because ExitOnOutOfMemoryError was not present till java 8u92 (discussed earlier here: #24796 (comment)). I don't think we specify a minimum version within java 8, so we might have to stick with this.
but yeah, we probably have to make sure it doesn't do anything too weird on windows (does spark actually run in anything other than local mode on windows?)
For Windows, I prefer to use JVM option. Personally, I don't think Apache Spark 3.0.0 will be used on JDK 8u91 or older. Apache Spark 3.0.0 starts a new age of JDK11. 😄
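Setting the JDK question aside, the string handling in the snippet under discussion can be sketched as a standalone helper (the name `withOomKillOption` is invented for this illustration; the actual patch inlines the expression in SparkSubmit):

```scala
// Hypothetical standalone helper mirroring the patch: append the OOM-kill
// JVM option to any pre-existing spark.driver.extraJavaOptions value,
// inserting a separating space only when prior options exist.
def withOomKillOption(existing: Option[String]): String =
  existing.map(_ + " ").getOrElse("") + "-XX:OnOutOfMemoryError=\"kill -9 %p\""
```

For example, `withOomKillOption(Some("-Dfoo=bar"))` yields `-Dfoo=bar -XX:OnOutOfMemoryError="kill -9 %p"`, while `withOomKillOption(None)` yields just the kill option with no leading space.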
```scala
private[spark] val KILL_ON_OOM_ERROR = ConfigBuilder("spark.driver.killOnOOMError")
  .doc("Whether to kill the driver on an oom error in cluster mode.")
  .booleanConf.createWithDefault(true)
```
Please split this into two lines like the other conf.
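Presumably the requested style is something like the following (a sketch only; `ConfigBuilder` is Spark-internal, so this is not standalone code):

```scala
private[spark] val KILL_ON_OOM_ERROR = ConfigBuilder("spark.driver.killOnOOMError")
  .doc("Whether to kill the driver on an oom error in cluster mode.")
  .booleanConf
  .createWithDefault(true)
```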
And this should be false by default to avoid the behavior change.
actually I think we agreed to go ahead and change the default
#25229 (comment)
3.0 is a good chance to do this
and I agree about splitting to two lines to match the style for other confs
Thanks. Got it. In that case, I'm +1 for the true by default.
BTW, @squito . Do you think we need to add a migration guide for this behavior change?
Also, cc @gatorsmile .
Yes we agreed to change the default, we want to fail by default.
```scala
.doc("The amount of memory used per page in bytes")
.bytesConf(ByteUnit.BYTE)
.createOptional
```
Let's not touch the irrelevant place.
```shell
$VERBOSE_FLAG
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
"$@"
```
This file contains orthogonal changes. Please revert this file. You can file a new JIRA issue for this.
@dongjoon-hyun this is required by the tests, so it could be just DEBUG mode, disabled by default (as it is now). It is not meant to be another feature and does no harm. I want to debug the verbose output so tests in K8s can get the java option values set in the driver. Is there another way to trigger this (beyond writing a main that prints them and adding it to Spark examples package)?
```scala
runSparkRemoteCheckAndVerifyCompletion(appArgs = Array(REMOTE_PAGE_RANK_FILE_NAME))
}

test("Run SparkPi without the default exit on OOM error flag", k8sTestTag) {
```
If you are suggesting this PR as a bug fix, you need to add SPARK-27900 prefix to the newly added test case names.
Ok I can do that.
```scala
  "https://storage.googleapis.com/spark-k8s-integration-tests/files/pagerank_data.txt"
val REMOTE_PAGE_RANK_FILE_NAME = "pagerank_data.txt"
}
```
Let's not add this new line.
Ok.
```scala
test("Run SparkPi without the default exit on OOM error flag", k8sTestTag) {
  sparkAppConf
    .set("spark.driver.extraJavaOptions", "-Dspark.test.foo=spark.test.bar")
    .set("spark.kubernetes.driverEnv.DRIVER_VERBOSE", "true")
```
Could you try SPARK_PRINT_LAUNCH_COMMAND instead of new DRIVER_VERBOSE?
Sure.
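For background, `SPARK_PRINT_LAUNCH_COMMAND` is an environment variable the Spark launch scripts already honor: when set, the launcher prints the full `java` command ("Spark Command: ...") before executing it, which is what the test needs in order to see the driver's Java options in the logs. A rough usage sketch (assumes a local Spark distribution):

```shell
# Make the launcher echo the exact JVM command it is about to run,
# so the driver's java options appear in the driver pod logs.
SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-submit --version
```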
dongjoon-hyun left a comment
Hi, @skonto .
I know this is the 3rd try for you. I also had many experiences like this in the community. Anyway, this PR looks better because it now targets the driver narrowly and surgically. I left a few comments. Could you update the PR?
Kubernetes integration test starting

Kubernetes integration test status success

Test build #112624 has finished for PR 26161 at commit
@dongjoon-hyun no problem I will update it.

Ok I will fix this over the weekend, sorry for the delay.

I will update the PR.

Thank you, @skonto !

I am back.

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

@dongjoon-hyun I will revive this unless it is fixed already. I guess I need to create a new PR, correct?

Was this ever fixed?

I don't think so. We manually set kill on OOM in our config.
What changes were proposed in this pull request?
This PR adds a spark.driver.killOnOOMError flag. When the flag is enabled in cluster mode, SparkSubmit appends -XX:OnOutOfMemoryError="kill -9 %p" to the driver's extra Java options, so the driver JVM is killed when it hits an OOM error.
Why are the changes needed?
See SPARK-27900 for details. This also follows the discussion in #24796. Without this change, pods on K8s keep running even though Spark has failed. The current behavior also creates a problem for the Spark Operator (and any other operator), since it cannot detect the failure at the K8s level.
How was this patch tested?
Manually, by launching SparkPi with a very large argument (100000000), which leads to an OOM due to the large number of tasks allocated. There are also two new integration tests for the K8s resource manager. Testing might be needed for the other resource managers.
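For reference, the manual reproduction described above would look roughly like this (the master URL, image jar path, and driver memory setting are assumptions for the sketch; SparkPi and its argument are from the description):

```shell
# Sketch of the manual test: run SparkPi with a huge slice count so the
# driver OOMs while allocating tasks. With the flag on, the driver JVM
# should be killed (and the pod should fail) instead of hanging.
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.driver.memory=1g \
  --conf spark.driver.killOnOOMError=true \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar \
  100000000
```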