[SPARK-34764][CORE][K8S][UI] Propagate reason for exec loss to Web UI #32436
Conversation
…n, but we want to propagate the loss reason in doRemoveExecutor so we can't gate on isExecutorActive
…gate it to the test as well
…ason, update the test to match this change
Thank you for pinging me, @holdenk.
Shuffle Write</span></th>
<th>Logs</th>
<th>Thread Dump</th>
<th>Exec Loss Reason</th>
Is this always empty in non-K8s resource managers?
No exec loss reason is populated for YARN either :)
...ore/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala
Outdated
Show resolved
Hide resolved
case 126 => "(not executable - possibly perm or arch)"
case 137 => "(SIGKILL, possible container OOM)"
case 139 => "(SIGSEGV: that's unexpected)"
case 255 => "(exit-1, your guess is as good as mine)"
I expected we would have an error code for Evicted due to running out of disk during worker decommission. What is the error code for that, @holdenk?
So I think it's going to be inconsistent depending on how exactly it shows up (e.g. does the JVM have an uncaught exception trying to write a file or do we exceed the resource quota). So for now I don't have a clear exit code to map to it unfortunately. I could try and add a base handler for IO errors that are uncaught to exit with a specific code, but I'd rather do that in a separate PR.
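For illustration only, the exit-code annotation quoted in the diff above could be sketched as a simple lookup. This is shown here in JavaScript for brevity — the actual change lives on the Scala side in ExecutorPodsLifecycleManager — and the function name and fallback behavior are assumptions, not the real implementation:

```javascript
// Hypothetical sketch: map well-known container exit codes to a short
// human-readable annotation, falling back to an empty string for codes
// that have no mapping. Codes and messages mirror the diff quoted above.
function describeExitCode(code) {
  const reasons = {
    126: "(not executable - possibly perm or arch)",
    137: "(SIGKILL, possible container OOM)",
    139: "(SIGSEGV: that's unexpected)",
    255: "(exit-1, your guess is as good as mine)",
  };
  return reasons[code] || "";
}
```

As the discussion notes, a disk-eviction case doesn't map cleanly to any single exit code today, which is why the fallback branch stays empty.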
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138145 has finished for PR 32436 at commit
Can you suggest an event log we can use for checking the UI? If there is none checked in already that is good enough (no interesting executor losses in the event logs), then we should consider adding a new one. This would be helpful not only for the current reviewers but also for future developers of this area, who could use it to check (at least by eye) whether their new changes have broken this feature or not. WDYT?
…ark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala Co-authored-by: Reid Mewborne <[email protected]>
So if you want, it's pretty easy to trigger OOMs with the GroupByKey example (it's how I did the manual test). I'm not sure about event logs though; do we have those committed somewhere for folks to use in debugging with the history server?
We are storing event logs for unit tests, here is the directory: https://github.com/apache/spark/tree/master/core/src/test/resources/spark-events
Test build #138178 has finished for PR 32436 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138390 has finished for PR 32436 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138401 has finished for PR 32436 at commit
BryanCutler left a comment: Very nice to have this, thanks @holdenk, LGTM
K8s failure is unrelated.
Going to merge now, thanks everyone for the review and feedback. We can iterate on adding more exit codes in future PRs if we see any common questions on the user@ list :)
Merged to the current dev branch :) Since it's a new feature, I'm not planning on backporting.
return threadDumpEnabled;
}

function formatLossReason(removeReason, type, row) {
Eh, uhoh. Seems like the JavaScript linter is broken by this:
https://github.com/apache/spark/runs/2579648547
added 118 packages in 1.482s
/__w/spark/spark/core/src/main/resources/org/apache/spark/ui/static/executorspage.js
34:41 error 'type' is defined but never used. Allowed unused args must match /^_ignored_.*/u no-unused-vars
34:47 error 'row' is defined but never used. Allowed unused args must match /^_ignored_.*/u no-unused-vars
35:1 error Expected indentation of 2 spaces but found 4 indent
36:1 error Expected indentation of 4 spaces but found 7 indent
37:1 error Expected indentation of 2 spaces but found 4 indent
38:1 error Expected indentation of 4 spaces but found 7 indent
39:1 error Expected indentation of 2 spaces but found 4 indent
556:1 error Expected indentation of 14 spaces but found 16 indent
557:1 error Expected indentation of 14 spaces but found 16 indent
Mind taking a look please?
I just made a quick followup to fix up the build: #32541
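For reference, a minimal sketch of what a lint-clean version of the render callback might look like. The function body here is hypothetical — the followup's real change is renaming the unused DataTables callback arguments so they match the linter's /^_ignored_.*/ pattern and normalizing indentation to 2 spaces:

```javascript
// Hypothetical sketch of a lint-clean DataTables render callback.
// The unused `type` and `row` arguments get the `_ignored_` prefix the
// no-unused-vars rule allows; the body is illustrative only, not the
// actual executorspage.js implementation.
function formatLossReason(removeReason, _ignored_type, _ignored_row) {
  if (removeReason) {
    return removeReason;
  }
  return "";
}
```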
…r JavaScript linter

### What changes were proposed in this pull request?

This PR is a followup of #32436 which broke JavaScript linter. There was a logical conflict - the linter was added after the last successful test run in that PR.

```
added 118 packages in 1.482s
/__w/spark/spark/core/src/main/resources/org/apache/spark/ui/static/executorspage.js
34:41 error 'type' is defined but never used. Allowed unused args must match /^_ignored_.*/u no-unused-vars
34:47 error 'row' is defined but never used. Allowed unused args must match /^_ignored_.*/u no-unused-vars
35:1 error Expected indentation of 2 spaces but found 4 indent
36:1 error Expected indentation of 4 spaces but found 7 indent
37:1 error Expected indentation of 2 spaces but found 4 indent
38:1 error Expected indentation of 4 spaces but found 7 indent
39:1 error Expected indentation of 2 spaces but found 4 indent
556:1 error Expected indentation of 14 spaces but found 16 indent
557:1 error Expected indentation of 14 spaces but found 16 indent
```

### Why are the changes needed?

To recover the build

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested:

```bash
./dev/lint-js
lint-js checks passed.
```

Closes #32541 from HyukjinKwon/SPARK-34764-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
Adds the exec loss reason to the Spark web UI & in doing so also fix the Kube integration to pass exec loss reason into core.

UI change:



Debugging Spark jobs is *hard*, making it clearer why executors have exited could help.

Yes a new column on the executor page.

K8s unit test updated to validate exec loss reasons are passed through regardless of exec alive state, manual testing to validate the UI.

Closes apache#32436 from holdenk/SPARK-34764-propegate-reason-for-exec-loss.

Lead-authored-by: Holden Karau <[email protected]>
Co-authored-by: Holden Karau <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
What changes were proposed in this pull request?
Adds the exec loss reason to the Spark web UI and, in doing so, also fixes the Kube integration to pass the exec loss reason into core.
UI change:
Why are the changes needed?
Debugging Spark jobs is hard; making it clearer why executors have exited could help.
Does this PR introduce any user-facing change?
Yes, a new column on the executor page.
How was this patch tested?
K8s unit test updated to validate exec loss reasons are passed through regardless of exec alive state; manual testing to validate the UI.