
Conversation

@holdenk (Contributor) commented May 4, 2021

What changes were proposed in this pull request?

Adds the exec loss reason to the Spark web UI and, in doing so, also fixes the Kube integration to pass the exec loss reason into core.

UI change:

![image](https://user-images.githubusercontent.com/59893/117045762-b975ba80-acc4-11eb-9679-8edab3cfadc2.png)

Why are the changes needed?

Debugging Spark jobs is hard; making it clearer why executors have exited could help.

Does this PR introduce any user-facing change?

Yes, a new column on the Executors page.

How was this patch tested?

K8s unit test updated to validate that exec loss reasons are passed through regardless of executor alive state; manual testing to validate the UI.
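
For context, here is a minimal, illustrative sketch of the kind of information being propagated from the K8s layer into core (a stand-in, not the PR's actual code; the class and helper names below are hypothetical):

```scala
// Hypothetical stand-in for the loss-reason information this PR threads from the
// K8s layer into core so the Executors page can display it. Spark's real internal
// classes differ; this only illustrates the shape of the data flow.
final case class ExecLossReason(exitCode: Int, exitCausedByApp: Boolean, message: String)

object ExecLossReason {
  // Map a container exit code to a human-readable hint, in the spirit of the
  // exit-code cases discussed later in this review.
  def fromExitCode(exitCode: Int): ExecLossReason = {
    val hint = exitCode match {
      case 0   => "(clean shutdown)"
      case 137 => "(SIGKILL, possible container OOM)"
      case 139 => "(SIGSEGV: that's unexpected)"
      case _   => ""
    }
    ExecLossReason(exitCode, exitCausedByApp = exitCode != 0,
      s"Executor exited with code $exitCode $hint".trim)
  }
}
```

Once something of this shape reaches the scheduler backend, the Executors page only needs to render the message in the new column.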

@holdenk holdenk changed the title [SPARK-34764][CORE][K8S] Propagate reason for exec loss to Web UI [SPARK-34764][CORE][K8S][UI] Propagate reason for exec loss to Web UI May 4, 2021
@holdenk (Contributor, Author) commented May 4, 2021

cc @dongjoon-hyun & @BryanCutler

@dongjoon-hyun (Member) commented:

Thank you for pinging me, @holdenk.
cc @attilapiros

Shuffle Write</span></th>
<th>Logs</th>
<th>Thread Dump</th>
<th>Exec Loss Reason</th>
Member:

Is this always empty in non-K8s resource managers?

Contributor Author:

No, exec loss reason is populated for YARN as well :)

case 126 => "(not executable - possibly perm or arch)"
case 137 => "(SIGKILL, possible container OOM)"
case 139 => "(SIGSEGV: that's unexpected)"
case 255 => "(exit-1, your guess is as good as mine)"
Member:

I expected we'd have an error code for eviction due to running out of disk during worker decommission. What is the error code for that, @holdenk?

Contributor Author:

So I think it's going to be inconsistent depending on how exactly it shows up (e.g. does the JVM hit an uncaught exception trying to write a file, or do we exceed the resource quota). So for now I don't have a clear exit code to map it to, unfortunately. I could try to add a base handler for uncaught IO errors that exits with a specific code, but I'd rather do that in a separate PR.
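
As a purely hypothetical sketch of the deferred idea (not part of this PR; the object name and exit code value below are made up), such a base handler could look roughly like this:

```scala
import java.io.IOException

// Hypothetical sketch: install a default uncaught-exception handler that exits
// with a dedicated code for IO failures (e.g. disk full), so the resource manager
// and the new UI column can report a more specific loss reason.
object UncaughtIOExitHandler {
  // Made-up exit code; not an established Spark convention.
  val IO_ERROR_EXIT_CODE = 56

  def install(): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = e match {
        case _: IOException => System.exit(IO_ERROR_EXIT_CODE)
        case _              => System.exit(1)
      }
    })
  }
}
```

Mapping uncaught IO failures to a dedicated exit code would then let the exit-code cases above report something more specific than a generic non-zero exit.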

@SparkQA commented May 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42666/

@SparkQA commented May 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42666/

@SparkQA commented May 5, 2021

Test build #138145 has finished for PR 32436 at commit 433ee83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor) commented:

Can you suggest an event log we can use for checking the UI? If none of the ones already checked in is good enough (i.e. there are no interesting executor losses in the event log), then we should consider adding a new one. This would be helpful not only for the current reviewers; future developers in this area could also use it to check (at least by eye) whether their new changes have broken this feature.

WDYT?

…ark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala

Co-authored-by: Reid Mewborne <[email protected]>
@holdenk (Contributor, Author) commented May 5, 2021

So, if you want, it's pretty easy to trigger OOMs with the GroupByKey example (that's how I did the manual test). I'm not sure about event logs though; do we have those committed somewhere for folks to use when debugging with the history server?
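
For reference, a rough sketch of that kind of repro (an illustration under the assumption of a small executor heap, e.g. spark.executor.memory=512m; not the exact job used for the manual test):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative OOM repro: a heavily skewed groupByKey that materialises huge
// groups on a single executor. Submitted with a small executor memory setting,
// the executor is typically killed (often exit code 137 / container OOM on K8s),
// which should surface a reason in the new column.
object GroupByOomRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("exec-loss-repro").getOrCreate()
    val sc = spark.sparkContext
    val count = sc.parallelize(1 to 100000000, numSlices = 10)
      .map(i => (i % 2, new Array[Byte](1024))) // two keys, ~1 KB values
      .groupByKey()                             // pulls each key's values into memory
      .count()
    println(s"groups: $count")
    spark.stop()
  }
}
```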

@attilapiros (Contributor) commented:

We store event logs for unit tests; here is the directory: https://github.com/apache/spark/tree/master/core/src/test/resources/spark-events

@SparkQA commented May 5, 2021

Test build #138178 has finished for PR 32436 at commit 19355d4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 5, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42699/

@SparkQA commented May 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42913/

@SparkQA commented May 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42913/

@SparkQA commented May 11, 2021

Test build #138390 has finished for PR 32436 at commit a4263cd.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented May 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42923/

@SparkQA commented May 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42923/

@SparkQA commented May 11, 2021

Test build #138401 has finished for PR 32436 at commit 3ab5a09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ExtractValue extends Expression
  • trait ShuffledJoin extends JoinCodegenSupport

@BryanCutler (Member) left a comment:

Very nice to have this, thanks @holdenk, LGTM

@holdenk (Contributor, Author) commented May 11, 2021

K8s failure is unrelated (The connection to the server localhost:8080 was refused - did you specify the right host or port?). I'll merge this Thursday unless anyone has additional suggestions :)

@holdenk (Contributor, Author) commented May 13, 2021

Going to merge now, thanks everyone for the review and feedback. We can iterate on adding more exit codes in future PRs if we see any common questions on the user@ list :)

@asfgit asfgit closed this in 160b3be May 13, 2021
@holdenk (Contributor, Author) commented May 13, 2021

Merged to the current dev branch :) Since it's a new feature, I'm not planning on backporting.

return threadDumpEnabled;
}

function formatLossReason(removeReason, type, row) {
@HyukjinKwon (Member) commented May 14, 2021:

Eh, uhoh. Seems like the JavaScript linter is broken by this:

https://github.com/apache/spark/runs/2579648547

added 118 packages in 1.482s

/__w/spark/spark/core/src/main/resources/org/apache/spark/ui/static/executorspage.js
   34:41  error  'type' is defined but never used. Allowed unused args must match /^_ignored_.*/u  no-unused-vars
   34:47  error  'row' is defined but never used. Allowed unused args must match /^_ignored_.*/u   no-unused-vars
   35:1   error  Expected indentation of 2 spaces but found 4                                      indent
   36:1   error  Expected indentation of 4 spaces but found 7                                      indent
   37:1   error  Expected indentation of 2 spaces but found 4                                      indent
   38:1   error  Expected indentation of 4 spaces but found 7                                      indent
   39:1   error  Expected indentation of 2 spaces but found 4                                      indent
  556:1   error  Expected indentation of 14 spaces but found 16                                    indent
  557:1   error  Expected indentation of 14 spaces but found 16                                    indent

Mind taking a look please?

@HyukjinKwon (Member):

I just made a quick followup to fix up the build: #32541

sarutak pushed a commit that referenced this pull request May 14, 2021
…r JavaScript linter

### What changes were proposed in this pull request?

This PR is a followup of #32436 which broke JavaScript linter. There was a logical conflict - the linter was added after the last successful test run in that PR.

```
added 118 packages in 1.482s

/__w/spark/spark/core/src/main/resources/org/apache/spark/ui/static/executorspage.js
   34:41  error  'type' is defined but never used. Allowed unused args must match /^_ignored_.*/u  no-unused-vars
   34:47  error  'row' is defined but never used. Allowed unused args must match /^_ignored_.*/u   no-unused-vars
   35:1   error  Expected indentation of 2 spaces but found 4                                      indent
   36:1   error  Expected indentation of 4 spaces but found 7                                      indent
   37:1   error  Expected indentation of 2 spaces but found 4                                      indent
   38:1   error  Expected indentation of 4 spaces but found 7                                      indent
   39:1   error  Expected indentation of 2 spaces but found 4                                      indent
  556:1   error  Expected indentation of 14 spaces but found 16                                    indent
  557:1   error  Expected indentation of 14 spaces but found 16                                    indent
```

### Why are the changes needed?

To recover the build

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested:

```bash
 ./dev/lint-js
lint-js checks passed.
```

Closes #32541 from HyukjinKwon/SPARK-34764-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021

Closes apache#32436 from holdenk/SPARK-34764-propegate-reason-for-exec-loss.

Lead-authored-by: Holden Karau <[email protected]>
Co-authored-by: Holden Karau <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
