DeepSpeed Training Deadlock #16518

@eaplatanios

Bug description

I'm using PyTorch Lightning with the DeepSpeed strategy (ZeRO stage 2) to train on 8 V100 GPUs on a single node, and I'm running into a deadlock. If an OOM occurs on one of the GPUs that does not belong to the main process that started the job, that "secondary process" terminates, but all other processes end up waiting on it forever (100% GPU utilization with no progress). This looks like a deadlock that Lightning should be managing, but I'm not sure where that is supposed to be handled or how. Aside from a potential resolution, pointers to how Lightning is meant to handle this kind of deadlock, and where that happens in the codebase, would be super helpful. Thank you!
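
For context, here is a minimal sketch that should reproduce the setup described above. It assumes pytorch_lightning 1.9.x with DeepSpeed installed; the module, sizes, and the forced allocation are hypothetical and only serve to trigger an OOM on a non-zero rank while rank 0 keeps running.

```python
# Hypothetical repro sketch, not the actual training code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy


class ToyModule(pl.LightningModule):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Simulate a rank-local OOM: only non-zero ranks try this huge
        # allocation, so a "secondary process" dies while the main process
        # (rank 0) keeps waiting on the next collective.
        if self.global_rank != 0 and batch_idx == 10:
            _ = torch.empty(1 << 40, device=self.device)  # forces CUDA OOM
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    dataset = TensorDataset(torch.randn(1024, 4096), torch.randn(1024, 1))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        strategy=DeepSpeedStrategy(stage=2),
        max_epochs=1,
    )
    trainer.fit(ToyModule(), DataLoader(dataset, batch_size=8))


if __name__ == "__main__":
    main()
```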

Environment
- Lightning Component: Trainer
- PyTorch Lightning Version: 1.9.0
- PyTorch Version: 1.13.0
- Python version: 3.7
- OS: Ubuntu 20
- CUDA version: 11.6
- GPU models and configuration: 8 V100s with 32GB each
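
For reference, the kind of cross-rank coordination I have in mind is sketched below against plain torch.distributed. This is not Lightning's internal mechanism, just an illustration of turning a rank-local OOM into a collective decision so the surviving ranks can skip the batch (or shut down) together instead of blocking in the next collective. With ZeRO stage 2 an OOM can also happen inside DeepSpeed's own gradient collectives, where a pattern like this would not help.

```python
# Sketch of cross-rank failure signalling; assumes torch.distributed is
# already initialized (e.g. NCCL backend) and `run_step` is a hypothetical
# callable wrapping forward/backward for one batch.
import torch
import torch.distributed as dist


def step_with_oom_barrier(run_step) -> bool:
    """Run `run_step()`; return True only if every rank succeeded."""
    failed = False
    try:
        run_step()
    except RuntimeError as err:  # CUDA OOM surfaces as a RuntimeError
        if "out of memory" not in str(err):
            raise
        failed = True
        torch.cuda.empty_cache()

    # Every rank participates in this all-reduce, so the result is identical
    # everywhere: if any rank failed, all ranks learn about it and can react
    # in lockstep instead of deadlocking on a missing peer.
    flag = torch.tensor([1.0 if failed else 0.0], device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)
    return flag.item() == 0.0
```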

cc @Borda @awaelchli @justusschock
