DeepSpeed Training Deadlock #16518

@eaplatanios

Bug description

I'm using PyTorch Lightning with the DeepSpeed strategy (ZeRO stage 2) to train on 8 V100 GPUs on a single node, and I'm running into a deadlock. If an OOM occurs on one of the GPUs that does not belong to the main process that started the job, that "secondary process" terminates, but all other processes end up waiting on it forever (100% GPU utilization with no progress). This looks like a deadlock that Lightning should be managing, but I'm not sure where that is supposed to be handled or how. Aside from a potential resolution, pointers to how Lightning is meant to handle this kind of deadlock, and where that happens in the codebase, would be super helpful. Thank you!
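
For context, here is a minimal sketch that should reproduce the setup described above. It assumes pytorch_lightning 1.9.x with DeepSpeed installed; the module, sizes, and the forced allocation are hypothetical and only serve to trigger an OOM on a non-zero rank while rank 0 keeps running.

```python
# Hypothetical repro sketch, not the actual training code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy


class ToyModule(pl.LightningModule):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Simulate a rank-local OOM: only non-zero ranks try this huge
        # allocation, so a "secondary process" dies while the main process
        # (rank 0) keeps waiting on the next collective.
        if self.global_rank != 0 and batch_idx == 10:
            _ = torch.empty(1 << 40, device=self.device)  # forces CUDA OOM
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    dataset = TensorDataset(torch.randn(1024, 4096), torch.randn(1024, 1))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        strategy=DeepSpeedStrategy(stage=2),
        max_epochs=1,
    )
    trainer.fit(ToyModule(), DataLoader(dataset, batch_size=8))


if __name__ == "__main__":
    main()
```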

Environment
- Lightning Component: Trainer
- PyTorch Lightning Version: 1.9.0
- PyTorch Version: 1.13.0
- Python version: 3.7
- OS: Ubuntu 20
- CUDA version: 11.6
- GPU models and configuration: 8 V100s with 32GB each
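
For reference, the kind of cross-rank coordination I have in mind is sketched below against plain torch.distributed. This is not Lightning's internal mechanism, just an illustration of turning a rank-local OOM into a collective decision so the surviving ranks can skip the batch (or shut down) together instead of blocking in the next collective. With ZeRO stage 2 an OOM can also happen inside DeepSpeed's own gradient collectives, where a pattern like this would not help.

```python
# Sketch of cross-rank failure signalling; assumes torch.distributed is
# already initialized (e.g. NCCL backend) and `run_step` is a hypothetical
# callable wrapping forward/backward for one batch.
import torch
import torch.distributed as dist


def step_with_oom_barrier(run_step) -> bool:
    """Run `run_step()`; return True only if every rank succeeded."""
    failed = False
    try:
        run_step()
    except RuntimeError as err:  # CUDA OOM surfaces as a RuntimeError
        if "out of memory" not in str(err):
            raise
        failed = True
        torch.cuda.empty_cache()

    # Every rank participates in this all-reduce, so the result is identical
    # everywhere: if any rank failed, all ranks learn about it and can react
    # in lockstep instead of deadlocking on a missing peer.
    flag = torch.tensor([1.0 if failed else 0.0], device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)
    return flag.item() == 0.0
```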

cc @Borda @awaelchli @justusschock
