Bug description
I'm using PyTorch Lightning with the DeepSpeed strategy (stage 2) to train on 8 V100 GPUs on a single node, and I'm running into the following deadlock. If an OOM occurs on one of the GPUs that does not correspond to the main process that started the job, that "secondary" process terminates, but all the other processes end up waiting on it indefinitely (i.e., 100% GPU utilization with no progress). This looks like a deadlock that Lightning should be managing, but I'm not sure where that is supposed to be handled or how. Aside from a potential resolution, pointers to how Lightning is meant to handle this kind of deadlock and where that happens in the codebase would be super helpful. Thank you!
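
For reference, here is a minimal sketch of the setup; the BoringModel and random data are placeholders, not my actual training code:

```python
# Minimal sketch: DeepSpeed ZeRO stage 2 on 8 GPUs of a single node.
import torch
import pytorch_lightning as pl


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # When an OOM is raised here on a non-zero rank, that rank exits
        # while the remaining ranks stay blocked (the deadlock described above).
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    train_dl = torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=2)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        strategy="deepspeed_stage_2",  # equivalent to DeepSpeedStrategy(stage=2)
        max_epochs=1,
    )
    trainer.fit(BoringModel(), train_dataloaders=train_dl)
```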
Environment
- Lightning Component: Trainer
- PyTorch Lightning Version: 1.9.0
- PyTorch Version: 1.13.0
- Python version: 3.7
- OS: Ubuntu 20
- CUDA version: 11.6
- GPU models and configuration: 8× V100 (32GB each)