Describe the bug
Writing a custom encoder wrapped with accelerate and launched with --multi_gpu on the NCCL backend, I was getting an NCCL error whose origin I couldn't identify. The error did not appear when running on a single GPU (without --multi_gpu, of course). I managed to pin it down, and it came from something rather unfortunate: training with classifier-free guidance.

Here is a minimal sketch of the code before and after working around the "bug".

This breaks (only on NCCL / multi-GPU):
import random

import torch
import torch.nn as nn
from diffusers import ConfigMixin, ModelMixin
from diffusers.configuration_utils import register_to_config


class CustomEncoder(ConfigMixin, ModelMixin):
    config_name = "custom_encoder.json"

    @register_to_config
    def __init__(self, **params):
        super().__init__()
        self.params = nn.Parameter(...)  # define some parameters
        self.register_buffer("null", torch.zeros(...))  # define the null output for guidance

    def forward(self, x):
        # with probability 0.05 we return the all-zeros null embedding
        if self.training and random.random() < 0.05:
            # classifier-free guidance: no link with self.params
            B = x.size(0)
            embedding = self.null.expand(B, -1, -1)  # having this non-gradient line breaks NCCL
        else:
            # regular path: links with self.params
            embedding = op(self.params, x)
        return embedding
This works perfectly when the first branch of the if clause is removed:
class CustomEncoder(ConfigMixin, ModelMixin):
    config_name = "custom_encoder.json"

    @register_to_config
    def __init__(self, **params):
        super().__init__()
        self.params = nn.Parameter(...)  # define some parameters

    def forward(self, x):
        return op(self.params, x)
A workaround is to multiply the output of op(self.params, x) by zeros sampled through a PyTorch random function, so that self.params always stays in the computation graph (a sketch of what I mean is below). Even if this turns out to be the proper way to do classifier-free guidance here, detecting this situation and emitting a warning could really save users time.
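For reference, here is a sketch of that workaround (op, self.params, and the tensor shapes are the placeholders from the sketches above, so this is illustrative rather than drop-in code). My understanding is that the original version breaks because the rank that takes the null branch never produces a gradient for self.params, so its gradient all-reduce never matches the other ranks and the NCCL watchdog times out; always computing op(self.params, x) and zeroing part of it with a random mask keeps the parameter in the graph on every rank:

    # replaces CustomEncoder.forward from the first sketch above
    def forward(self, x):
        # always touch self.params so every rank produces a gradient for it
        embedding = op(self.params, x)
        if self.training:
            # classifier-free guidance: zero out ~5% of the samples with a mask drawn
            # from torch's RNG instead of taking a separate, parameter-free branch
            keep = (torch.rand(x.size(0), 1, 1, device=x.device) >= 0.05).to(embedding.dtype)
            embedding = embedding * keep
        return embedding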
Reproduction
I don't have the time to make a proper MRE, but this is definitely what broke the run, and I'm 100% sure it is the cause.
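That said, here is an untested sketch of what an MRE could look like, based on my understanding of the failure mode (all names, shapes, and the 0.5 probability are made up for illustration). As soon as one DDP rank takes the parameter-free branch while the other does not, the ranks should disagree about which gradient all-reduce to launch, and the run should stall until the NCCL watchdog timeout shown in the logs below fires:

import random

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class SometimesUnused(nn.Module):
    """Toy stand-in for the custom encoder: one branch never touches `weight`."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(8, 8))  # only used on the "conditional" branch
        self.head = nn.Linear(8, 1)                    # always used, keeps the loss differentiable
        self.register_buffer("null", torch.zeros(8))   # null embedding for guidance

    def forward(self, x):
        if self.training and random.random() < 0.5:
            emb = self.null.expand(x.size(0), -1)      # no dependence on self.weight
        else:
            emb = x @ self.weight
        return self.head(emb)


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = f"cuda:{rank}"
    model = DDP(SometimesUnused().to(device), device_ids=[rank])
    model.train()
    for _ in range(1000):
        x = torch.randn(4, 8, device=device)
        model(x).sum().backward()  # expected to stall once the two ranks take different branches
        model.zero_grad(set_to_none=True)


if __name__ == "__main__":
    main()  # launch on two GPUs, e.g. torchrun --nproc_per_node=2 repro.py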
Logs
Steps: 0%| | 15/3840 [15:39<72:01:58, 67.80s/it, lr=0.0001, step_loss=0.295][E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=348, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803364 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=349, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802223 milliseconds before timing out.
Steps: 0%| | 16/3840 [45:44<627:28:26, 590.72s/it, lr=0.0001, step_loss=0.292][E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=348, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803364 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=349, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802223 milliseconds before timing out.
System Info
- diffusers version: 0.14.0
- Platform: Linux-4.18.0-372.41.1.el8_6.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.12
- PyTorch version (GPU?): 1.11.0 (True)
- Huggingface_hub version: 0.13.3
- Transformers version: 4.27.4
- Accelerate version: 0.18.0
- xFormers version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes