Describe the bug
Writing a custom encoder wrapped with accelerate and launched with --multi_gpu on the NCCL backend, I was getting an NCCL error whose origin I couldn't identify. The error did not appear when running on a single GPU (without --multi_gpu, of course). I managed to pin it down, and it came from something rather unfortunate: training with classifier-free guidance.

Here is a minimal sketch of the code before and after working around the "bug".

This breaks (only on NCCL / multi-GPU):
import random

import torch
import torch.nn as nn
from diffusers import ConfigMixin, ModelMixin
from diffusers.configuration_utils import register_to_config


class CustomEncoder(ConfigMixin, ModelMixin):
    config_name = "custom_encoder.json"

    @register_to_config
    def __init__(self, **params):
        super().__init__()
        self.params = nn.Parameter(...)  # define some parameters
        self.register_buffer("null", torch.zeros(...))  # define the null output for guidance

    def forward(self, x):
        # with probability 0.05 we return the all-zeros null embedding
        if self.training and random.random() < 0.05:
            # classifier-free guidance: no link with self.params
            B = x.size(0)
            embedding = self.null.expand(B, -1, -1)  # having this non-gradient line breaks NCCL
        else:
            # regular path: links with self.params
            embedding = op(self.params, x)
        return embedding
This works perfectly when the first branch of the if clause is removed:
class CustomEncoder(ConfigMixin, ModelMixin):
    config_name = "custom_encoder.json"

    @register_to_config
    def __init__(self, **params):
        super().__init__()
        self.params = nn.Parameter(...)  # define some parameters

    def forward(self, x):
        return op(self.params, x)
A workaround is to multiply the output of op(self.params, x) by zeros sampled through a PyTorch random function, so that self.params always stays in the computation graph (a sketch of what I mean is below). Even if this turns out to be the proper way to do classifier-free guidance here, detecting this situation and emitting a warning could really save users time.
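For reference, here is a sketch of that workaround (op, self.params, and the tensor shapes are the placeholders from the sketches above, so this is illustrative rather than drop-in code). My understanding is that the original version breaks because the rank that takes the null branch never produces a gradient for self.params, so its gradient all-reduce never matches the other ranks and the NCCL watchdog times out; always computing op(self.params, x) and zeroing part of it with a random mask keeps the parameter in the graph on every rank:

    # replaces CustomEncoder.forward from the first sketch above
    def forward(self, x):
        # always touch self.params so every rank produces a gradient for it
        embedding = op(self.params, x)
        if self.training:
            # classifier-free guidance: zero out ~5% of the samples with a mask drawn
            # from torch's RNG instead of taking a separate, parameter-free branch
            keep = (torch.rand(x.size(0), 1, 1, device=x.device) >= 0.05).to(embedding.dtype)
            embedding = embedding * keep
        return embedding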
Reproduction
I don't have the time to make a proper MRE, but this is definitely what broke the run, and I'm 100% sure it is the cause.
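That said, here is an untested sketch of what an MRE could look like, based on my understanding of the failure mode (all names, shapes, and the 0.5 probability are made up for illustration). As soon as one DDP rank takes the parameter-free branch while the other does not, the ranks should disagree about which gradient all-reduce to launch, and the run should stall until the NCCL watchdog timeout shown in the logs below fires:

import random

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class SometimesUnused(nn.Module):
    """Toy stand-in for the custom encoder: one branch never touches `weight`."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(8, 8))  # only used on the "conditional" branch
        self.head = nn.Linear(8, 1)                    # always used, keeps the loss differentiable
        self.register_buffer("null", torch.zeros(8))   # null embedding for guidance

    def forward(self, x):
        if self.training and random.random() < 0.5:
            emb = self.null.expand(x.size(0), -1)      # no dependence on self.weight
        else:
            emb = x @ self.weight
        return self.head(emb)


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = f"cuda:{rank}"
    model = DDP(SometimesUnused().to(device), device_ids=[rank])
    model.train()
    for _ in range(1000):
        x = torch.randn(4, 8, device=device)
        model(x).sum().backward()  # expected to stall once the two ranks take different branches
        model.zero_grad(set_to_none=True)


if __name__ == "__main__":
    main()  # launch on two GPUs, e.g. torchrun --nproc_per_node=2 repro.py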
Logs
Steps: 0%| | 15/3840 [15:39<72:01:58, 67.80s/it, lr=0.0001, step_loss=0.295][E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=348, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803364 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=349, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802223 milliseconds before timing out.
Steps: 0%| | 16/3840 [45:44<627:28:26, 590.72s/it, lr=0.0001, step_loss=0.292][E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=348, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803364 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=349, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802223 milliseconds before timing out.
System Info
- diffusers version: 0.14.0
- Platform: Linux-4.18.0-372.41.1.el8_6.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.12
- PyTorch version (GPU?): 1.11.0 (True)
- Huggingface_hub version: 0.13.3
- Transformers version: 4.27.4
- Accelerate version: 0.18.0
- xFormers version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes