Random constant unconditional output on cfg breaks nccl when using multi-gpu accelerate #3173

@ysig

Description

Describe the bug

While writing a custom encoder wrapped with accelerate to run with --multi_gpu on the NCCL backend, I was getting an NCCL error whose origin I couldn't identify. The error did not appear when running on a single GPU (without --multi_gpu, of course). I managed to pin it down, and it came from something rather unfortunate: training with classifier-free guidance.

Here is a minimal before/after sketch of working around the "bug":

This breaks (only on NCCL/multi-gpu):

class CustomEncoder(ConfigMixin, ModelMixin):
    config_name = "custom_encoder.json"

    @register_to_config
    def __init__(self, **params):
        super().__init__()
        self.params = nn.Parameter(...) # define some parameters
        self.register_buffer("null", torch.zeros(...)) # define null output for guidance

    def forward(self, x):
        # with 0.05 probability we return an all zeros embedding
        if self.training and random.random() < 0.05:
            # classifier free guidance
            B = x.size(0)
            embedding = self.null.expand(B, -1, -1) # having this non-gradient line breaks NCCL
            # no link with params
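            # note: each rank rolls random.random() independently, so on a step where this
            # rank takes the null branch it produces no gradient for self.params while the
            # other ranks do; DDP's gradient allreduce then stalls until the NCCL watchdog fires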
        else:
            embedding = op(self.params, x)
            # link with params

        return embedding

This works perfectly when the first branch of the if clause is removed:

class CustomEncoder(ConfigMixin, ModelMixin):
    config_name = "custom_encoder.json"

    @register_to_config
    def __init__(self, **params):
        super().__init__()
        self.params = nn.Parameter(...) # define some parameters

    def forward(self, x):
        return op(self.params, x)

A workaround is to multiply the output of op(self.params, x) by zeros sampled through a PyTorch random function.
Even if this is the proper way to do it, detecting this problem and throwing a warning message could really save users time.
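
Here is a minimal sketch of that workaround (reusing op and the CustomEncoder definition from above; the scalar mask and the 0.05 probability are placeholders taken from the sketch, not a tested implementation). The key point is that self.params takes part in the forward pass on every rank, so each rank produces a (possibly zero) gradient for it and the DDP allreduce no longer stalls:

    def forward(self, x):
        # always route through self.params so it stays in the autograd graph on every rank
        embedding = op(self.params, x)
        if self.training:
            # classifier free guidance drop: zero the embedding with 0.05 probability;
            # multiplying by a torch-sampled mask keeps the gradient path to self.params
            # intact (its gradient is simply zero on dropped steps)
            keep = (torch.rand(1, device=x.device) > 0.05).to(embedding.dtype)
            embedding = embedding * keep
        return embedding

Alternatively, passing find_unused_parameters=True to DDP (via accelerate's DistributedDataParallelKwargs) should also avoid the hang, at the cost of an extra graph traversal each step.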

Reproduction

I don't have the time to put together an MRE, but this is definitely what was failing, and I'm 100% sure the cause is the one described above.

Logs

Steps:   0%|          | 15/3840 [15:39<72:01:58, 67.80s/it, lr=0.0001, step_loss=0.295][E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=348, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803364 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=349, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802223 milliseconds before timing out.
Steps:   0%|          | 16/3840 [45:44<627:28:26, 590.72s/it, lr=0.0001, step_loss=0.292][E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=348, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803364 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=349, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802223 milliseconds before timing out.

System Info

  • diffusers version: 0.14.0
  • Platform: Linux-4.18.0-372.41.1.el8_6.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.12
  • PyTorch version (GPU?): 1.11.0 (True)
  • Huggingface_hub version: 0.13.3
  • Transformers version: 4.27.4
  • Accelerate version: 0.18.0
  • xFormers version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes
