
[Horovod] ModelCheckpoint and EarlyStopping CBs hit errors with Torch 1.13+ #15802

@chongxiaoc

Bug description

Since PyTorch 1.13, we have observed that the ModelCheckpoint and EarlyStopping callbacks hit an undefined symbol error when used with the Horovod strategy.

Details and examples are in horovod/horovod@e392eb9

It is reproducible with Torch 1.13 alone, but I think the underlying issue is that reduce_op from DDP should not be mixed with Horovod: on 1.13, ReduceOp.__eq__ only accepts ReduceOp or RedOpType arguments, so comparing a ReduceOp against None raises a TypeError instead of returning False. This line in PTL hits the error:

https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/strategies/horovod.py#L179

How to reproduce the bug

from torch.distributed import ReduceOp

op = None
# On torch 1.13+, the membership test falls back to ReduceOp.__eq__(None),
# which raises a TypeError instead of returning False:
op in (ReduceOp.SUM, None)
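
For comparison, here is a minimal sketch of a check that does not trip the new behavior (an illustration, not the actual PTL code): testing for None with `is` first means ReduceOp.__eq__ is never invoked with a non-ReduceOp argument.

from torch.distributed import ReduceOp

op = None
# Short-circuit on None before any ReduceOp comparison, so that
# ReduceOp.__eq__ never receives a non-ReduceOp argument.
safe = op is None or op == ReduceOp.SUM   # True, no TypeError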

Error messages and logs

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
        1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
        2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool

    Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7fba78c9e0b0>, None

Environment


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.13+
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

Comments and suggestions are welcome.
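
One possible direction, sketched under the assumption that the strategy only needs to map the incoming reduce_op onto Horovod's own constants (hvd.Sum / hvd.Average): do an isinstance/None check up front so that DDP's ReduceOp is never compared against None or a string. The _to_hvd_op name and the exact None/"sum" mapping are assumptions for illustration, not the current PTL implementation.

import horovod.torch as hvd
from torch.distributed import ReduceOp

def _to_hvd_op(reduce_op):
    # Hypothetical helper: pick the Horovod op without ever calling
    # ReduceOp.__eq__ with a non-ReduceOp argument (which raises on torch 1.13+).
    if isinstance(reduce_op, ReduceOp):
        return hvd.Sum if reduce_op == ReduceOp.SUM else hvd.Average
    if reduce_op in (None, "sum"):
        return hvd.Sum
    return hvd.Average

A guard along these lines keeps ReduceOp comparisons confined to arguments that are actually ReduceOp instances, which is what the 1.13 __eq__ overloads expect.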

cc @awaelchli
