Bug description
Since PyTorch 1.13, we have observed that the ModelCheckpoint and EarlyStopping callbacks hit an undefined symbol error with the Horovod strategy. Details and examples are in horovod/horovod@e392eb9.
It is reproducible with Torch 1.13 alone, but I think the underlying issue is that `reduce_op` from DDP should not be mixed with Horovod. This line in PTL hits the error.
How to reproduce the bug
```python
from torch.distributed import ReduceOp

op = None
op in (ReduceOp.SUM, None)  # raises TypeError
```
Error messages and logs
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
    1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
    2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool

Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7fba78c9e0b0>, None
```
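The failure mode can be demonstrated without torch: the membership test `op in (ReduceOp.SUM, None)` ends up invoking the pybind11 `ReduceOp.__eq__` with a `None` argument, which raises instead of returning `False`. Below is a minimal sketch with a hypothetical stand-in class (`FakeReduceOp` and `is_sum_or_none` are illustrative names, not PTL or torch APIs) showing a None-safe check that avoids the error by testing for `None` via identity first:

```python
class FakeReduceOp:
    """Stand-in for torch._C._distributed_c10d.ReduceOp: its pybind11
    __eq__ accepts only another ReduceOp and raises TypeError for None."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, FakeReduceOp):
            raise TypeError("__eq__(): incompatible function arguments")
        return self.name == other.name


SUM = FakeReduceOp("SUM")


def is_sum_or_none(op):
    # Check for None via identity first, so __eq__ is never invoked
    # with a None argument (unlike the pattern `op in (SUM, None)`,
    # which compares each tuple element against op via ==).
    return op is None or op == SUM
```

With this ordering, `is_sum_or_none(None)` short-circuits before any `ReduceOp` comparison runs, while the original `op in (SUM, None)` pattern still raises for any `op` that is not `SUM`.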
Environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.13+
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
Comments and suggestions are welcome.
cc @awaelchli