Closed
Labels
oncall: distributed, triaged
Description
🐛 Describe the bug
Since Torch 1.13, the ReduceOp type appears to have changed, and the script below raises an error:
>>> from torch.distributed import ReduceOp
>>> op = None
>>> op in (ReduceOp.SUM, None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool
Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7fba78c9e0b0>, None
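A caller-side workaround is to test for None with `is` before any `==` comparison ever reaches the bound `__eq__`. The sketch below uses a hypothetical stand-in class (no torch required) that mimics the pybind-bound `ReduceOp.__eq__` rejecting unsupported argument types; `_BoundReduceOp` and `is_sum_or_none` are illustrative names, not part of the torch API:

```python
class _BoundReduceOp:
    """Stand-in for the pybind-bound ReduceOp: __eq__ rejects foreign types."""
    def __eq__(self, other):
        if not isinstance(other, _BoundReduceOp):
            # Mirrors the pybind11 behavior seen in the traceback above.
            raise TypeError("__eq__(): incompatible function arguments")
        return self is other

SUM = _BoundReduceOp()

def is_sum_or_none(op):
    # Short-circuit on `is None` so __eq__ is never invoked with None.
    return op is None or op == SUM
```

With this ordering, `is_sum_or_none(None)` returns True instead of raising, because the tuple-membership form `op in (SUM, None)` would fall back to `SUM.__eq__(None)` and trip the TypeError.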
This breaks Horovod and Lightning end-to-end runs; see the Lightning-side issue Lightning-AI/pytorch-lightning#15802.
Versions
1.13
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @awaelchli