
[ReduceOP] Type bug since Torch 1.13  #90072

@chongxiaoc

Description

🐛 Describe the bug

Since Torch 1.13, the ReduceOp type seems to have changed, and the script below throws an error:

        >>> from torch.distributed import ReduceOp
        >>> op = None
        >>> op in (ReduceOp.SUM, None)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
            1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
            2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool

        Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7fba78c9e0b0>, None
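
Note on the traceback: `op in (ReduceOp.SUM, None)` evaluates `op == ReduceOp.SUM` first; `None.__eq__(ReduceOp.SUM)` returns `NotImplemented`, so Python falls back to the reflected `ReduceOp.__eq__(None)`, which the pybind11-generated overloads reject (that reflected call is what the `Invoked with:` line shows). The tuple is not needed to reproduce; a bare comparison should raise the same error under 1.13:

        >>> None == ReduceOp.SUM
        Traceback (most recent call last):
          ...
        TypeError: __eq__(): incompatible function arguments. ...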

This impacts Horovod and Lightning end-to-end runs; see the Lightning-side issue Lightning-AI/pytorch-lightning#15802. A workaround sketch follows below.
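
A minimal workaround sketch until this is fixed upstream (my own suggestion, not an official API; `is_sum_or_default` is a hypothetical helper name): test for `None` with `is` before any `ReduceOp` comparison, so the pybind11 `__eq__` overload is never invoked with a `None` argument.

        import torch.distributed as dist

        def is_sum_or_default(op):
            # Hypothetical helper: `op is None` short-circuits before the
            # pybind11 ReduceOp.__eq__ overload can see a None argument.
            return op is None or op == dist.ReduceOp.SUM

        print(is_sum_or_default(None))               # True, no TypeError
        print(is_sum_or_default(dist.ReduceOp.SUM))  # True
        print(is_sum_or_default(dist.ReduceOp.MAX))  # False

Note that merely reordering the tuple to `(None, ReduceOp.SUM)` only helps when `op` is `None`; for any other `op`, the `op == None` check would still hit the reflected `__eq__`, so the explicit `is None` guard is more robust.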

Versions

1.13

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

Metadata


    Labels

    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
