Closed
Labels
bug (Something isn't working)
Description
With vLLM v0.2.7, I saw NCCL hang on an allreduce:
(RayWorkerVllm pid=5085) [E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20518, OpType=ALLREDUCE, NumelIn=106496, NumelOut=106496, Timeout(ms)=1800000) ran for 1800270 milliseconds before timing out.
After switching to v0.3.0 (with custom all-reduce), the hang still occurs, but the operation that times out is now a gather:
(RayWorkerVllm pid=4775) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=369526, OpType=GATHER, NumelIn=4000, NumelOut=0, Timeout(ms)=1800000) ran for 1800252 milliseconds before timing out.
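To compare watchdog timeouts like the two above side by side, a minimal sketch that extracts the rank, operation type, and timings from each line may help. The regex and the `parse_watchdog_line` helper are assumptions written against the log format shown in this report only; they are not part of vLLM or PyTorch.

```python
import re

# Hypothetical pattern, derived only from the two log lines in this report.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq_num>\d+), OpType=(?P<op_type>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return the watchdog fields as a dict, or None if the line does not match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    return {
        key: (value if key == "op_type" else int(value))
        for key, value in match.groupdict().items()
    }


if __name__ == "__main__":
    line = (
        "(RayWorkerVllm pid=4775) [E ProcessGroupNCCL.cpp:475] [Rank 1] "
        "Watchdog caught collective operation timeout: "
        "WorkNCCL(SeqNum=369526, OpType=GATHER, NumelIn=4000, NumelOut=0, "
        "Timeout(ms)=1800000) ran for 1800252 milliseconds before timing out."
    )
    fields = parse_watchdog_line(line)
    print(fields["rank"], fields["op_type"], fields["timeout_ms"] / 60000, "min")
```

Both lines report `Timeout(ms)=1800000`, which is 30 minutes, so in each case the collective was stuck for the entire timeout window before the watchdog fired.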
leocnj