Commit 086f20a
authored
Fix test_barrier hang by using static global rank in ProcessGroupXCCL (#2036)
Fixes #1978
In ProcessGroupNCCL, `globalRank()` returns a static int globalRank,
which is first initialized by the ProcessGroup setup, so the globalRank
assigned to each thread matches the id of the CUDA device. However, we
were not using this same pattern for XCCL. Instead, we were just using
the assigned rank of the thread, which were not necessarily the same as
the globalRank.
The failing test `test_barrier` created two separate groups of 2 ranks
each, and then 4 threads called barrier, but all on the same 2-thread
group. Since initially the device id is not specified in this barrier
call, the thread attempts to "guess" the device index. In the previous
code, this guess would be 0 or 1, since the rank of each thread was not
actually the globalRank. In `barrier`, this guessed id was used to
initialize XCCLComm objects, and then call allreduce on a single element
tensor. However, this resulted in an allreduce call two times on each
device, which could result in a hang based on the execution order of the
4 threads.
With the update in this PR, PGXCCL now uses the static globalRank in the
same places as ProcessGroupNCCL, so the initialized XCCLComm objects are
for unique devices and allreduce doesn't call on the same device
multiple times.1 parent f301733 commit 086f20a
File tree
3 files changed
+10
-4
lines changed- src/xccl
3 files changed
+10
-4
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
352 | 352 | | |
353 | 353 | | |
354 | 354 | | |
| 355 | + | |
| 356 | + | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
355 | 360 | | |
356 | 361 | | |
357 | 362 | | |
| |||
379 | 384 | | |
380 | 385 | | |
381 | 386 | | |
382 | | - | |
| 387 | + | |
383 | 388 | | |
384 | 389 | | |
385 | 390 | | |
| |||
410 | 415 | | |
411 | 416 | | |
412 | 417 | | |
413 | | - | |
| 418 | + | |
414 | 419 | | |
415 | 420 | | |
416 | 421 | | |
| |||
2021 | 2026 | | |
2022 | 2027 | | |
2023 | 2028 | | |
2024 | | - | |
| 2029 | + | |
2025 | 2030 | | |
2026 | 2031 | | |
2027 | 2032 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
423 | 423 | | |
424 | 424 | | |
425 | 425 | | |
| 426 | + | |
426 | 427 | | |
427 | 428 | | |
428 | 429 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
39 | 39 | | |
40 | 40 | | |
41 | 41 | | |
42 | | - | |
| 42 | + | |
43 | 43 | | |
44 | 44 | | |
45 | 45 | | |
| |||
0 commit comments