Commit 5f2ed50
[PGNCCL] Watchdog prints call-time traceback when reporting timeout (pytorch#139659)
### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with user's program or library stack.
### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.
The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users.
### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).
```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.
Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
pytorch#3 foo from /data/users/kw2501/sync_async/repro.py:24
pytorch#4 main from /data/users/kw2501/sync_async/repro.py:34
pytorch#5 <module> from /data/users/kw2501/sync_async/repro.py:40
[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
pytorch#3 foo from /data/users/kw2501/sync_async/repro.py:26
pytorch#4 main from /data/users/kw2501/sync_async/repro.py:34
pytorch#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```
From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks divert.
Pull Request resolved: pytorch#139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj1 parent ee42a99 commit 5f2ed50
File tree
3 files changed
+79
-0
lines changed- torch/csrc/distributed/c10d
3 files changed
+79
-0
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
376 | 376 | | |
377 | 377 | | |
378 | 378 | | |
| 379 | + | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
| 383 | + | |
| 384 | + | |
| 385 | + | |
| 386 | + | |
| 387 | + | |
| 388 | + | |
| 389 | + | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
| 394 | + | |
| 395 | + | |
| 396 | + | |
| 397 | + | |
| 398 | + | |
| 399 | + | |
| 400 | + | |
| 401 | + | |
| 402 | + | |
| 403 | + | |
| 404 | + | |
| 405 | + | |
379 | 406 | | |
380 | 407 | | |
381 | 408 | | |
| |||
495 | 522 | | |
496 | 523 | | |
497 | 524 | | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
| 528 | + | |
| 529 | + | |
| 530 | + | |
| 531 | + | |
| 532 | + | |
| 533 | + | |
| 534 | + | |
| 535 | + | |
| 536 | + | |
| 537 | + | |
| 538 | + | |
| 539 | + | |
| 540 | + | |
| 541 | + | |
498 | 542 | | |
499 | 543 | | |
500 | 544 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
663 | 663 | | |
664 | 664 | | |
665 | 665 | | |
| 666 | + | |
| 667 | + | |
| 668 | + | |
666 | 669 | | |
667 | 670 | | |
668 | 671 | | |
| |||
699 | 702 | | |
700 | 703 | | |
701 | 704 | | |
| 705 | + | |
| 706 | + | |
| 707 | + | |
| 708 | + | |
702 | 709 | | |
703 | 710 | | |
704 | 711 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
690 | 690 | | |
691 | 691 | | |
692 | 692 | | |
| 693 | + | |
| 694 | + | |
| 695 | + | |
| 696 | + | |
| 697 | + | |
| 698 | + | |
| 699 | + | |
| 700 | + | |
| 701 | + | |
| 702 | + | |
| 703 | + | |
| 704 | + | |
| 705 | + | |
| 706 | + | |
| 707 | + | |
| 708 | + | |
| 709 | + | |
| 710 | + | |
| 711 | + | |
| 712 | + | |
| 713 | + | |
| 714 | + | |
| 715 | + | |
| 716 | + | |
| 717 | + | |
| 718 | + | |
| 719 | + | |
| 720 | + | |
693 | 721 | | |
694 | 722 | | |
695 | 723 | | |
| |||
0 commit comments