Torch support for CUDA and DDP #552

@lukasgd

Description

Running some basic examples on a system with 4 GH200 modules, using a container image based on nvcr.io/nvidia/pytorch:25.01-py3 with viztracer 1.0.1 installed on top, fails for me as follows.

For moving a tensor to a CUDA device with test_cuda.py

import torch
from viztracer import VizTracer

with VizTracer(log_torch=True) as tracer:
    initial_value = torch.tensor([3.0]).cuda(0)
    print("done!")

I'm getting

/workspace$ python test_cuda.py 
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 330, in _lazy_init
    queued_call()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1567, in _register_triton_kernels
    torch._TritonLibrary.registerOp(
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2585, in registerOp
    cls.lib.define(full_schema)
  File "/usr/local/lib/python3.12/dist-packages/torch/library.py", line 153, in define
    result = self.m.define(schema, alias_analysis, tuple(tags))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: VizTracer: Unexpected type. Might be an event mismatch.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/test_cuda.py", line 5, in <module>
    initial_value = torch.tensor([3.0]).cuda(0)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 336, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: VizTracer: Unexpected type. Might be an event mismatch.

CUDA call was originally invoked at:

  File "/workspace/test_cuda.py", line 1, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2007, in <module>
    _C._initExtension(_manager_path())
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1585, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 261, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 330, in _lazy_init
    queued_call()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1567, in _register_triton_kernels
    torch._TritonLibrary.registerOp(
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2585, in registerOp
    cls.lib.define(full_schema)
  File "/usr/local/lib/python3.12/dist-packages/torch/library.py", line 153, in define
    result = self.m.define(schema, alias_analysis, tuple(tags))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tried to register an operator (triton::_triton_bsr_dense_mm_out(Tensor bsr, Tensor dense, *, Tensor(a!) out) -> Tensor(a!)) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered at /dev/null:2578. Original registration: registered at /dev/null:2578

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/test_cuda.py", line 4, in <module>
    with VizTracer(log_torch=True) as tracer:
  File "/usr/local/lib/python3.12/dist-packages/viztracer/viztracer.py", line 170, in __exit__
    self.stop()
  File "/usr/local/lib/python3.12/dist-packages/viztracer/viztracer.py", line 241, in stop
    self.torch_profile.__exit__(None, None, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 777, in __exit__
    self.stop()
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 793, in stop
    self._transit_action(self.current_action, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 836, in _transit_action
    action()
  File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 239, in stop_trace
    self.profiler.__exit__(None, None, None)
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/profiler.py", line 369, in __exit__
    device_module.synchronize()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 965, in synchronize
    _lazy_init()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 336, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: Tried to register an operator (triton::_triton_bsr_dense_mm_out(Tensor bsr, Tensor dense, *, Tensor(a!) out) -> Tensor(a!)) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered at /dev/null:2578. Original registration: registered at /dev/null:2578

CUDA call was originally invoked at:

  File "/workspace/test_cuda.py", line 1, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2007, in <module>
    _C._initExtension(_manager_path())
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1585, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 261, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

[nid006679:53988:0:53988] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xf86a280)
==== backtrace (tid:  53988) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0x4000c1cd14dc]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3168c) [0x4000c1cd168c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x319b8) [0x4000c1cd19b8]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x4000239507dc]
 4  [0xf86a280]
=================================
Segmentation fault (core dumped)

and for DDP with test_ddp.py

import torch
import torch.distributed as dist
from viztracer import VizTracer

with VizTracer(log_torch=True) as tracer:
    dist.init_process_group(backend='nccl', init_method='env://')   #  having set DDP env vars
    print("done!")

I'm getting

/workspace$ MASTER_ADDR=$(hostname) MASTER_PORT=29500 RANK=0 WORLD_SIZE=1 LOCAL_RANK=1 LOCAL_WORLD_SIZE=1 python test_ddp.py 
Loading finish                                        
Total Entries: 73                                                               
Use the following command to open the report:
vizviewer /workspace/viztracer.json
Traceback (most recent call last):
  File "/workspace/test_ddp.py", line 6, in <module>
    dist.init_process_group(backend='nccl', init_method='env://')   #  having set DDP env vars
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 94, in wrapper
    with _WaitCounter(f"pytorch.wait_counter.c10d.{func.__name__}").guard():
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: VizTracer: Unexpected type. Might be an event mismatch.

With only the CPU and no DDP, a simple test runs fine. Does viztracer support CUDA and DDP workloads with PyTorch?
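For context, the tracebacks suggest that CUDA's deferred lazy initialization (the calls queued at import time, including _register_triton_kernels) is being run while the tracer and torch profiler are active. A minimal sketch of a possible workaround, assuming that is the cause, is to force CUDA initialization before entering VizTracer; I haven't verified that this avoids the "Unexpected type" error:

import torch
from viztracer import VizTracer

# Force CUDA's lazy initialization (which drains the queued calls such as
# _register_triton_kernels) before the tracer / torch profiler is active.
# This is only a sketch of a workaround, not a confirmed fix.
torch.cuda.init()
torch.cuda.synchronize()

with VizTracer(log_torch=True) as tracer:
    initial_value = torch.tensor([3.0]).cuda(0)
    print("done!")

The same idea might apply to the DDP case, e.g. calling torch.cuda.set_device(local_rank) and torch.cuda.init() before the with VizTracer(...) block, but whether that addresses the underlying event mismatch is unclear.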
