[Bug]: RuntimeError: CUDA error: operation not permitted when stream is capturing when serving llama 3.2 90b #10445

@bingwork

Description

Your current environment

The output of `python collect_env.py`

Using 4× H100 80GB GPUs on Ubuntu 22.04.

torch                             2.4.0
torchvision                       0.19.0


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   36C    P0            150W /  700W |   56882MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+


Model Input Dumps

No response

🐛 Describe the bug

I tried adding --enforce-eager, and it worked perfectly. However, I’d like to test if vLLM can run without this flag, as I want to use torch.compile to speed up inference.
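For reference, the invocation that worked is the same serve command with the one extra flag. `--enforce-eager` disables CUDA graph capture during warm-up entirely (trading some decode-time speed for skipping the capture step that fails here):

```shell
vllm serve /models/Llama-3.2-90B-Vision-Instruct/ \
    --dtype auto --tensor_parallel_size 4 --max-num-seqs 2 \
    --gpu_memory_utilization 0.95 --max_model_len 8192 \
    --max_seq_len_to_capture 8192 --enforce-eager
```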

After running

vllm serve /models/Llama-3.2-90B-Vision-Instruct/ --dtype auto --tensor_parallel_size 4 --max-num-seqs 2 --gpu_memory_utilization 0.95 --max_model_len 8192 --max_seq_len_to_capture 8192

it reports this error:

Exception in worker VllmWorkerProcess while processing method initialize_cache.
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1795, in capture
output_hidden_or_intermediate_states = self.model(
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/mllama.py", line 1233, in forward
skip_cross_attention = max(attn_metadata.encoder_seq_lens) == 0
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
output = executor(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/worker.py", line 271, in initialize_cache
self._warm_up_model()
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/worker.py", line 287, in _warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/root/anaconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1515, in capture_model
graph_runner.capture(**capture_inputs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1818, in capture
gc.collect()
File "/root/anaconda3/lib/python3.9/site-packages/torch/cuda/graphs.py", line 185, in __exit__
self.cuda_graph.capture_end()
File "/root/anaconda3/lib/python3.9/site-packages/torch/cuda/graphs.py", line 83, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
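For what it's worth, the first traceback frame points at `skip_cross_attention = max(attn_metadata.encoder_seq_lens) == 0` in `mllama.py`. If `encoder_seq_lens` is (or contains) GPU tensors at that point, Python's `max()` forces a device-to-host read, which is exactly the kind of operation CUDA graph capture forbids; the second `gc.collect()` / `capture_end()` failure then looks like the capture context unwinding after that first error. A minimal sketch (hypothetical names, no vLLM internals, plain Python values standing in for tensors) of the usual capture-safe pattern, hoisting host-side reads out of the captured region:

```python
# Sketch of the capture-safe pattern: resolve any value that would need a
# device->host sync BEFORE graph capture begins, then branch only on plain
# Python values inside the captured region. All names are illustrative.

def precompute_skip_flag(encoder_seq_lens):
    """Host-side work done once, before capture starts.

    With real GPU tensors this is where .item()/max() would go; doing the
    same read inside capture raises "operation not permitted when stream
    is capturing".
    """
    return max(encoder_seq_lens) == 0

def captured_forward(x, skip_cross_attention):
    """Stand-in for the graph-captured forward pass: no host-side reads,
    only branching on a precomputed Python bool."""
    if skip_cross_attention:
        return x
    return [v * 2 for v in x]  # placeholder for the cross-attention path

# Usage: decide the flag outside the capture, then run with it baked in.
skip = precompute_skip_flag([0, 0, 0, 0])
out = captured_forward([1, 2, 3], skip)
print(skip, out)  # → True [1, 2, 3]
```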

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees: no one assigned
Labels: bug (Something isn't working)
