Your current environment
Running on 4x H100 80GB GPUs on Ubuntu 22.04
torch 2.4.0
torchvision 0.19.0
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   36C    P0            150W /  700W |   56882MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
Model Input Dumps
No response
🐛 Describe the bug
I tried adding --enforce-eager, and it worked perfectly. However, I'd like to test whether vLLM can run without this flag, since I want to speed up inference with CUDA graph capture (torch.compile).
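For reference, the working eager-mode invocation was the same command shown below with --enforce-eager appended:
vllm serve /models/Llama-3.2-90B-Vision-Instruct/ --dtype auto --tensor_parallel_size 4 --max-num-seqs 2 --gpu_memory_utilization 0.95 --max_model_len 8192 --max_seq_len_to_capture 8192 --enforce-eager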
After running:
vllm serve /models/Llama-3.2-90B-Vision-Instruct/ --dtype auto --tensor_parallel_size 4 --max-num-seqs 2 --gpu_memory_utilization 0.95 --max_model_len 8192 --max_seq_len_to_capture 8192
it reports this error:
Exception in worker VllmWorkerProcess while processing method initialize_cache.
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1795, in capture
output_hidden_or_intermediate_states = self.model(
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/mllama.py", line 1233, in forward
skip_cross_attention = max(attn_metadata.encoder_seq_lens) == 0
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
output = executor(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/worker.py", line 271, in initialize_cache
self._warm_up_model()
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/worker.py", line 287, in _warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/root/anaconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1515, in capture_model
graph_runner.capture(**capture_inputs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1818, in capture
gc.collect()
File "/root/anaconda3/lib/python3.9/site-packages/torch/cuda/graphs.py", line 185, in exit
self.cuda_graph.capture_end()
File "/root/anaconda3/lib/python3.9/site-packages/torch/cuda/graphs.py", line 83, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
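If it helps triage: the failing line calls Python's builtin max() on attn_metadata.encoder_seq_lens while the CUDA graph is being captured. If encoder_seq_lens is a device tensor at that point (an assumption on my part, not verified against mllama.py), builtin max() iterates it and compares values on the host, and any host-side read of a device tensor forces a stream synchronization, which CUDA forbids during capture. A minimal standalone sketch of that failure mode (plain PyTorch, not vLLM code; assumes a CUDA device is available):

import torch

# Sketch only, not vLLM code. Host-side reads of a device tensor during
# CUDA graph capture raise "operation not permitted when stream is capturing".
x = torch.zeros(4, dtype=torch.int32, device="cuda")
g = torch.cuda.CUDAGraph()
try:
    with torch.cuda.graph(g):
        y = x.max()  # torch reduction: stays on the device, fine during capture
        m = max(x)   # builtin max iterates/compares on the host: raises
except RuntimeError as e:
    print(e)

Note that the with-block's __exit__ then calls capture_end on the already-failed capture, producing the same "operation failed due to a previous error during capture" seen in the second traceback above.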