Your current environment
Running on 4x H100 80GB GPUs on Ubuntu 22.04
torch 2.4.0
torchvision 0.19.0
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   36C    P0            150W /  700W |   56882MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
Model Input Dumps
No response
🐛 Describe the bug
I tried adding --enforce-eager, and it worked perfectly. However, I'd like to test whether vLLM can run without this flag, since I want to speed up inference with CUDA graph capture (torch.compile).
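For reference, the working eager-mode invocation was the same command shown below with --enforce-eager appended:
vllm serve /models/Llama-3.2-90B-Vision-Instruct/ --dtype auto --tensor_parallel_size 4 --max-num-seqs 2 --gpu_memory_utilization 0.95 --max_model_len 8192 --max_seq_len_to_capture 8192 --enforce-eager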
After running:
vllm serve /models/Llama-3.2-90B-Vision-Instruct/ --dtype auto --tensor_parallel_size 4 --max-num-seqs 2 --gpu_memory_utilization 0.95 --max_model_len 8192 --max_seq_len_to_capture 8192
it reports this error:
Exception in worker VllmWorkerProcess while processing method initialize_cache.
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1795, in capture
output_hidden_or_intermediate_states = self.model(
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/model_executor/models/mllama.py", line 1233, in forward
skip_cross_attention = max(attn_metadata.encoder_seq_lens) == 0
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
output = executor(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/worker.py", line 271, in initialize_cache
self._warm_up_model()
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/worker.py", line 287, in _warm_up_model
self.model_runner.capture_model(self.gpu_cache)
File "/root/anaconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1515, in capture_model
graph_runner.capture(**capture_inputs)
File "/root/anaconda3/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1818, in capture
gc.collect()
File "/root/anaconda3/lib/python3.9/site-packages/torch/cuda/graphs.py", line 185, in exit
self.cuda_graph.capture_end()
File "/root/anaconda3/lib/python3.9/site-packages/torch/cuda/graphs.py", line 83, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
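If it helps triage: the failing line calls Python's builtin max() on attn_metadata.encoder_seq_lens while the CUDA graph is being captured. If encoder_seq_lens is a device tensor at that point (an assumption on my part, not verified against mllama.py), builtin max() iterates it and compares values on the host, and any host-side read of a device tensor forces a stream synchronization, which CUDA forbids during capture. A minimal standalone sketch of that failure mode (plain PyTorch, not vLLM code; assumes a CUDA device is available):

import torch

# Sketch only, not vLLM code. Host-side reads of a device tensor during
# CUDA graph capture raise "operation not permitted when stream is capturing".
x = torch.zeros(4, dtype=torch.int32, device="cuda")
g = torch.cuda.CUDAGraph()
try:
    with torch.cuda.graph(g):
        y = x.max()  # torch reduction: stays on the device, fine during capture
        m = max(x)   # builtin max iterates/compares on the host: raises
except RuntimeError as e:
    print(e)

Note that the with-block's __exit__ then calls capture_end on the already-failed capture, producing the same "operation failed due to a previous error during capture" seen in the second traceback above.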