Your current environment
Repro command below.
🐛 Describe the bug
Attempting to serve meta-llama/Llama-3.2-11B-Vision-Instruct with a recent vLLM (>= v0.7.3) results in the error below during the execution of determine_num_available_blocks() at startup:
$ vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-num-seqs 8
Traceback (most recent call last):
File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 400, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
return cls(ipc_path=ipc_path,
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 277, in __init__
self._initialize_kv_caches()
File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
results = self.collective_rpc("determine_num_available_blocks")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 316, in collective_rpc
return self._run_workers(method, *args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
driver_worker_output = run_method(self.driver_worker, sent_method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/utils.py", line 2196, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
self.model_runner.profile_run()
File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 341, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 182, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 1392, in forward
assert actual_len >= last_group_len
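For reference, the same failure should be reproducible without the API server. A minimal sketch, assuming the offline LLM entrypoint goes through the same determine_num_available_blocks() / profile_run() path during KV-cache profiling that vllm serve does:

```python
# Offline reproduction sketch (assumption: constructing LLM() triggers the same
# profiling path as `vllm serve`, so engine construction should abort with the
# same mllama.py assertion).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_num_seqs=8,  # same value as the failing serve command above
)
```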
I have done some investigation, but do not have a fix yet. Here is what I have found:
- the error occurs because the dummy encoder sequences constructed for profiling are longer than the actual encoder length computed in mllama; for single-image requests, this means more than 6404 tokens
- serving the model works as long as max_seq_len / max_num_seqs <= 6404; with the full sequence length, --max-num-seqs 21 works (see the arithmetic sketch after this list)
- I think this bug was introduced in [VLM] Implement merged multimodal processor for Mllama #11427
- before this PR there was a dummy_encoder_data_for_mllama function responsible for constructing the dummy data
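To make the numbers concrete, here is a back-of-the-envelope sketch. It assumes the dummy encoder length per profiled sequence is roughly max_seq_len / max_num_seqs, that max_seq_len defaults to 131072 for this model, and that 6404 is the actual per-image encoder length computed in mllama (presumably 4 tiles x 1601 tokens per tile); these figures come from the findings above rather than from a close reading of the profiling code:

```python
# Rough check of the condition max_seq_len / max_num_seqs <= 6404 described above.
MAX_ENCODER_LEN = 6404    # actual encoder length for a single-image request (assumed)
MAX_SEQ_LEN = 131072      # assumed default max_model_len for this model

for max_num_seqs in (8, 21):
    dummy_encoder_len = MAX_SEQ_LEN // max_num_seqs  # dummy tokens per profiled sequence
    verdict = "ok" if dummy_encoder_len <= MAX_ENCODER_LEN else "assertion fires"
    print(f"--max-num-seqs {max_num_seqs}: {dummy_encoder_len} -> {verdict}")

# Expected output under these assumptions:
# --max-num-seqs 8: 16384 -> assertion fires
# --max-num-seqs 21: 6241 -> ok
```

Under these assumptions --max-num-seqs 8 gives 16384 dummy encoder tokens per sequence (> 6404), matching the failing command above, while --max-num-seqs 21 gives 6241 (<= 6404), matching the configuration that works.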