What happened + What you expected to happen
There is a discrepancy in how the chat template is resolved for vision language models between vLLM's OpenAI server and Ray Serve LLM's APIs.
The issue appears to be rooted in ad-hoc preprocessing that the OpenAI server performs on prompts before passing them to the engine, which seems to be missing from the LLM engine request submission path in Ray Serve LLM.
Versions / Dependencies
N/A
Reproduction script
Compare the following vLLM command with the corresponding Serve LLM deployment:
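The exact vLLM command is not included here; the following is only a plausible reconstruction that mirrors the engine_kwargs in the Serve LLM config below (model, dtype, context length, and tensor parallelism are assumptions carried over from that config):

# Hypothetical reconstruction, not the command from the original report
vllm serve OpenGVLab/InternVL2_5-1B-MPO \
    --trust-remote-code \
    --dtype half \
    --max-model-len 32768 \
    --tensor-parallel-size 1

The corresponding Serve LLM deployment: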
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="OpenGVLab/InternVL2_5-1B-MPO",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    runtime_env={
        "env_vars": {
            "VLLM_USE_V1": "1"
        }
    },
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=1,
        max_model_len=32768,
        dtype="half",
        trust_remote_code=True,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
Query the deployment with the OpenAI client:

from openai import OpenAI

client = OpenAI(
    # Replace the URL with your deployment's address
    base_url="http://localhost:8000/v1",
    api_key="NOT A REAL KEY",
)

chat_response = client.chat.completions.create(
    model="OpenGVLab/InternVL2_5-1B-MPO",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"
                    },
                },
                {"type": "text", "text": "What is the text in the illustrate?"},
            ],
        },
    ],
    max_tokens=10,
)

if hasattr(chat_response, "choices"):
    print(chat_response.choices[0].message.content)
For example, for this particular model, here is the diff in the conversation that tokenizer.apply_chat_template() gets applied to:
on vllm:
conversation=[{'role': 'system', 'content': 'You are a helpful assistant.'}, {'role': 'user', 'content': '<image>\nWhat is the text in the illustrate?'}]
The image content is replaced with an <image> tag. This substitution happens in the OpenAI server logic before the conversation is passed into the tokenizer's chat template.
on serve llm:
[Message(role='system', content='You are a helpful assistant.'), Message(role='user', content=[Image(field='image_url', image_url={'url': 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png'}), Text(field='text', type='text', text='What is the text in the illustrate?')])]
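For illustration only, here is a minimal Python sketch of the kind of flattening the OpenAI server performs before calling tokenizer.apply_chat_template(). This is not vLLM's actual implementation; the function name and the hard-coded "<image>" placeholder are assumptions for this example (the real placeholder is model-specific):

def flatten_multimodal_content(messages, image_placeholder="<image>"):
    # Hypothetical helper: collapse OpenAI-style multimodal content parts
    # into a single string per message, replacing each image_url part with
    # a placeholder tag, matching the vLLM conversation shown above.
    flattened = []
    for message in messages:
        content = message["content"]
        if isinstance(content, list):
            parts = []
            for part in content:
                if part.get("type") == "image_url":
                    parts.append(image_placeholder)
                elif part.get("type") == "text":
                    parts.append(part["text"])
            content = "\n".join(parts)
        flattened.append({"role": message["role"], "content": content})
    return flattened

Ray Serve LLM appears to skip a step like this and forwards the structured Message/Image/Text objects directly, so the chat template sees a different conversation.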
Issue Severity
High: It blocks me from completing my task.