[Model] Support LLAVA ONEVISION 1.5 #25666
Conversation
Signed-off-by: KillTheHostage <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds support for the LLAVA ONEVISION 1.5 model. The changes include the model implementation, an example for offline inference, and updates to documentation and the model registry. The implementation looks mostly good, but I've identified a critical correctness issue in the video frame padding logic, a significant performance issue in the vision attention mechanism due to redundant distributed operations, and a typo in the model's module name that affects consistency. Addressing these points will improve the correctness, performance, and maintainability of the new model support.
# NOTE: Frames are padded to be divisible by `temporal_patch_size`
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L294
padded_num_frames = num_frames + num_frames % temporal_patch_size
The padding logic for `padded_num_frames` is incorrect for LlavaOnevision1_5. The current implementation, `num_frames + num_frames % temporal_patch_size`, is copied from `Qwen2VLImageProcessor` and does not correctly pad the frames to be divisible by `temporal_patch_size`. The `LlavaOnevision1_5ImageProcessor` overrides this with the correct logic. This bug can lead to incorrect calculations for the number of vision tokens.

The correct logic should pad the number of frames to the next multiple of `temporal_patch_size` if it is not already a multiple. I've also updated the comment to point to the correct implementation.
# NOTE: Frames are padded to be divisible by `temporal_patch_size`
# https://github.com/huggingface/transformers/pull/41095/files#diff-974f33e383120f288a5003517042578059639561a842790263503943399a0170R105
padded_num_frames = num_frames + (temporal_patch_size - num_frames % temporal_patch_size) % temporal_patch_size
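For illustration, a quick standalone comparison of the two formulas (editorial sketch, not part of the PR; `temporal_patch_size = 3` is an arbitrary example value, not the model's config):

```python
# Compare the copied Qwen2-VL padding formula with the corrected one.
temporal_patch_size = 3

for num_frames in range(1, 8):
    old = num_frames + num_frames % temporal_patch_size
    new = num_frames + (temporal_patch_size
                        - num_frames % temporal_patch_size) % temporal_patch_size
    print(f"{num_frames=} old={old} (divisible: {old % temporal_patch_size == 0}) "
          f"new={new} (divisible: {new % temporal_patch_size == 0})")

# e.g. num_frames=7: old formula gives 8 (not divisible by 3),
# corrected formula gives 9 (the next multiple of 3).
```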
The processing logic of LlavaOnevision1_5ImageProcessor is not significantly different from that of Qwen2VLImageProcessor, so the code was directly copied.
def _split_qkv(self, qkv: torch.Tensor) -> tuple[torch.Tensor]:
    seq_len, _ = qkv.shape
    if self.tp_size > 1:
        qkv = self._all_gather_tensor(qkv, self.qkv.hidden_size,
                                      self.tp_size)
    qkv = qkv.reshape(qkv.shape[0], 1, -1)
    q, k, v = qkv.chunk(3, dim=2)
    if self.tp_size > 1:
        splitter = partial(dist_utils.split_tensor_along_last_dim,
                           num_partitions=self.tp_size)
        q = splitter(q)[self.tp_rank]
        k = splitter(k)[self.tp_rank]
        v = splitter(v)[self.tp_rank]
    new_shape = (seq_len, self.num_attn_heads_per_partition,
                 self.hidden_size_per_attn_head)
    q, k, v = (x.view(*new_shape) for x in (q, k, v))
    return q, k, v
The `_split_qkv` method is unnecessarily complex and inefficient. It performs an all_gather operation (via `_all_gather_tensor`) to reconstruct the full QKV tensor and then splits it again, which is redundant because `QKVParallelLinear` already provides sharded outputs. This introduces unnecessary communication overhead.

This can be simplified to directly chunk and reshape the sharded `qkv` tensor. The `_all_gather_tensor` method at lines 228-246 can then be removed.
def _split_qkv(self, qkv: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    seq_len, _ = qkv.shape
    q, k, v = qkv.chunk(3, dim=-1)
    new_shape = (seq_len, self.num_attn_heads_per_partition,
                 self.hidden_size_per_attn_head)
    q = q.contiguous().view(*new_shape)
    k = k.contiguous().view(*new_shape)
    v = v.contiguous().view(*new_shape)
    return q, k, v
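As a rough shape-level illustration of why chunking the sharded tensor directly is sufficient (standalone editorial sketch with made-up dimensions, not vLLM code):

```python
import torch

# Assume a per-rank sharded QKV output from QKVParallelLinear:
# 16 heads, head_dim 64, tensor-parallel size 2 -> each rank holds 8 heads,
# so its qkv projection has 3 * 8 * 64 columns.
seq_len = 5
num_attn_heads_per_partition = 8
hidden_size_per_attn_head = 64
qkv = torch.randn(seq_len,
                  3 * num_attn_heads_per_partition * hidden_size_per_attn_head)

# Chunk the already-sharded tensor directly; no all_gather round-trip needed.
q, k, v = qkv.chunk(3, dim=-1)
new_shape = (seq_len, num_attn_heads_per_partition, hidden_size_per_attn_head)
q, k, v = (x.contiguous().view(*new_shape) for x in (q, k, v))

assert q.shape == k.shape == v.shape == new_shape
```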
"LlavaNextForConditionalGeneration": ("llava_next", "LlavaNextForConditionalGeneration"), # noqa: E501 | ||
"LlavaNextVideoForConditionalGeneration": ("llava_next_video", "LlavaNextVideoForConditionalGeneration"), # noqa: E501 | ||
"LlavaOnevisionForConditionalGeneration": ("llava_onevision", "LlavaOnevisionForConditionalGeneration"), # noqa: E501 | ||
"LlavaOnevision1_5_ForConditionalGeneration": ("llava_onevison1_5", "LlavaOnevision1_5_ForConditionalGeneration"), # noqa: E501 |
There's a typo in the module name: `llava_onevison1_5` should be `llava_onevision1_5`. This should be corrected here, and the corresponding model file vllm/model_executor/models/llava_onevison1_5.py should be renamed to vllm/model_executor/models/llava_onevision1_5.py for consistency.
"LlavaOnevision1_5_ForConditionalGeneration": ("llava_onevison1_5", "LlavaOnevision1_5_ForConditionalGeneration"), # noqa: E501 | |
"LlavaOnevision1_5_ForConditionalGeneration": ("llava_onevision1_5", "LlavaOnevision1_5_ForConditionalGeneration"), # noqa: E501 |
Can you address this?
Thanks for implementing this model. Can you also add it to tests/models/registry.py and tests/models/multimodal/generation/test_common.py to check the correctness?
Hello, during the execution of pytest on …
Purpose
This PR adds support for a new model: LLaVA-OneVision 1.5.
The corresponding Hugging Face model is: https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct
The corresponding GitHub repository is: https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5
The associated Hugging Face Transformers PR is: huggingface/transformers#41095
Test Plan
Serving test
To start the server, use the command vllm serve ./llavaonevision1_5_8B, which defaults to v1 mode and a single GPU. Then use curl to make API requests; in the request, http://127.0.0.1:8080 is an image server written using Flask. The image is as follows:

I also tested vllm serve ./llavaonevision1_5_8B --tensor-parallel 2 in v1 mode with two GPUs.

Offline test

Use the updated offline demo from this PR for testing; the file path is examples/offline_inference/vision_language.py. However, some modifications were made for local testing. I used offline inference to test v0 compatibility.
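For context, a minimal offline-inference sketch using vLLM's Python API (editorial sketch, not the author's modified copy of vision_language.py; the local checkpoint path, image file, and prompt/placeholder format are assumptions and should follow the model's actual chat template):

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Local checkpoint path is an assumption for illustration.
llm = LLM(model="./llavaonevision1_5_8B", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.2, max_tokens=128)

image = Image.open("cherry_blossoms.jpg")
# Placeholder format is a guess; use the model's chat template in practice.
prompt = ("<|im_start|>user\n<image>\nWhat is the content of this image?"
          "<|im_end|>\n<|im_start|>assistant\n")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)
```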
Test Result
The image captures a scene with a tower in the background and cherry blossom trees in the foreground.
The image captures a beautiful scene of cherry blossoms in full bloom, with a tall tower in the background. The blossoms are pink and appear to be in the foreground, with the tower visible through the branches. The sky is clear and blue, creating a serene atmosphere.
The image captures a beautiful scene of a tower, possibly a communication or observation tower, partially obscured by the branches of a tree adorned with pink blossoms. The blossoms suggest that the photo was taken during springtime. The tower is tall and slender, with a conical top, and it stands out against the clear
The image captures a beautiful scene of a tower, possibly a communication or observation tower, partially obscured by branches of a tree adorned with pink blossoms. The blossoms suggest that the photo was taken during springtime. The tower's design is modern, with a cylindrical shape and a pointed top, which is a common architectural