[Model] Add Support for Ovis1.6-Gemma2-9B Model #11240
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This model implementation couples the image processing and the model forwarding. You can refer to the model implementations in llava.py and phi3v.py when adding a new model implementation.
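For context, a minimal sketch of the kind of separation being suggested; the class and attribute names below are purely illustrative (not from this PR, llava.py, or phi3v.py): preprocessing produces tensors up front, and the model's forward only consumes those tensors.

```python
from typing import Optional

import torch
import torch.nn as nn


class FakeImageProcessor:
    """Hypothetical processor: turns an image into pixel_values outside the model."""

    def __call__(self, image) -> torch.Tensor:
        # A real processor would resize/normalize a PIL image; this sketch just
        # returns a dummy (num_patches, C, H, W) tensor of the expected shape.
        return torch.zeros(1, 3, 384, 384)


class FakeOvisModel(nn.Module):
    """The forward pass consumes pre-processed tensors only, never raw images."""

    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.hidden_size = hidden_size
        self.visual_proj = nn.Linear(3 * 384 * 384, hidden_size)

    def forward(self, input_ids: torch.Tensor,
                pixel_values: Optional[torch.Tensor] = None) -> torch.Tensor:
        text_embeds = torch.zeros(*input_ids.shape, self.hidden_size)
        if pixel_values is not None:
            # Merge already-computed visual embeddings with the text embeddings.
            image_embeds = self.visual_proj(pixel_values.flatten(1))
            text_embeds = torch.cat([image_embeds.unsqueeze(1), text_embeds], dim=1)
        return text_embeds


processor = FakeImageProcessor()
model = FakeOvisModel()
pixel_values = processor(image=None)              # preprocessing happens here...
out = model(torch.zeros(1, 4, dtype=torch.long),  # ...forwarding only sees tensors
            pixel_values=pixel_values)
print(out.shape)  # torch.Size([1, 5, 16])
```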
Any news?
Hey @Isotr0py, could you give this PR a review?
Although the model implementation has improved, there are still several things that need to be done:
- Update the documentation to mention this supported model in docs/source/models/supported_models.md.
- Add an example in examples/offline_inference/vision_language.py; if this model supports multi-image inputs, please also update examples/offline_inference/vision_language_multi_image.py (a hedged snippet follows this list).
- Add model correctness tests in tests/models/decoder_only/vision_language/test_models.py and a processor correctness test in tests/models/multimodal/processing/test_common.py.
- Update tests/models/registry.py with the model information.
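For the offline-inference example item above, a hedged sketch of what such an entry might look like; the model ID, the trust_remote_code flag, and especially the `<image>` prompt placeholder are assumptions and would need to match whatever this PR finally registers and the model card's chat template.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumed model ID and prompt format; adjust to the model card / chat template.
llm = LLM(model="AIDC-AI/Ovis1.6-Gemma2-9B", trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
prompt = "<image>\nWhat is shown in this image?"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```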
vllm/model_executor/models/ovis.py (outdated)
# def merge_multimodal(
#     self,
#     text_input_ids: torch.Tensor,
#     text_attention_masks: torch.Tensor,
#     text_labels: Optional[torch.Tensor],
#     pixel_values: List[Optional[torch.Tensor]],
#     left_padding: bool = False
# ):
Please remove this unused code.
Please address the pre-commit linting errors as well.
Thanks @Isotr0py for the review; I'll get back to it.
Will this PR also cover the new Ovis 2 models? https://huggingface.co/collections/AIDC-AI/ovis2-67ab36c7e497429034874464
I'll add the tests for it.
@Player256 I tried this PR, but it doesn't work. I managed to get the model loaded, but it seems that the multimodal processor implementation still doesn't work.
vllm/model_executor/models/ovis.py (outdated)
def get_replacement_ovis(image: PIL.Image.Image):
    _, image_placeholders = self.preprocess_image(image)

    return image_placeholders
Why do we re-process images here?
vllm/model_executor/models/ovis.py (outdated)
def get_image_size_with_most_features(self) -> ImageSize:
    return ImageSize(height=384, width=384)
It seems that Ovis uses dynamic resizing (https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B/blob/b8d93d7468f47fd803eb26ec2c1bc2d7e5fba60e/modeling_ovis.py#L135-L159); does a 384x384 image size really return the most image features from the visual tokenizer?
Hey, I referred to this paper, where the authors fine-tuned ViT models with an input resolution of 384x384 for S/16 and B/16 models, while using 512x512 for L/16 models. This suggests that 384x384 would be an appropriate choice for SigLIP feature extraction if you are using a similar model size (ViT-S or ViT-B).
2106.11297v4.pdf
I mean that this model uses dynamic preprocessing based on aspect ratio, so pixel_values with shape (num_patches, C, H, W) can have a dynamic size along the patch dimension, which leads to different placeholder sequence lengths. For example, given a 2048x2048 image, pixel_values has shape (10, 3, 384, 384). The image size returned here should correspond to the longest placeholder.
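To make the point concrete, here is a simplified sketch of aspect-ratio-aware tiling. It is not the actual Ovis preprocessing linked above; SIDE, MAX_PARTITION, the grid-selection heuristic, and the appended thumbnail are all illustrative assumptions. It only shows that the number of 384x384 patches, and hence the placeholder length, depends on the input size.

```python
from PIL import Image

SIDE = 384          # assumed tile size consumed by the visual tokenizer
MAX_PARTITION = 9   # assumed upper bound on grid tiles


def partition(image: Image.Image) -> list[Image.Image]:
    w, h = image.size
    # Pick the grid (cols x rows) under MAX_PARTITION whose aspect ratio best
    # matches the image, preferring larger grids on ties (simplified heuristic).
    cols, rows = min(
        ((c, r) for c in range(1, MAX_PARTITION + 1)
         for r in range(1, MAX_PARTITION // c + 1)),
        key=lambda g: abs((g[0] / g[1]) - (w / h)) - 1e-3 * (g[0] * g[1]),
    )
    resized = image.resize((cols * SIDE, rows * SIDE))
    tiles = [
        resized.crop((c * SIDE, r * SIDE, (c + 1) * SIDE, (r + 1) * SIDE))
        for r in range(rows) for c in range(cols)
    ]
    # Append a global thumbnail, mirroring the "overall view" patch.
    return tiles + [image.resize((SIDE, SIDE))]


patches = partition(Image.new("RGB", (2048, 2048)))
print(len(patches))  # 10 -> pixel_values would be stacked to (10, 3, 384, 384)
```

With this kind of partitioning, get_image_size_with_most_features should return an image size that maps to the maximum patch count (and therefore the longest placeholder), rather than a fixed 384x384.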
@Isotr0py I am facing this issue in the OvisProcessor.
Somehow the
This pull request has merge conflicts that must be resolved before it can be merged.
Closing as superseded by #17861.
This pull request addresses issue #9638 by adding support for the Ovis1.6-Gemma2-9B model.
FIX #8972
FIX #9638