Conversation

@Player256 Player256 commented Dec 16, 2024

This pull request addresses issue #9638 by adding support for the Ovis1.6-Gemma2-9B model.

FIX #8972
FIX #9638

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@Isotr0py Isotr0py self-assigned this Dec 18, 2024
@Player256 Player256 marked this pull request as ready for review January 4, 2025 14:02
Member

@Isotr0py Isotr0py left a comment

This model implementation couples the image processing and model forwarding...

You can refer to the model implementations in llava.py and phi3v.py when adding your own.

@Swipe4057

any news?

@Player256
Author

Hey @Isotr0py, could you give this PR a review?

Member

@Isotr0py Isotr0py left a comment

Although the model implementation has improved, there are still several things that need to be done:

  • Update the documentation to list this supported model in docs/source/models/supported_models.md
  • Add an example in examples/offline_inference/vision_language.py; if this model supports multi-image inputs, please also update examples/offline_inference/vision_language_multi_image.py
  • Add model correctness tests in tests/models/decoder_only/vision_language/test_models.py and a processor correctness test in tests/models/multimodal/processing/test_common.py
  • Update tests/models/registry.py with the model information (see the sketch below).
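
As a rough illustration of the registry item, something along these lines could go into tests/models/registry.py, assuming the _HfExamplesInfo helper used for the neighbouring entries; the architecture name "Ovis" and the trust_remote_code flag are placeholders to be confirmed against the actual model implementation, not final code:

# Hypothetical sketch for tests/models/registry.py. "Ovis" stands in for
# whatever architecture name the model implementation registers.
_MULTIMODAL_EXAMPLE_MODELS = {
    # ... existing entries ...
    "Ovis": _HfExamplesInfo("AIDC-AI/Ovis1.6-Gemma2-9B",
                            trust_remote_code=True),
}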

Comment on lines 456 to 463
# def merge_multimodal(
# self,
# text_input_ids: torch.Tensor,
# text_attention_masks: torch.Tensor,
# text_labels: Optional[torch.Tensor],
# pixel_values: List[Optional[torch.Tensor]],
# left_padding: bool = False
# ):
Member

Please remove this unused code.

@Isotr0py
Member

Isotr0py commented Feb 4, 2025

Please address pre-commit linting errors as well.

@Player256
Author

Please address pre-commit linting errors as well.

Thanks @Isotr0py for the review, I'll get back to it.

@ismael-dm
Contributor

Will this PR also cover the new Ovis 2 models? https://huggingface.co/collections/AIDC-AI/ovis2-67ab36c7e497429034874464

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 27, 2025
@Player256 Player256 marked this pull request as draft February 27, 2025 09:06
@Player256
Author

I'll add the tests for it.

Signed-off-by: Isotr0py <[email protected]>
Member

@Isotr0py Isotr0py left a comment

@Player256 I tried this PR, but it doesn't work yet. I managed to get the model loaded, but it seems the multimodal processor implementation still doesn't work.

Comment on lines 349 to 352
def get_replacement_ovis(image: PIL.Image.Image):
_, image_placeholders = self.preprocess_image(image)

return image_placeholders
Member

Why do we re-process images here?

Comment on lines 292 to 293
def get_image_size_with_most_features(self) -> ImageSize:
return ImageSize(height=384,width=384)
Member

It seems that Ovis uses dynamic resize (https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B/blob/b8d93d7468f47fd803eb26ec2c1bc2d7e5fba60e/modeling_ovis.py#L135-L159), so does a 384x384 image size really yield the most image features from the visual tokenizer?

Author

Hey, I referred to this paper, where the authors fine-tuned ViT models with an input resolution of 384x384 for the S/16 and B/16 models, and 512x512 for the L/16 models. This suggests that 384x384 would be an appropriate choice for SigLIP feature extraction if you are using a similar model size (ViT-S or ViT-B).
2106.11297v4.pdf

Member

I mean that this model uses dynamic preprocessing based on aspect ratio, so pixel_values (num_patches, C, H, W) can have a dynamic shape along the patch dimension, which results in different placeholder sequence lengths.

For example, given a 2048x2048 image, pixel_values has shape (10, 3, 384, 384). The image size returned here should correspond to the longest placeholder.
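
A minimal sketch of what get_image_size_with_most_features could look like under that assumption; the partition limit and constant names below are placeholders for illustration, not the actual Ovis configuration:

# Sketch only: the real crop limit should come from the Ovis config, and the
# optional global thumbnail may add one more patch.
MAX_PARTITION = 9   # assumed upper bound on crops per image
SIDE = 384          # SigLIP input resolution used by Ovis1.6

def get_image_size_with_most_features(self) -> ImageSize:
    # A maximally elongated image forces the dynamic resize to emit the most
    # crops, and therefore the longest placeholder sequence.
    return ImageSize(width=SIDE * MAX_PARTITION, height=SIDE)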

@Player256
Author

Player256 commented Mar 2, 2025

@Isotr0py I am facing this issue in the OvisProcessor.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/oracle/vllm/test.py", line 5, in <module>
[rank0]:     model = LLM(model=model_name,max_model_len=8192)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/utils.py", line 1045, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/entrypoints/llm.py", line 243, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/engine/llm_engine.py", line 277, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
[rank0]:     results = self.collective_rpc("determine_num_available_blocks")
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/utils.py", line 2232, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/worker/model_runner.py", line 1243, in profile_run
[rank0]:     self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/worker/model_runner.py", line 1308, in _dummy_run
[rank0]:     .dummy_data_for_profiling(self.model_config,
[rank0]:      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/inputs/registry.py", line 336, in dummy_data_for_profiling
[rank0]:     dummy_data = profiler.get_dummy_data(
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/profiling.py", line 168, in get_dummy_data
[rank0]:     mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/profiling.py", line 141, in _get_dummy_mm_inputs
[rank0]:     return self.processor.apply(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/processing.py", line 1476, in apply
[rank0]:     ) = self._cached_apply_hf_processor(
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/processing.py", line 1268, in _cached_apply_hf_processor
[rank0]:     ) = self._apply_hf_processor_main(
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/processing.py", line 1209, in _apply_hf_processor_main
[rank0]:     prompt_ids = self._apply_hf_processor_text_only(prompt)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/processing.py", line 1132, in _apply_hf_processor_text_only
[rank0]:     prompt_ids, _, _ = self._apply_hf_processor_text_mm(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/processing.py", line 1102, in _apply_hf_processor_text_mm
[rank0]:     processed_data = self._call_hf_processor(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/model_executor/models/ovis.py", line 378, in _call_hf_processor
[rank0]:     return super()._call_hf_processor(prompt=prompt,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/multimodal/processing.py", line 1065, in _call_hf_processor
[rank0]:     return self.info.ctx.call_hf_processor(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/oracle/vllm/vllm/inputs/registry.py", line 172, in call_hf_processor
[rank0]:     raise RuntimeError(msg) from exc
[rank0]: RuntimeError: Failed to apply OvisProcessor on data={'text': '<image>'} with kwargs={}

Somehow the <image> token is not handled properly during the profiling phase of vLLM. Can you point me in the right direction on how multimodal processing is done in vLLM? I have tried passing input_ids with the image placeholder token IDs and the pixel values output by the processor, but I don't know exactly where they go.

mergify bot commented Mar 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Player256.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 3, 2025
@mergify mergify bot removed the needs-rebase label Mar 3, 2025
@Isotr0py
Member

Isotr0py commented Mar 3, 2025

Somehow the <image> token is not handled properly during the profiling phase of vLLM. Can you point me in the right direction on how multimodal processing is done in vLLM? I have tried passing input_ids with the image placeholder token IDs and the pixel values output by the processor, but I don't know exactly where they go.

I think you need to implement text-only processing for the OvisProcessor, because text and images are fed to the processor separately in some cases. (IIRC, the original Ovis processor doesn't support text-only inputs.)
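
A minimal sketch of that idea, assuming _call_hf_processor on the vLLM multimodal processor class is the right place to hook in (as other model implementations do); the tokenizer handling here is illustrative rather than final:

from collections.abc import Mapping

from transformers import BatchFeature

# Method on the Ovis multimodal processor class (sketch, not final code).
def _call_hf_processor(
    self,
    prompt: str,
    mm_data: Mapping[str, object],
    mm_kwargs: Mapping[str, object],
) -> BatchFeature:
    if not mm_data.get("images"):
        # Text-only path: the HF OvisProcessor expects images, so tokenize
        # the prompt directly instead of calling it.
        tokenizer = self.info.get_tokenizer()
        prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
        return BatchFeature(dict(input_ids=[prompt_ids]), tensor_type="pt")

    # Image path: defer to the HF processor as usual.
    return super()._call_hf_processor(prompt=prompt,
                                      mm_data=mm_data,
                                      mm_kwargs=mm_kwargs)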

@DarkLight1337 DarkLight1337 mentioned this pull request Mar 6, 2025

mergify bot commented Apr 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Player256.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 6, 2025
@DarkLight1337
Member

Closing as superseded by #17861
