[V1] Support interleaved modality items #15605
Conversation
We still need to update …
No, I don't think we need to. The model runner will only batch consecutive inputs of the same modality; otherwise it executes the encoder on each item individually. See vllm/v1/worker/gpu_model_runner.py, lines 839 to 865 at commit 3f532cb.
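For illustration, a minimal sketch of that consecutive-grouping behavior (the input list and names below are illustrative stand-ins, not the actual gpu_model_runner.py code):

from itertools import groupby

# Hypothetical stand-ins for the scheduled encoder inputs; in vLLM the
# grouping happens inside the model runner over (modality, data) items.
encoder_inputs = [
    ("image", "img0"), ("image", "img1"),  # consecutive, same modality
    ("video", "vid0"),
    ("image", "img2"),
]

# Only runs of consecutive items sharing a modality are batched together;
# an interleaved sequence like image, video, image therefore falls back
# to one encoder execution per run.
for modality, run in groupby(encoder_inputs, key=lambda item: item[0]):
    batch = [data for _, data in run]
    print(f"run encoder for modality={modality!r} on batch of {len(batch)}")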
Hmm, in that case, can we add a test to verify this end-to-end? We can use …
Yeah, we can. I'll add a test for this during the day.
Let's also update the V1 User Guide to remove the part that says interleaved modalities are not supported.
Interestingly, this model seems to ignore the order of modalities from the frontend, and the processor will always group modalities together.
By "frontend" are you referring to online inference? I think this is because the chat template is detected as |
There appears to be an issue with the test runners for validating mixed-modality inputs.
As discussed offline, let's merge this PR first to unblock Qwen2.5-Omni, and work on fixing this in another PR.
Added a simple test to verify that interleaving modalities generates a different result under greedy decoding.
tests/conftest.py (outdated):

-if audios is not None:
-    for i, audio in enumerate(audios):
-        if audio is not None:
-            inputs[i]["multi_modal_data"] = {"audio": audio}
+for i in range(len(inputs)):
+    inputs[i]["multi_modal_data"] = {}
+
+    if images is not None and (image := images[i]) is not None:
+        inputs[i]["multi_modal_data"]["image"] = image
+
+    if videos is not None and (video := videos[i]) is not None:
+        inputs[i]["multi_modal_data"]["video"] = video
+
+    if audios is not None and (audio := audios[i]) is not None:
+        inputs[i]["multi_modal_data"]["audio"] = audio
The test model runner previously didn't support testing with multiple modalities in the same request; this PR fixes that.
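For context, a hypothetical sketch of the kind of end-to-end interleaving check discussed above; the model choice, placeholder syntax, and dummy data are assumptions, not the actual test added in this PR:

# Sketch only: model, placeholder tokens, and inputs are illustrative.
import numpy as np
from PIL import Image

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # any model supporting interleaved items
    limit_mm_per_prompt={"image": 2, "video": 1},
)
greedy = SamplingParams(temperature=0.0, max_tokens=32)

image_a = Image.new("RGB", (224, 224), "red")
image_b = Image.new("RGB", (224, 224), "blue")
video = np.zeros((8, 224, 224, 3), dtype=np.uint8)  # 8 dummy frames

# Placeholder tokens are model-specific; these are stand-ins.
interleaved = {
    "prompt": "<image><video><image> Describe the sequence.",
    "multi_modal_data": {"image": [image_a, image_b], "video": video},
}
grouped = {
    "prompt": "<image><image><video> Describe the sequence.",
    "multi_modal_data": {"image": [image_a, image_b], "video": video},
}

out_interleaved = llm.generate(interleaved, greedy)[0].outputs[0].text
out_grouped = llm.generate(grouped, greedy)[0].outputs[0].text

# Under greedy decoding, changing the item order should change the output.
assert out_interleaved != out_grouped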
Partially unblocks #15130. The current workaround can be inefficient when the batch contains a large number of interleaved items (e.g., <image><video><image><video><image>), but this should be a rare case, since these embeddings are individually fairly large.

Note: use_audio_in_video is not yet covered by this PR and will require a bit more work, since it requires mixing modality items / interleaving embeddings.
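To make the cost of the workaround concrete, a toy count of encoder invocations under consecutive-only batching (the helper and patterns are illustrative, not vLLM code):

from itertools import groupby

def encoder_runs(modalities: list[str]) -> int:
    """Number of encoder invocations when only consecutive
    same-modality items can be batched together."""
    return sum(1 for _ in groupby(modalities))

# Fully interleaved: every item becomes its own encoder run.
print(encoder_runs(["image", "video", "image", "video", "image"]))  # -> 5
# Grouped: one run per modality.
print(encoder_runs(["image", "image", "image", "video", "video"]))  # -> 2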