Add Qwen3-Omni moe thinker #25550
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, they only run fastcheck CI; you can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request adds support for the Qwen3-Omni-Moe model. The changes include a new model implementation file, modifications to handle multimodal rotary embeddings, and registration of the new model. While the implementation is comprehensive, I've identified several critical and high-severity issues related to performance and maintainability. Specifically, there are non-vectorized loops and inefficient tensor operations in the position embedding calculation, which will significantly impact performance. Additionally, there are uses of NumPy within core logic that should be replaced with PyTorch operations to avoid CPU-GPU synchronization. I've also found a few potential bugs related to tensor shape calculations that could lead to runtime errors. Addressing these points will be crucial for integrating this model into vLLM effectively.
def _omni3_get_input_positions_tensor(
    cls,
    config,
    input_ids: Optional[torch.LongTensor] = None,
    image_grid_thw: Optional[torch.LongTensor] = None,
    video_grid_thw: Optional[torch.LongTensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    use_audio_in_video: bool = False,
    audio_seqlens: Optional[torch.LongTensor] = None,
    second_per_grids: Optional[torch.Tensor] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
The function _omni3_get_input_positions_tensor is very long and complex, making it difficult to understand and maintain. More importantly, it processes input sequences one by one within a for loop (for i, input_ids in enumerate(total_input_ids):), which is not vectorized and will lead to significant performance degradation, especially with larger batch sizes. The use of .tolist() and list methods like .index() inside the loop further contributes to the inefficiency. This implementation should be refactored to be vectorized over the batch dimension to meet the performance standards of vLLM. Consider using tensor operations to find indices and process modalities in parallel for all sequences in the batch.
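To make the suggested direction concrete, here is a minimal standalone sketch (not the PR's actual code; the token id and helper name are illustrative placeholders) of locating modality-segment start indices with tensor operations instead of .tolist() and .index():

import torch

# Illustrative placeholder; the real id comes from the model's config/processor.
IMAGE_TOKEN_ID = 151655

def find_segment_starts(input_ids: torch.Tensor, token_id: int) -> torch.Tensor:
    """Return the start index of every contiguous run of `token_id` in a 1-D tensor."""
    is_tok = input_ids == token_id
    prev = torch.zeros_like(is_tok)
    prev[1:] = is_tok[:-1]
    # A run starts where the token matches but the previous position does not.
    return torch.nonzero(is_tok & ~prev, as_tuple=False).squeeze(-1)

ids = torch.tensor([0, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 3, IMAGE_TOKEN_ID, 4])
print(find_segment_starts(ids, IMAGE_TOKEN_ID))  # tensor([1, 4])

The same boolean-mask approach extends to batched inputs and avoids the Python-level loop over each sequence.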
if name == "feature_attention_mask": | ||
dim = -1 | ||
if isinstance(mm_input, torch.Tensor): | ||
return torch.concat(list(mm_input), dim=dim) |
The implementation of _validate_and_reshape_mm_tensor seems to have a bug when handling a torch.Tensor. The line return torch.concat(list(mm_input), dim=dim) is problematic. When mm_input is a tensor, list(mm_input) iterates over its first dimension, and torch.concat then joins these tensors along dim. For example, if mm_input has shape (B, C, L) and dim=1, the result will have shape (C, B*L), which is likely incorrect for batch processing where one would expect to flatten the batch dimension. This will likely cause shape mismatches in downstream processing.
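For concreteness, a small standalone shape check (illustrative only, not code from this PR) reproduces the behavior described above:

import torch

mm_input = torch.zeros(4, 3, 10)               # (B, C, L)
out = torch.concat(list(mm_input), dim=1)      # iterates over B, joins along dim 1
print(out.shape)                               # torch.Size([3, 40]) -> (C, B*L)

# Flattening the batch dimension instead keeps the per-sample structure intact:
print(mm_input.flatten(0, 1).shape)            # torch.Size([12, 10]) -> (B*C, L)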
multimodal_embeddings[index] = embeddings_main
multimodal_embeddings_multiscale.append(embeddings_multiscale)
if len(multimodal_embeddings_multiscale) > 0:
    deepstack_input_embeds = inputs_embeds.new_zeros(inputs_embeds.size(0), multiscale_len * inputs_embeds.size(1))
There appears to be a bug in the shape calculation for deepstack_input_embeds. The second dimension is calculated as multiscale_len * inputs_embeds.size(1), which resolves to multiscale_len * text_config.hidden_size. However, this tensor is later populated with multimodal_embeddings_multiscale, which have a feature dimension of multi_dim (multiscale_len * visual_dim), and then reshaped using visual_dim. This will raise a runtime error if text_config.hidden_size is not equal to visual_dim (vision_config.out_hidden_size). The correct size for the second dimension should be multi_dim (i.e., multiscale_len * visual_dim), which is computed a few lines above:
deepstack_input_embeds = inputs_embeds.new_zeros(inputs_embeds.size(0), multi_dim)
None,
use_audio_in_video,
audio_feature_lengths,
torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
This line creates a tensor in a highly inefficient way. torch.tensor(video_grid_thw) is redundant, as video_grid_thw is already a tensor at this point. Creating a list of 1s and then converting it to a tensor is also inefficient. This can be simplified and made more performant:
- torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
+ torch.ones(video_grid_thw.shape[0], dtype=torch.long, device=video_grid_thw.device))
h_idxs = np.linspace(0, num_grid_per_side-1, h)
w_idxs = np.linspace(0, num_grid_per_side-1, w)
This function uses numpy for calculations (np.linspace), which can lead to performance bottlenecks due to CPU-GPU synchronization and data transfers. The comment on line 379 already indicates this. These operations should be replaced with their torch equivalents to keep the computation on the GPU and within the computation graph:
- h_idxs = np.linspace(0, num_grid_per_side-1, h)
- w_idxs = np.linspace(0, num_grid_per_side-1, w)
+ h_idxs = torch.linspace(0, num_grid_per_side-1, h, device=self.pos_embed.weight.device)
+ w_idxs = torch.linspace(0, num_grid_per_side-1, w, device=self.pos_embed.weight.device)
Are you able to finish this TODO before you have to go OOO?
audio_token_indices = np.arange(next(iter([audio_len])))
curr_video_grid_thw = next(iter([video_grid_thw]))
height = curr_video_grid_thw[1] // spatial_merge_size
width = curr_video_grid_thw[2] // spatial_merge_size
video_token_indices = np.arange(curr_video_grid_thw[0]).reshape(-1, 1, 1)
video_token_indices = np.broadcast_to(
    video_token_indices, (video_token_indices.shape[0], height, width)
).reshape(-1)
video_token_indices = ((video_token_indices + shift) * next(iter([video_second_per_grid_t])) * position_id_per_seconds)
This function uses numpy for array creation and manipulation (np.arange, np.broadcast_to). This forces data transfers between CPU and GPU and can be a performance bottleneck. These should be replaced with torch equivalents to maintain performance:
audio_token_indices = torch.arange(next(iter([audio_len])))
curr_video_grid_thw = next(iter([video_grid_thw]))
height = curr_video_grid_thw[1] // spatial_merge_size
width = curr_video_grid_thw[2] // spatial_merge_size
video_token_indices = torch.arange(curr_video_grid_thw[0]).reshape(-1, 1, 1)
video_token_indices = video_token_indices.expand(video_token_indices.shape[0], height, width).reshape(-1)
video_token_indices = ((video_token_indices + shift) * next(iter([video_second_per_grid_t])) * position_id_per_seconds)
Thanks, can you update tests/models/registry.py to be able to pass the CI? Also, please update the Supported Models page.
Alright, I'll handle these parts. Currently, I'm still working on adding audio-in-video support in v1. In the meantime, one known issue is that I may not be able to straightforwardly reuse relevant modules from Qwen3-VL, because our model has already been made public, and some checkpoint keys and configurations are incompatible with Qwen3-VL. This stems from the fact that our internal iterations were not synchronized. This issue may require further careful discussion. I might go on vacation starting tomorrow and probably won't resume modifications until after October 4th :) You can proceed with the review based on the current version.
None,
use_audio_in_video,
audio_feature_lengths,
torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
- torch.tensor([1] * torch.tensor(video_grid_thw).shape[0]))
+ torch.ones(len(video_grid_thw))
Simplify this
Thanks for your work. May I ask whether the talker model will be supported in the future? It seems Qwen2.5-Omni still only supports the thinker model now.
LGTM! May I know whether the Talker model will be supported by vLLM?
Supporting Qwen3-Omni end-to-end will not be within the scope of vLLM itself.
So that means the vLLM project will support the thinker model, which works just like a normal LLM, and a new multimodal inference project will support the end-to-end Qwen3-Omni model? Can I learn more about this new project?
Yea, that's the right understanding! We're still planning for the new project, so stay tuned!
Great! And may I know whether the new project will also handle single-model multimodal models such as Kimi-Audio, or will they be supported by vLLM?
I really hope the new project is fast and efficient. I tried transformers and the audio output was SLOW...
vllm-project#25550 Signed-off-by: Chen, Wenbin <[email protected]>
Same, even with flash-attn2 it is very slow.
I tried this PR and it's like >20x faster than transformers :)
@houseroad has imported this pull request. If you are a Meta employee, you can view this in D83274891.
Signed-off-by: DarkLight1337 <[email protected]>
Really looking forward to being able to run inference with vLLM for this omni model. Compiling vLLM from a specific branch's source is a real hassle on some limited-access corporate standalone servers 😿.
FYI I'm back to working on this PR.
you are the best!
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Thank you for all the hard work. We really appreciate it. From what I understand, this PR is only for the thinker model and it only supports text output. Is there any rough timeline for when audio output will be added? Thank you so much!
Signed-off-by: Roger Wang <[email protected]>
This PR should be functional again - I probably won't fix the
We don't have a rough timeline yet, but hopefully by the end of this year/early next year.
Got it. Thanks!
Signed-off-by: Roger Wang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Hi, thanks for the great work on this PR! I tried to run it and it works great when providing (one of audio/image/video)+text or image+video+text. However, when I'm running it with audio+image+text or audio+video+text, it crashes. I made a gist with a small example: https://gist.github.com/ai-and-i/76b75f1bef2f2df6b1ea5998c3911918 (I tested it in a jupyter notebook with
Documentation preview: https://vllm--25550.org.readthedocs.build/en/25550/
Hi, thanks for the great work on this. Does this support audio output? @wangxiongts
Signed-off-by: Roger Wang <[email protected]>
This PR from the Qwen team adds the Qwen3-Omni-Moe thinker part.
Testing has been conducted internally across four configurations (v0/v1, eager/CUDA) on several representative benchmarks, with results meeting expectations.
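For reference, a minimal text-only smoke test against this branch could look like the sketch below; the checkpoint name and sampling settings are assumptions for illustration and are not taken from this PR. Real multimodal usage additionally requires the model's chat template plus image/audio/video inputs.

from vllm import LLM, SamplingParams

# Assumed checkpoint name; adjust to the actual released Qwen3-Omni thinker weights.
llm = LLM(model="Qwen/Qwen3-Omni-30B-A3B-Instruct", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Briefly describe what an omni-modal model can do."], params)
print(outputs[0].outputs[0].text)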
Known issues (we hope to resolve them together with the vLLM team):
We sincerely appreciate the great work and support from the vLLM team, and look forward to your feedback.
CLOSE #25472