
Conversation


@hmellor hmellor commented Aug 11, 2025

MoE models already work with the Transformers backend, but performance is not ideal because FusedMoE is not used. Instead, each expert gets its own dedicated linear layers, which are executed in a Python for loop on the Transformers side (sketched below).
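For illustration, here is a minimal sketch of that per-expert loop pattern (the NaiveMoE class below is hypothetical, not the actual Transformers or vLLM code; real routing code differs per model):

```python
import torch
import torch.nn as nn

class NaiveMoE(nn.Module):
    """Hypothetical per-expert-loop MoE layer, for illustration only."""

    def __init__(self, num_experts: int, hidden_size: int, intermediate_size: int):
        super().__init__()
        # One dedicated MLP per expert, held in an nn.ModuleList
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(
        self,
        hidden_states: torch.Tensor,  # (num_tokens, hidden_size)
        topk_ids: torch.Tensor,       # (num_tokens, top_k) selected expert ids
        topk_weights: torch.Tensor,   # (num_tokens, top_k) routing weights
    ) -> torch.Tensor:
        out = torch.zeros_like(hidden_states)
        # The Python-level loop that FusedMoE replaces with a single fused kernel
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot_idx = (topk_ids == expert_id).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += (
                topk_weights[token_idx, slot_idx, None]
                * expert(hidden_states[token_idx])
            )
        return out
```

Each iteration launches its own small GEMMs, so the loop cost grows with the number of experts regardless of how many tokens each expert actually receives.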

Depends on

Changes

This PR adds:

  • MoE model detection in ModelConfig._get_transformers_backend_cls, based on the number of experts reported by the HF config
  • FusedMoE.weight_loader handling in AutoWeightsLoader.load_weights, which is only triggered if expert_mapping is passed to AutoWeightsLoader.load_weights (i.e. it's opt-in, so it won't break any existing custom weight loading)
  • TransformersMoEBase, which can be subclassed and adds the logic needed to substitute in FusedMoE (see the sketch after this list)
    • Modules named experts that are instances of torch.nn.ModuleList are replaced with FusedMoE modules
    • load_weights passes the expert_mapping directly to AutoWeightsLoader.load_weights
  • TransformersMoEModel, TransformersMoEForCausalLM, and TransformersMoEForMultimodalLM, which leverage Python's MRO to add the MoE logic to their non-MoE counterparts
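A rough sketch of the substitution step (simplified; the HF config attribute names and the FusedMoE constructor arguments shown here are assumptions that vary by model family, not the PR's actual code):

```python
import torch.nn as nn
from vllm.model_executor.layers.fused_moe import FusedMoE

def replace_experts_with_fused_moe(model: nn.Module, config) -> None:
    """Swap every nn.ModuleList named `experts` for a FusedMoE module."""
    # Collect targets first so the module tree isn't mutated while iterating it
    targets = [
        name
        for name, module in model.named_modules()
        if name.endswith("experts") and isinstance(module, nn.ModuleList)
    ]
    for name in targets:
        fused = FusedMoE(
            num_experts=config.num_experts,                  # assumed config field
            top_k=config.num_experts_per_tok,                # assumed config field
            hidden_size=config.hidden_size,
            intermediate_size=config.moe_intermediate_size,  # assumed config field
        )
        # Reassign on the parent module so the swap takes effect
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, fused)
```

With the MoE logic in a base class, the MRO composition reduces the new model classes to near-empty subclasses, e.g. class TransformersMoEForCausalLM(TransformersMoEBase, TransformersForCausalLM): pass, where listing TransformersMoEBase first ensures its overrides take precedence.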

Performance

Before this PR, the Transformers backend for MoE models was 24x slower than the dedicated vLLM implementation. This was largely because the Transformers modelling code was not CUDA-graph compilable (--enforce-eager was needed) and because of the for loop iterating over the experts. Both issues should be fixed by the PRs this one depends on.

As the results below show, the average performance of the Transformers backend is <3% worse than that of the dedicated vLLM implementation!

Serve commands:

```bash
# vLLM reference
vllm serve Qwen/Qwen3-30B-A3B --model-impl vllm
# Transformers backend
vllm serve Qwen/Qwen3-30B-A3B --model-impl transformers
```

Benchmark command:

```bash
vllm bench serve --model Qwen/Qwen3-30B-A3B --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 --request-rate inf --save-result
```

Results:

| Metric | vllm | transformers | Diff (%) |
|---|---|---|---|
| Request Throughput (req/s) | 20.0 | 19.5 | -2.7% |
| Output Throughput (tok/s) | 4397 | 4278 | -2.7% |
| Total Token Throughput (tok/s) | 8536 | 8305 | -2.7% |
| Mean TTFT (ms) | 1770 | 1794 | +1.4% |
| P99 TTFT (ms) | 3174 | 3214 | +1.2% |
| Mean TPOT (ms) | 63.6 | 65.2 | +2.5% |
| P99 TPOT (ms) | 204.5 | 207.1 | +1.3% |
| Mean ITL (ms) | 39.5 | 40.4 | +2.3% |

@mergify mergify bot added the new-model Requests to new models label Aug 11, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Mixture-of-Experts (MoE) models within the Transformers backend. The changes are well-structured, including refactoring ModelConfig to handle MoE-specific properties, registering new MoE model classes, and adding a TransformersMoEBase class to manage the replacement of standard MoE layers with vLLM's FusedMoE. A critical issue was found where top_k and intermediate_size for the FusedMoE layer were hardcoded, which would cause issues for many MoE models. A detailed comment with a suggested fix has been provided to address this.
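For context, a hedged illustration of the config-driven approach such a fix implies (the attribute names and getattr fallbacks below are my assumptions about how different MoE config families name these fields, not the reviewer's actual suggestion):

```python
def moe_kwargs_from_config(config) -> dict:
    # Different MoE families name these fields differently (e.g. Qwen-style
    # num_experts_per_tok / moe_intermediate_size), so read them from the
    # config with fallbacks instead of hardcoding constants.
    top_k = getattr(config, "num_experts_per_tok", None) or getattr(
        config, "top_k", None
    )
    intermediate_size = getattr(config, "moe_intermediate_size", None) or getattr(
        config, "intermediate_size", None
    )
    return {"top_k": top_k, "intermediate_size": intermediate_size}
```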


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


@ArthurZucker ArthurZucker left a comment


looks nice already!

@mergify mergify bot added the documentation Improvements or additions to documentation label Aug 14, 2025

mergify bot commented Aug 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hmellor.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 18, 2025
@mergify mergify bot removed the needs-rebase label Aug 25, 2025
@hmellor hmellor enabled auto-merge (squash) October 2, 2025 10:28
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 2, 2025
@hmellor hmellor disabled auto-merge October 2, 2025 16:45

@ProExpertProg ProExpertProg left a comment


Not very familiar with the backend but was curious, just some thoughts


hmellor commented Oct 2, 2025

> Not very familiar with the backend but was curious, just some thoughts

Thanks for the interest! Slowly but surely I'm bolting on more and more functionality to it 🚀

@hmellor hmellor enabled auto-merge (squash) October 2, 2025 23:07
@vllm-bot vllm-bot merged commit 10d7654 into vllm-project:main Oct 3, 2025
56 of 58 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Transformers backend Oct 3, 2025
@hmellor hmellor deleted the transformers-backend-fused-moe branch October 3, 2025 06:12
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
karan pushed a commit to karan/vllm that referenced this pull request Oct 6, 2025
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025