`FusedMoE` support for the Transformers backend #22650
Conversation
Code Review

This pull request introduces support for Mixture-of-Experts (MoE) models within the Transformers backend. The changes are well structured, including refactoring `ModelConfig` to handle MoE-specific properties, registering new MoE model classes, and adding a `TransformersMoEBase` class to manage the replacement of standard MoE layers with vLLM's `FusedMoE`. A critical issue was found where `top_k` and `intermediate_size` for the `FusedMoE` layer were hardcoded, which would cause issues for many MoE models. A detailed comment with a suggested fix has been provided to address this.
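For context, here is a minimal sketch of how these two values would typically be derived from the HF config rather than hardcoded. The attribute names tried below (`num_experts_per_tok`, `moe_intermediate_size`, `intermediate_size`) vary between MoE model families and are assumptions for illustration, not the exact fix proposed in the review comment.

```python
# Hypothetical helper: derive FusedMoE parameters from the HF config instead
# of hardcoding them. MoE families name these fields differently, so fall
# back through a few common attribute names (assumed, not exhaustive).
def moe_params_from_hf_config(hf_config) -> tuple[int, int]:
    """Return (top_k, intermediate_size) for constructing a FusedMoE layer."""
    top_k = getattr(hf_config, "num_experts_per_tok", None)
    intermediate_size = getattr(
        hf_config, "moe_intermediate_size",
        getattr(hf_config, "intermediate_size", None),
    )
    if top_k is None or intermediate_size is None:
        raise ValueError("Could not infer MoE parameters from the HF config")
    return top_k, intermediate_size
```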
looks nice already!
This pull request has merge conflicts that must be resolved before it can be merged.
Not very familiar with the backend but was curious, just some thoughts
Thanks for the interest! Slowly but surely I'm bolting on more and more functionality to it 🚀
MoE models do already work with the Transformers backend, but the performance is not ideal because `FusedMoE` is not used. This means that each expert gets a dedicated linear layer which is executed in a `for` loop on the Transformers side.

Depends on upstream PRs ensuring that `FusedMoE.forward` is called correctly in the Transformers modelling code.
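To make the performance gap concrete, here is a hedged sketch contrasting the per-expert `for` loop described above with the fused call this PR enables. The `NaiveMoE` module and the commented `FusedMoE` arguments are illustrative only; they do not mirror the exact Transformers or vLLM code.

```python
import torch
import torch.nn as nn

# Naive pattern in plain Transformers MoE code: one linear-based expert per
# ModuleList entry, executed in a Python for loop over the selected experts.
class NaiveMoE(nn.Module):
    def __init__(self, num_experts: int, hidden: int, intermediate: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, intermediate, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate, hidden, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden)
        logits = self.router(x)
        weights, indices = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # The slow part: a Python loop dispatching tokens expert by expert.
        for expert_id, expert in enumerate(self.experts):
            mask = indices == expert_id
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

# With this PR, the ModuleList above is swapped for vLLM's FusedMoE, which
# performs routing and expert computation in fused kernels, roughly:
#
#   fused = FusedMoE(num_experts=..., top_k=..., hidden_size=...,
#                    intermediate_size=..., ...)  # illustrative arguments
#   out = fused(hidden_states, router_logits)
```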
Changes

This PR adds:

- MoE model class selection in `ModelConfig._get_transformers_backend_cls` based on the number of experts reported by the HF config
- `FusedMoE.weight_loader` handling to `AutoWeightsLoader.load_weights`, which is only triggered if `expert_mapping` is passed to `AutoWeightsLoader.load_weights` (i.e. it's opt-in, so it won't break any existing custom weight loading)
- `TransformersMoEBase`, which can be subclassed and adds the necessary logic to substitute in `FusedMoE` (see the sketch after this list):
  - modules named `experts` which are instances of `torch.nn.ModuleList` are replaced with `FusedMoE` modules
  - `load_weights` passes the `expert_mapping` directly to `AutoWeightsLoader.load_weights`
- `TransformersMoEModel`, `TransformersMoEForCausalLM` and `TransformersMoEForMultimodalLM`, which leverage Python's MRO to add the MoE logic to their non-MoE counterparts
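Below is a rough sketch of the substitution and opt-in weight-loading flow described in the list above. The helper name `make_fused_moe`, and the exact `FusedMoE` and `AutoWeightsLoader` call signatures in the comments, are assumptions for illustration rather than the PR's real implementation.

```python
import torch.nn as nn

# Hypothetical sketch of the TransformersMoEBase substitution: walk the HF
# model and, wherever a submodule named "experts" is an nn.ModuleList, swap
# it for a single fused module. `make_fused_moe` stands in for the real
# FusedMoE construction, which also needs quantization/parallelism config.
def replace_expert_lists(model: nn.Module, make_fused_moe) -> None:
    for module in model.modules():
        for child_name, child in list(module.named_children()):
            if child_name == "experts" and isinstance(child, nn.ModuleList):
                setattr(module, child_name, make_fused_moe(num_experts=len(child)))

# Weight loading stays opt-in: expert weights are only routed through
# FusedMoE.weight_loader when an expert mapping is passed in. The calls below
# mirror the description above, not necessarily the exact vLLM signatures.
#
#   expert_mapping = FusedMoE.make_expert_params_mapping(...)  # illustrative
#   loader = AutoWeightsLoader(self)
#   loaded = loader.load_weights(weights, expert_mapping=expert_mapping)
```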
Performance

Before this PR the Transformers backend for MoEs was 24x slower than the vLLM implementation. This is largely because the Transformers modelling code was not CUDA graphs compilable (`--enforce-eager` was needed) and because of the `for` loop iterating over the experts. This should be fixed by the PRs we depend on, though.

As you can see from the results below, the average performance of the Transformers backend is <3% worse than the dedicated vLLM implementation!
Serve commands:
Benchmark command:
Results: