`FusedMoE` support for the Transformers backend #22650
Conversation
Code Review

This pull request introduces support for Mixture-of-Experts (MoE) models within the Transformers backend. The changes are well structured, including refactoring `ModelConfig` to handle MoE-specific properties, registering new MoE model classes, and adding a `TransformersMoEBase` class to manage the replacement of standard MoE layers with vLLM's `FusedMoE`. A critical issue was found where `top_k` and `intermediate_size` for the `FusedMoE` layer were hardcoded, which would cause issues for many MoE models. A detailed comment with a suggested fix has been provided to address this.
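For context, here is a minimal sketch of how these two values would typically be derived from the HF config rather than hardcoded. The attribute names tried below (`num_experts_per_tok`, `moe_intermediate_size`, `intermediate_size`) vary between MoE model families and are assumptions for illustration, not the exact fix proposed in the review comment.

```python
# Hypothetical helper: derive FusedMoE parameters from the HF config instead
# of hardcoding them. MoE families name these fields differently, so fall
# back through a few common attribute names (assumed, not exhaustive).
def moe_params_from_hf_config(hf_config) -> tuple[int, int]:
    """Return (top_k, intermediate_size) for constructing a FusedMoE layer."""
    top_k = getattr(hf_config, "num_experts_per_tok", None)
    intermediate_size = getattr(
        hf_config, "moe_intermediate_size",
        getattr(hf_config, "intermediate_size", None),
    )
    if top_k is None or intermediate_size is None:
        raise ValueError("Could not infer MoE parameters from the HF config")
    return top_k, intermediate_size
```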
looks nice already!
This pull request has merge conflicts that must be resolved before it can be merged.
Not very familiar with the backend but was curious, just some thoughts
Thanks for the interest! Slowly but surely I'm bolting on more and more functionality to it 🚀
MoE models do already work with the Transformers backend, but the performance is not ideal because `FusedMoE` is not used. This means that each expert gets a dedicated linear layer which is executed in a `for` loop on the Transformers side.

Depends on upstream PRs ensuring that `FusedMoE.forward` is called correctly in the Transformers modelling code.
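To make the performance gap concrete, here is a hedged sketch contrasting the per-expert `for` loop described above with the fused call this PR enables. The `NaiveMoE` module and the commented `FusedMoE` arguments are illustrative only; they do not mirror the exact Transformers or vLLM code.

```python
import torch
import torch.nn as nn

# Naive pattern in plain Transformers MoE code: one linear-based expert per
# ModuleList entry, executed in a Python for loop over the selected experts.
class NaiveMoE(nn.Module):
    def __init__(self, num_experts: int, hidden: int, intermediate: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, intermediate, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate, hidden, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden)
        logits = self.router(x)
        weights, indices = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # The slow part: a Python loop dispatching tokens expert by expert.
        for expert_id, expert in enumerate(self.experts):
            mask = indices == expert_id
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

# With this PR, the ModuleList above is swapped for vLLM's FusedMoE, which
# performs routing and expert computation in fused kernels, roughly:
#
#   fused = FusedMoE(num_experts=..., top_k=..., hidden_size=...,
#                    intermediate_size=..., ...)  # illustrative arguments
#   out = fused(hidden_states, router_logits)
```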
Changes

This PR adds:

- MoE model class selection in `ModelConfig._get_transformers_backend_cls` based on the number of experts reported by the HF config
- `FusedMoE.weight_loader` handling to `AutoWeightsLoader.load_weights`, which is only triggered if `expert_mapping` is passed to `AutoWeightsLoader.load_weights` (i.e. it's opt-in, so it won't break any existing custom weight loading)
- `TransformersMoEBase`, which can be subclassed and adds the necessary logic to substitute in `FusedMoE` (see the sketch after this list):
  - modules named `experts` which are instances of `torch.nn.ModuleList` are replaced with `FusedMoE` modules
  - `load_weights` passes the `expert_mapping` directly to `AutoWeightsLoader.load_weights`
- `TransformersMoEModel`, `TransformersMoEForCausalLM` and `TransformersMoEForMultimodalLM`, which leverage Python's MRO to add the MoE logic to their non-MoE counterparts
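Below is a rough sketch of the substitution and opt-in weight-loading flow described in the list above. The helper name `make_fused_moe`, and the exact `FusedMoE` and `AutoWeightsLoader` call signatures in the comments, are assumptions for illustration rather than the PR's real implementation.

```python
import torch.nn as nn

# Hypothetical sketch of the TransformersMoEBase substitution: walk the HF
# model and, wherever a submodule named "experts" is an nn.ModuleList, swap
# it for a single fused module. `make_fused_moe` stands in for the real
# FusedMoE construction, which also needs quantization/parallelism config.
def replace_expert_lists(model: nn.Module, make_fused_moe) -> None:
    for module in model.modules():
        for child_name, child in list(module.named_children()):
            if child_name == "experts" and isinstance(child, nn.ModuleList):
                setattr(module, child_name, make_fused_moe(num_experts=len(child)))

# Weight loading stays opt-in: expert weights are only routed through
# FusedMoE.weight_loader when an expert mapping is passed in. The calls below
# mirror the description above, not necessarily the exact vLLM signatures.
#
#   expert_mapping = FusedMoE.make_expert_params_mapping(...)  # illustrative
#   loader = AutoWeightsLoader(self)
#   loaded = loader.load_weights(weights, expert_mapping=expert_mapping)
```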
Performance

Before this PR the Transformers backend for MoEs was 24x slower than the vLLM implementation. This is largely because the Transformers modelling code was not CUDA graphs compilable (`--enforce-eager` was needed) and because of the `for` loop iterating over the experts. This should be fixed by the PRs we depend on, though.

As you can see from the results below, the average performance of the Transformers backend is <3% worse than the dedicated vLLM implementation!
Serve commands:
Benchmark command:
Results: