🚀 The feature, motivation and pitch
🎉 #18343 introduces dynamic Expert Parallelism Load Balancing (EPLB) for DeepSeek-V2/V3/R1 models.
As MoE (Mixture-of-Experts) models become more common, we’d love help extending EPLB support to other MoE models—such as Qwen3, Llama 4, and more.
This is a great good first issue for anyone interested in model internals or systems work. #18343 was built with generality in mind, so extending it to other models or quantization methods should be relatively straightforward.
✅ How to add support for a new model
Implement the `MixtureOfExperts` protocol. Specifically, you'll need to:
- Expose the relevant MoE configuration flags.
- Provide access to the expert weights for EPLB to rearrange.
- Forward the EPLB-related arguments into the `FusedMoE` layer (see the sketch below).
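For orientation, here is a rough sketch of what that wiring might look like. The EPLB-specific names (`enable_eplb`, `num_redundant_experts`, the `expert_weights` attribute) are placeholders inferred from the description above, not the exact interface; treat the `MixtureOfExperts` protocol definition and `deepseek_v2.py` as the source of truth.

```python
# Illustrative sketch only. The EPLB-specific names below (enable_eplb,
# num_redundant_experts, expert_weights) are placeholders; check the
# MixtureOfExperts protocol and deepseek_v2.py for the real interface.
import torch.nn as nn

from vllm.model_executor.layers.fused_moe import FusedMoE


class MyMoEBlock(nn.Module):
    """One decoder layer's MoE block, with EPLB arguments passed through."""

    def __init__(self, config, quant_config=None, enable_eplb: bool = False):
        super().__init__()
        # (3) Forward the EPLB-related arguments into the FusedMoE layer.
        self.experts = FusedMoE(
            num_experts=config.n_routed_experts,
            top_k=config.num_experts_per_tok,
            hidden_size=config.hidden_size,
            intermediate_size=config.moe_intermediate_size,
            quant_config=quant_config,
            enable_eplb=enable_eplb,                              # assumed kwarg
            num_redundant_experts=config.num_redundant_experts,  # assumed kwarg
        )


class MyMoEModel(nn.Module):
    """Hypothetical model-level wiring for the MixtureOfExperts protocol."""

    def __init__(self, config, enable_eplb: bool = False):
        super().__init__()
        self.layers = nn.ModuleList(
            MyMoEBlock(config, enable_eplb=enable_eplb)
            for _ in range(config.num_hidden_layers))

        # (1) Expose the MoE configuration that EPLB needs.
        self.num_moe_layers = config.num_hidden_layers
        self.num_logical_experts = config.n_routed_experts
        self.num_redundant_experts = config.num_redundant_experts

        # (2) Give EPLB access to the expert weights it will rearrange,
        #     one group of weight tensors per MoE layer.
        self.expert_weights = [
            [layer.experts.w13_weight, layer.experts.w2_weight]
            for layer in self.layers
        ]
```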
📌 Note on weight loading:
For models with redundant experts, you'll need to carefully adjust the weight-loading logic. `FusedMoE` returns an `expert_params_mapping` that reflects expert duplication, but you may need to modify the model class to ensure correct loading behavior.
🔎 Example: See how it's done in `deepseek_v2.py`.
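As a rough guide (not the exact DeepSeek-V2 code), the expert-weight part of `load_weights` typically follows the pattern below. The `num_redundant_experts` argument and the handling of duplicated mapping entries are assumptions to verify against `deepseek_v2.py`.

```python
# Illustrative sketch of a load_weights method; non-expert weights are
# omitted, and the num_redundant_experts kwarg is an assumption.
import torch.nn as nn

from vllm.model_executor.layers.fused_moe import FusedMoE


class MyMoEModelForCausalLM(nn.Module):  # hypothetical model class

    def load_weights(self, weights):
        params_dict = dict(self.named_parameters())

        # The mapping reflects expert duplication: with redundant experts,
        # one checkpoint expert can map to several physical expert slots.
        expert_params_mapping = FusedMoE.make_expert_params_mapping(
            ckpt_gate_proj_name="gate_proj",
            ckpt_down_proj_name="down_proj",
            ckpt_up_proj_name="up_proj",
            num_experts=self.config.n_routed_experts,
            num_redundant_experts=self.num_redundant_experts,  # assumed kwarg
        )

        for name, loaded_weight in weights:
            for (param_name, weight_name, expert_id,
                 shard_id) in expert_params_mapping:
                if weight_name not in name:
                    continue
                mapped_name = name.replace(weight_name, param_name)
                param = params_dict[mapped_name]
                # expert_id refers to the physical slot; the same checkpoint
                # tensor may need to land in multiple slots, so do not assume
                # a single match per weight name.
                param.weight_loader(param, loaded_weight, mapped_name,
                                    shard_id=shard_id, expert_id=expert_id)
```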
❗️Accuracy tests:
Since modifying the weight loader can be tricky, we suggest including an accuracy test (e.g., on GSM8K) in the PR to ensure the weight-loading process remains intact.
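One convenient way to run such a check (not mandated by this issue) is the lm-evaluation-harness vLLM backend. A minimal sketch, with a placeholder model name and engine arguments:

```python
# Minimal GSM8K check via lm-evaluation-harness (pip install lm_eval).
# The model name and engine arguments are placeholders; add the EPLB engine
# flags from #18343 to model_args when exercising the EPLB path.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=your-org/your-moe-model,tensor_parallel_size=4",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```

Comparing the score with and without redundant experts enabled gives a quick sanity check that weight loading is still intact.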
✅ How to add support for quantized models
This is usually even easier—just make sure EPLB-related arguments are properly forwarded in your quantization path.
🔎 Example: See `fp8.py` for a minimal working change.
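The gist, sketched below with placeholder parameter names: whatever EPLB-related arguments `FusedMoE` hands to the quantized method's `apply()` must be accepted and passed on rather than dropped. The real argument names and the call they feed into are in `fp8.py` from #18343.

```python
# Illustrative sketch only: enable_eplb and expert_load_view are placeholder
# names for the EPLB-related arguments; see fp8.py in #18343 for the real
# signature of the quantized apply() path.
from typing import Optional

import torch

from vllm.model_executor.layers.fused_moe import FusedMoEMethodBase


class MyQuantMoEMethod(FusedMoEMethodBase):  # hypothetical quant method
    # create_weights and other required methods omitted for brevity.

    def apply(
        self,
        layer: torch.nn.Module,
        x: torch.Tensor,
        router_logits: torch.Tensor,
        top_k: int,
        *,
        enable_eplb: bool = False,                        # placeholder
        expert_load_view: Optional[torch.Tensor] = None,  # placeholder
        **kwargs,
    ) -> torch.Tensor:
        # The important part: accept the EPLB-related arguments and forward
        # them to the expert-selection / fused-experts call, exactly as the
        # unquantized path does. Do not silently drop them.
        return self._quantized_fused_experts(  # hypothetical helper
            layer, x, router_logits, top_k,
            enable_eplb=enable_eplb,
            expert_load_view=expert_load_view,
            **kwargs,
        )
```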
👋 Want to contribute?
We’d love your help in extending EPLB support! Feel free to comment below or open a draft PR—we’re happy to guide you through the process.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.