[Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models #16038
Conversation
Signed-off-by: mgoin <[email protected]>
mgoin force-pushed from ccc1cf5 to 094dd92
if quant_config._is_wNa16_group_channel(weight_quant, input_quant):
    return CompressedTensorsWNA16MoEMethod(quant_config)
# Prefer to use the non-marlin kernel when:
# 1. Many experts (MarlinMoE gives poor performance when >= 16)
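To make the dispatch described in the comment concrete, here is a minimal sketch assuming the class names shown in the diff; the Marlin-method name, the bfloat16 check, and the 16-expert cutoff are assumptions drawn from this review thread, not the PR's exact code:

```python
# Sketch of the selection heuristic, not the PR's exact implementation.
from vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe import (  # noqa: E501
    CompressedTensorsWNA16MarlinMoEMethod,  # assumed name of the Marlin-backed method
    CompressedTensorsWNA16MoEMethod,        # moe_wna16/triton-backed method from this PR
)

MARLIN_MOE_EXPERT_LIMIT = 16  # assumed cutoff, taken from the inline comment


def select_wna16_moe_method(quant_config, num_experts: int, needs_bf16: bool):
    # Prefer moe_wna16 (one fused triton launch) when there are many experts
    # or when bfloat16 is required; otherwise MarlinMoE is fine.
    if num_experts >= MARLIN_MOE_EXPERT_LIMIT or needs_bf16:
        return CompressedTensorsWNA16MoEMethod(quant_config)
    return CompressedTensorsWNA16MarlinMoEMethod(quant_config)
```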
What's the perf gap on Marlin when >= 16 experts?
It starts off low, basically equal at 16 experts, but gets exponentially worse as the expert count increases, such that at >100 experts it is essentially unusable. For the Marlin kernel, a single marlin_gemm_moe call launches at least num_experts CUDA kernels, while the fused_moe triton kernel only needs to launch one CUDA kernel. This makes the Marlin kernel significantly slower than the fused_moe triton kernel. We will improve the Marlin kernel soon!
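As a toy illustration of that launch-count argument (not vLLM code): a Python loop of per-expert GEMMs stands in for the Marlin path, while a single batched matmul stands in for the one-launch fused_moe triton kernel.

```python
# Toy comparison of per-expert launches vs. one batched launch; shapes are made up.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_experts, tokens_per_expert, hidden, inter = 128, 16, 1024, 4096
x = torch.randn(num_experts, tokens_per_expert, hidden, device=device, dtype=torch.bfloat16)
w = torch.randn(num_experts, hidden, inter, device=device, dtype=torch.bfloat16)

# "Marlin-like": one GEMM launch per expert -> num_experts kernel launches.
outs = [x[e] @ w[e] for e in range(num_experts)]

# "fused_moe-like": every expert handled by a single batched launch.
out = torch.bmm(x, w)
```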
I didn't see anything obviously wrong. Is it possible to add some unit tests?
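A hypothetical pytest sketch of the kind of test being asked for; the model id is illustrative and the test actually added to the PR may differ:

```python
# Hypothetical smoke test: load a small compressed-tensors WNA16 MoE checkpoint
# and check that generation runs on the moe_wna16 path.
import pytest
from vllm import LLM, SamplingParams


@pytest.mark.parametrize("model_id", ["nm-testing/tiny-mixtral-w4a16"])  # illustrative repo id
def test_wna16_moe_generates(model_id):
    llm = LLM(model=model_id, dtype="bfloat16", max_model_len=512)
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
    assert outputs and outputs[0].outputs[0].text
```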
Looks good to me, thanks for adding the integration!
Similar to #13236, but now bringing the moe_wna16 kernel to models in the compressed-tensors format. This is crucial for models that have many experts and need good performance, or that require the bfloat16 dtype. We should still move to using the quantization/kernel/ interface to implement this kernel selection properly, but this should be enough to unblock evals on large MoEs for compressed-tensors-format models.

Validated on DeepSeek-R1:

Comparative scores with the Marlin and triton kernels on Mixtral (with support for bfloat16 now):
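A minimal sketch, assuming the lm-evaluation-harness Python API and an illustrative checkpoint id, of how such comparative scores could be reproduced; vLLM picks the compressed-tensors quantization path up from the model's config automatically:

```python
# Illustrative eval run; the pretrained id and task are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Mixtral-8x7B-Instruct-v0.1-quantized.w4a16,dtype=bfloat16",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```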