[Kernels] Overlap shared experts with combine instead of dispatch #24254
Conversation
Code Review
This pull request refactors the MoE kernel to overlap shared expert computation with the combine step instead of the dispatch step, which is a sensible performance optimization since the combine step is typically more time-consuming. This is achieved by introducing a new finalize_async method to the FusedMoEPrepareAndFinalize interface. The changes are well contained, and the implementations for the different backends (DeepEP HT, DeepEP LL, PPLX) are updated accordingly. The core logic change in FusedMoEModularKernel correctly orchestrates the asynchronous finalization with the shared expert computation. My review found one issue with a type hint that should be addressed for correctness.
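For context, here is a minimal sketch of how such an asynchronous finalize split might look. Only the `finalize_async` name comes from this PR; the parameter list, the returned callable, and the method bodies are illustrative assumptions rather than the actual vLLM interface.

```python
# Illustrative sketch only: apart from the finalize_async name introduced by
# this PR, the signatures and bodies below are assumptions, not vLLM's API.

class FusedMoEPrepareAndFinalize:

    def finalize(self, output, fused_expert_output, topk_weights, topk_ids):
        # Synchronous path: launch the combine and immediately wait for it.
        wait = self.finalize_async(output, fused_expert_output,
                                   topk_weights, topk_ids)
        wait()  # block until the combine (all-to-all) has completed

    def finalize_async(self, output, fused_expert_output, topk_weights, topk_ids):
        # Asynchronous path: launch the combine and return a callable that the
        # caller invokes later to wait for completion.  Each backend
        # (DeepEP HT, DeepEP LL, PPLX) provides its own implementation.
        raise NotImplementedError
```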
vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py
/ready
LGTM; it would be good to add lm_eval results, traces, and, if possible, perf numbers to the PR.
This pull request has merge conflicts that must be resolved before it can be merged.
Looks reasonable, @bnellnm. Just one nit.
LGTM
Purpose
Overlap shared expert computation with the combine step of the fused MoE instead of the dispatch step, since combine takes longer.
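Roughly, the intended overlap looks like the sketch below. Only the idea of launching the combine via `finalize_async` and running the shared experts before waiting on it comes from this PR; the function and helper names here are hypothetical.

```python
# Hypothetical orchestration sketch: the helper names (prepare, fused_experts,
# shared_experts, prepare_finalize) are illustrative, not the exact vLLM code.

def moe_forward_with_overlap(kernel, hidden_states, router_logits, shared_experts):
    # 1. Dispatch tokens to their routed experts.
    dispatched = kernel.prepare(hidden_states, router_logits)

    # 2. Run the routed expert computation on the dispatched tokens.
    fused_out = kernel.fused_experts(dispatched)

    # 3. Start the combine (the more expensive all-to-all) asynchronously ...
    wait_for_combine = kernel.prepare_finalize.finalize_async(fused_out)

    # 4. ... and compute the shared experts while the combine is in flight,
    #    instead of overlapping them with the cheaper dispatch as before.
    shared_out = shared_experts(hidden_states)

    # 5. Wait for the combine to finish, then merge routed and shared outputs.
    routed_out = wait_for_combine()
    return routed_out + shared_out
```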
Test Plan
Test Result
cc @SageMoore, @LucasWilkinson