
Conversation

@bnellnm (Contributor) commented Sep 4, 2025

Purpose

Overlap the shared expert computation with the combine step of the fused MoE instead of the dispatch step, since the combine takes longer.
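
As a rough illustration, here is a minimal sketch of the overlap (the `finalize_async`/receiver split reflects the interface added in this PR, but the surrounding names such as `moe_forward`, `shared_experts`, and `fused_experts` are illustrative, not the exact vLLM code):

```python
import torch


def moe_forward(prepare_finalize, fused_experts, shared_experts,
                hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
    # Dispatch tokens to their routed experts and run the fused expert MLPs.
    dispatched = prepare_finalize.prepare(hidden_states, router_logits)
    expert_out = fused_experts(dispatched)

    # Start the combine (all-to-all back to the source ranks) without
    # blocking, then run the shared-expert GEMMs while it is in flight.
    receiver = prepare_finalize.finalize_async(expert_out)
    shared_out = shared_experts(hidden_states)

    # Block on the combined routed-expert output and merge the two paths.
    routed_out = receiver()
    return routed_out + shared_out
```

Previously the shared experts were overlapped with the dispatch; since the combine all-to-all typically takes longer, overlapping with it hides more of the shared-expert work.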

Test Plan

  • Ran lm_eval (gsm8k) against a local server running RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8.

Test Result

local-completions (model=RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8,base_url=http://127.0.0.1:9011/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.76|±  |0.0429|
|     |       |strict-match    |     5|exact_match|↑  | 0.75|±  |0.0435|

cc @SageMoore , @LucasWilkinson

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the MoE kernel to overlap shared expert computation with the combine step instead of the dispatch step, which is a sensible performance optimization as the combine step is typically more time-consuming. This is achieved by introducing a new finalize_async method to the FusedMoEPrepareAndFinalize interface. The changes are well-contained, and the implementations for different backends (DeepEP HT, DeepEP LL, PPLX) are updated accordingly. The core logic change in FusedMoEModularKernel correctly orchestrates the asynchronous finalization with the shared expert computation. My review found one issue with a type hint that should be addressed for correctness.
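
A possible shape of the new hook, shown only to illustrate the pattern (method names follow the description above, but the signatures are assumptions, not the actual vLLM definitions):

```python
from typing import Callable

import torch


class FusedMoEPrepareAndFinalize:
    """Sketch of the prepare/finalize interface; real vLLM signatures differ."""

    def finalize(self, fused_expert_output: torch.Tensor) -> torch.Tensor:
        """Blocking combine: gather routed-expert outputs back to the
        original token order."""
        raise NotImplementedError

    def finalize_async(
        self, fused_expert_output: torch.Tensor
    ) -> Callable[[], torch.Tensor]:
        """Start the combine without blocking and return a receiver that,
        when called, waits for and returns the combined output. A backend
        without true async support could simply wrap the blocking finalize."""
        out = self.finalize(fused_expert_output)
        return lambda: out
```

With this split, `FusedMoEModularKernel` can call `finalize_async`, run the shared experts, and only then invoke the receiver, which is the orchestration this PR adds.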

@bnellnm (Contributor, Author) commented Sep 4, 2025

/ready

@simon-mo added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Sep 8, 2025
@LucasWilkinson (Collaborator) left a comment

LGTM; it would be good to add lm_eval results, traces, and, if possible, perf numbers to the PR.

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@SageMoore (Contributor) left a comment

Looks reasonable, @bnellnm. Just one nit.

@LucasWilkinson (Collaborator) left a comment

LGTM

@DarkLight1337 merged commit dc2979c into vllm-project:main on Sep 18, 2025
44 checks passed
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 18, 2025
…litPR into model_register

* 'model_register' of https://github.com/dsxsteven/vllm_splitPR: (138 commits)
  Retrieve `sliding_window` from text config in Gemma3 MM (vllm-project#25085)
  [Docs] Fix API Reference (vllm-project#25140)
  [Kernel] Better inf handling for grouped topk cu (vllm-project#24886)
  [CLI] Use streaming in CLI chat and completion commands (vllm-project#23769)
  [benchmark] add peak throughput metrics and plot (vllm-project#23867)
  [Spec Decode] Efficient padded speculation (vllm-project#24539)
  [V0 Deprecation] Remove more V0 tests (vllm-project#25117)
  [EPLB] Add EPLB support for hunyuan_v1 (vllm-project#23078)
  [XPU] Whisper model support on XPU Platform (vllm-project#25123)
  Mark prompt logprobs as incompatible with prompt embeds at API level (vllm-project#25077)
  [Model] enable data parallel for InternVL vision encoder (vllm-project#23909)
  [Kernels] Overlap shared experts with combine instead of dispatch (vllm-project#24254)
  [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
  [Core][MM] Cleanup `MultiModalCache` (vllm-project#25006)
  [Docs] Clean up the contributing README (vllm-project#25099)
  [MM Encoder] Apply DP ViT for Qwen3-VL model series (vllm-project#24955)
  [Kernels] Enable DeepGEMM by default (vllm-project#24462)
  [V0 Deprecation] Skip PP test (vllm-project#25128)
  [V0 Deprecation] Remove misc V0 tests (vllm-project#25118)
  [V0 Deprecation] Remove V0 Tracing & Metrics tests (vllm-project#25115)
  ...
@bnellnm deleted the overlap-combine branch on September 18, 2025 at 12:37
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025