[moe](feat): fuse shared expert to moe ops #734

PerryZhang01 · 2025-10-13T12:39:02Z

Purpose

This PR fuses the shared experts into the MoE ops.
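As a rough illustration of what "fusing" can mean here, the sketch below treats the shared expert as one extra expert that every token selects, so a single MoE call replaces the separate shared-expert pass. This is a toy model in plain Python, not the kernel code from this PR. The helper names (`make_expert`, `moe`, `unfused`, `fused`) are hypothetical, and the choice of a fixed weight of 1.0 for the shared expert is an assumption made for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_expert(scale: float) -> Expert:
    """Toy expert that scales its input; a real expert is a gated MLP."""
    return lambda x: [scale * v for v in x]


def moe(x: Vector, topk_ids: Sequence[int], topk_weights: Sequence[float],
        experts: Sequence[Expert]) -> Vector:
    """Weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for expert_id, weight in zip(topk_ids, topk_weights):
        y = experts[expert_id](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


def unfused(x: Vector, topk_ids: Sequence[int], topk_weights: Sequence[float],
            routed: Sequence[Expert], shared: Expert) -> Vector:
    """Baseline: routed MoE call plus a separate shared-expert pass."""
    routed_out = moe(x, topk_ids, topk_weights, routed)
    shared_out = shared(x)
    return [r + s for r, s in zip(routed_out, shared_out)]


def fused(x: Vector, topk_ids: Sequence[int], topk_weights: Sequence[float],
          routed: Sequence[Expert], shared: Expert) -> Vector:
    """Fused variant: append the shared expert to the expert table and to
    every token's top-k list (weight 1.0 is an assumption), then make a
    single MoE call."""
    experts = list(routed) + [shared]
    shared_id = len(routed)
    return moe(x, list(topk_ids) + [shared_id],
               list(topk_weights) + [1.0], experts)
```

In this toy setting both paths give the same result for any routing decision; the potential benefit of the fused form is that the shared expert runs inside the same op as the routed experts instead of as a separate call.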

Test Plan

server:
export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_ASMMOE=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1
model_path="/mnt/raid0/zhangguopeng/models--EmbeddedLLM--deepseek-r1-FP8-Dynamic/snapshots/bba2f4ce814e9b57dc7260c8071f536b5e1bd483/"

vllm serve $model_path \
--tensor-parallel-size 8 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--no-enable-prefix-caching \
--disable-log-requests \
--enable-expert-parallel \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--gpu_memory_utilization 0.9 \
--block-size 1

client (for accuracy):
curl -X POST "http://localhost:8000/v1/completions" \
     -H "Content-Type: application/json" \
     -d '{
         "prompt": "The capital of China", "max_tokens": 100, "temperature": 0, "top_p": 1, "top_k": 0, "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0, "stream": false, "ignore_eos": false, "n": 1, "seed": 123 
 }'
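
The same request can also be issued from Python and repeated to check that the output is stable across runs. The following is a minimal sketch, assuming the server above is listening on `localhost:8000`; the helper names (`http_post`, `completion_text`, `is_deterministic`) are hypothetical and not part of vLLM. Only the `choices[0].text` field, which appears in the response under Test Result, is read. This is a weak sanity check and does not replace a proper accuracy evaluation.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

URL = "http://localhost:8000/v1/completions"

# Same sampling parameters as the curl request above.
PAYLOAD: Dict[str, Any] = {
    "prompt": "The capital of China",
    "max_tokens": 100,
    "temperature": 0,
    "top_p": 1,
    "top_k": 0,
    "repetition_penalty": 1.0,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "stream": False,
    "ignore_eos": False,
    "n": 1,
    "seed": 123,
}

PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_post(url: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def completion_text(response: Dict[str, Any]) -> str:
    """Extract the generated text of the first choice."""
    return response["choices"][0]["text"]


def is_deterministic(post: PostFn, url: str, payload: Dict[str, Any],
                     runs: int = 2) -> bool:
    """Send the same request `runs` times and compare the generated texts."""
    texts = [completion_text(post(url, payload)) for _ in range(runs)]
    return all(text == texts[0] for text in texts)
```

With the server running, `is_deterministic(http_post, URL, PAYLOAD)` performs the check; passing the transport as an argument keeps the comparison logic usable without a live server.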

client (for performance):
model_path="/mnt/raid0/zhangguopeng/models--EmbeddedLLM--deepseek-r1-FP8-Dynamic/snapshots/bba2f4ce814e9b57dc7260c8071f536b5e1bd483/"

vllm serve $model_path \
--tensor-parallel-size 8 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--no-enable-prefix-caching \
--disable-log-requests \
--enable-expert-parallel \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--gpu_memory_utilization 0.9 \
--block-size 1

Test Result

accuracy:
{"id":"cmpl-66a4322ffead454bb9bbf2534b93a2ef","object":"text_completion","created":1760343690,"model":"/mnt/raid0/zhangguopeng/models--EmbeddedLLM--deepseek-r1-FP8-Dynamic/snapshots/bba2f4ce814e9b57dc7260c8071f536b5e1bd483/","choices":[{"index":0,"text":" is Beijing, and Shanghai is its most populous city by urban area population. China is divided into 22 provinces, five autonomous regions, four municipalities, and two semi-autonomous special administrative regions. Hong Kong and Macau are the two special administrative regions.\n\nWhat is the capital of China?\n\nBeijing is the capital of the People's Republic of China and one of the most populous cities in the world.\n\nWhat is the capital of China in 1949?\n\nOn October 1, 1949,","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":105,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}

performance:

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

tjtanaavllm · 2025-10-13T12:54:37Z

@PerryZhang01 Can you run lm_eval to evaluate the accuracy? Thank you.
The steps are here. They evaluate accuracy when the batch size is not 1.
https://github.com/ROCm/vllm/blob/dev/perf/evaluation/README.md

PerryZhang01 · 2025-10-14T05:00:42Z

@PerryZhang01 Can you run lm_eval to evaluate the accuracy? Thank you. The steps is here. It is to evaluate when batch size is not 1. https://github.com/ROCm/vllm/blob/dev/perf/evaluation/README.md

tjtanaavllm · 2025-10-14T05:34:15Z

LGTM

[moe](feat): fuse shared expert to moe ops

1b149e1

PerryZhang01 requested review from charlifu, divakar-amd, gshtras, hongxiayang, maleksan85, mawong-amd, shajrawi and sunway513 as code owners October 13, 2025 12:39

wuhuikx removed request for charlifu, divakar-amd, gshtras, hongxiayang, maleksan85, mawong-amd, shajrawi and sunway513 October 13, 2025 14:37

tjtanaavllm merged commit 0724059 into dev/perf Oct 14, 2025
3 of 5 checks passed

PerryZhang01 deleted the shared_expert branch October 14, 2025 05:52
