@PerryZhang01 PerryZhang01 commented Oct 13, 2025

Purpose

This PR fuses the shared experts into the MoE ops.
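
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of one common way to fuse a shared expert into a routed MoE call. It is an illustration under stated assumptions, not this PR's kernel code: the shared expert is appended as one extra expert that every token selects with weight 1.0, so a single MoE call replaces "routed MoE + separate shared-expert MLP". All names (`make_expert`, `route`, `moe`) are hypothetical, and the toy experts are plain scalings rather than real gated MLPs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Toy expert (assumption): elementwise scaling stands in for an MLP."""
    return lambda x: [scale * v for v in x]


def route(logits, k):
    """Pick the top-k experts by softmax probability."""
    probs = softmax(logits)
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return ids, [probs[i] for i in ids]


def moe(x, experts, topk_ids, topk_weights):
    """Weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for expert_id, weight in zip(topk_ids, topk_weights):
        y = experts[expert_id](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


routed_experts = [make_expert(s) for s in (0.5, 1.0, 2.0, 3.0)]
shared_expert = make_expert(10.0)
x = [1.0, -2.0, 0.5]
ids, weights = route([0.1, 2.0, -1.0, 1.5], k=2)

# Unfused: routed MoE call plus a separate shared-expert call.
unfused = [a + b for a, b in zip(moe(x, routed_experts, ids, weights),
                                 shared_expert(x))]

# Fused: the shared expert becomes expert index len(routed_experts),
# always selected with weight 1.0, so one MoE call covers both.
fused_experts = routed_experts + [shared_expert]
fused_ids = ids + [len(routed_experts)]
fused_weights = weights + [1.0]
fused = moe(x, fused_experts, fused_ids, fused_weights)
```

In this toy setting `fused` and `unfused` agree, which is the property an accuracy test for such a change is meant to confirm on the real model.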

Test Plan

server:
export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_ASMMOE=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1
model_path="/mnt/raid0/zhangguopeng/models--EmbeddedLLM--deepseek-r1-FP8-Dynamic/snapshots/bba2f4ce814e9b57dc7260c8071f536b5e1bd483/"

vllm serve $model_path \
--tensor-parallel-size 8 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--no-enable-prefix-caching \
--disable-log-requests \
--enable-expert-parallel \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--gpu_memory_utilization 0.9 \
--block-size 1

client (for accuracy):
curl -X POST "http://localhost:8000/v1/completions" \
     -H "Content-Type: application/json" \
     -d '{
         "prompt": "The capital of China", "max_tokens": 100, "temperature": 0, "top_p": 1, "top_k": 0, "repetition_penalty": 1.0, "presence_penalty": 0, "frequency_penalty": 0, "stream": false, "ignore_eos": false, "n": 1, "seed": 123 
 }'
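
The same deterministic request can also be sent from Python, which makes it easier to compare the completion text between a baseline build and a build with fused shared experts. This is a sketch using only the standard library; the helper names (`build_payload`, `completion_text`, `query`) are hypothetical, and the URL and sampling parameters are copied from the curl command above.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
URL = "http://localhost:8000/v1/completions"


def build_payload(prompt="The capital of China"):
    """Request body mirroring the curl example (greedy decoding, fixed seed)."""
    return {
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0,
        "top_p": 1,
        "top_k": 0,
        "repetition_penalty": 1.0,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "stream": False,
        "ignore_eos": False,
        "n": 1,
        "seed": 123,
    }


def completion_text(response):
    """Extract the generated text from a parsed /v1/completions response."""
    return response["choices"][0]["text"]


def query(url=URL):
    """POST the payload to a running server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return completion_text(json.load(resp))
```

Calling `query()` once against each build and comparing the two strings gives a quick sanity check. Fused kernels can differ numerically from the unfused path, so small divergences late in a long completion do not by themselves indicate a bug.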

client (for performance):
model_path="/mnt/raid0/zhangguopeng/models--EmbeddedLLM--deepseek-r1-FP8-Dynamic/snapshots/bba2f4ce814e9b57dc7260c8071f536b5e1bd483/"

vllm serve $model_path \
--tensor-parallel-size 8 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--no-enable-prefix-caching \
--disable-log-requests \
--enable-expert-parallel \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--gpu_memory_utilization 0.9 \
--block-size 1

Test Result

accuracy:
{"id":"cmpl-66a4322ffead454bb9bbf2534b93a2ef","object":"text_completion","created":1760343690,"model":"/mnt/raid0/zhangguopeng/models--EmbeddedLLM--deepseek-r1-FP8-Dynamic/snapshots/bba2f4ce814e9b57dc7260c8071f536b5e1bd483/","choices":[{"index":0,"text":" is Beijing, and Shanghai is its most populous city by urban area population. China is divided into 22 provinces, five autonomous regions, four municipalities, and two semi-autonomous special administrative regions. Hong Kong and Macau are the two special administrative regions.\n\nWhat is the capital of China?\n\nBeijing is the capital of the People's Republic of China and one of the most populous cities in the world.\n\nWhat is the capital of China in 1949?\n\nOn October 1, 1949,","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":105,"completion_tokens":100,"prompt_tokens_details":null},"kv_transfer_params":null}

performance:
[image: performance results]



@tjtanaavllm

@PerryZhang01 Can you run lm_eval to evaluate the accuracy? Thank you.
The steps are here; they evaluate accuracy when the batch size is not 1.
https://github.com/ROCm/vllm/blob/dev/perf/evaluation/README.md

@PerryZhang01

> @PerryZhang01 Can you run lm_eval to evaluate the accuracy? Thank you. The steps are here; they evaluate accuracy when the batch size is not 1. https://github.com/ROCm/vllm/blob/dev/perf/evaluation/README.md

[image: lm_eval results]

@tjtanaavllm

LGTM

@tjtanaavllm tjtanaavllm merged commit 0724059 into dev/perf Oct 14, 2025
3 of 5 checks passed
@PerryZhang01 PerryZhang01 deleted the shared_expert branch October 14, 2025 05:52