@yewentao256 yewentao256 commented Oct 27, 2025

Purpose

Fixes @smarterclayton's issue:

vllm serve deepseek-ai/DeepSeek-V2-lite --port=8000 --enable-expert-parallel --enable-eplb --num-redundant-experts=16 --eplb-window-size=100 --eplb-step-interval=100 --eplb-log-balancedness -dp 2

Running this command fails on every DP rank with the same error:

(EngineCore_DP0 pid=1223639) Traceback (most recent call last):
(EngineCore_DP0 pid=1223639)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1223639)     self.run()
(EngineCore_DP0 pid=1223639)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1223639)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/engine/core.py", line 783, in run_engine_core
(EngineCore_DP0 pid=1223639)     raise e
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/engine/core.py", line 766, in run_engine_core
(EngineCore_DP0 pid=1223639)     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1223639)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/engine/core.py", line 1061, in __init__
(EngineCore_DP0 pid=1223639)     super().__init__(
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/engine/core.py", line 538, in __init__
(EngineCore_DP0 pid=1223639)     super().__init__(
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=1223639)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1223639)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/executor/abstract.py", line 98, in __init__
(EngineCore_DP0 pid=1223639)     self._init_executor()
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=1223639)     self.driver_worker.load_model()
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/worker/gpu_worker.py", line 233, in load_model
(EngineCore_DP0 pid=1223639)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/v1/worker/gpu_model_runner.py", line 2932, in load_model
(EngineCore_DP0 pid=1223639)     self.eplb_state = EplbState.build(
(EngineCore_DP0 pid=1223639)                       ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/distributed/eplb/eplb_state.py", line 316, in build
(EngineCore_DP0 pid=1223639)     model.set_eplb_state(
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/model_executor/models/deepseek_v2.py", line 1252, in set_eplb_state
(EngineCore_DP0 pid=1223639)     self.expert_weights.append(layer.get_expert_weights())
(EngineCore_DP0 pid=1223639)                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/layer.py", line 1948, in get_expert_weights
(EngineCore_DP0 pid=1223639)     weight.view(self.local_num_experts, -1)
(EngineCore_DP0 pid=1223639)   File "/home/wentao/vllm-source/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore_DP0 pid=1223639)     return super().__torch_function__(func, types, args, kwargs)
(EngineCore_DP0 pid=1223639)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1223639) RuntimeError: shape '[40, -1]' is invalid for input of size 131072

This PR fixes the error by excluding parameters from non-expert submodules (e.g. gate/shared) when collecting expert weights.
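The failure mode can be reproduced in isolation. EPLB flattens every collected parameter with `view(local_num_experts, -1)`, which only succeeds when the tensor's element count is divisible by the expert count. A minimal sketch (shapes are illustrative, not the actual model's):

```python
import torch

# A non-expert parameter (e.g. a gate weight) with 131072 elements cannot be
# reshaped into 40 expert rows, since 131072 % 40 != 0 -- so view() raises.
local_num_experts = 40
gate_weight = torch.empty(2048, 64)  # 131072 elements

try:
    gate_weight.view(local_num_experts, -1)
except RuntimeError as e:
    print(e)  # shape '[40, -1]' is invalid for input of size 131072
```

This matches the `RuntimeError: shape '[40, -1]' is invalid for input of size 131072` in the traceback above: a gate/shared parameter slipped into the set of tensors that `get_expert_weights` tried to reshape per expert.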

Test

(APIServer pid=1225413) INFO 10-27 09:00:03 [launcher.py:46] Route: /start_profile, Methods: POST
(APIServer pid=1225413) INFO 10-27 09:00:03 [launcher.py:46] Route: /stop_profile, Methods: POST
(APIServer pid=1225413) INFO 10-27 09:00:03 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1225413) INFO:     Started server process [1225413]
(APIServer pid=1225413) INFO:     Waiting for application startup.
(APIServer pid=1225413) INFO:     Application startup complete.

@yewentao256 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Oct 27, 2025
@gemini-code-assist gemini-code-assist bot left a comment
Code Review

The pull request introduces a minor change to vllm/model_executor/layers/fused_moe/layer.py to exclude parameters from non-expert submodules (e.g., gate/shared) when retrieving expert weights. This change addresses a shape issue encountered when using EPLB with DeepSeek-V2-lite. I have added a high severity review comment to ensure the change is correct.
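A hedged sketch of the fix the review summary describes: collect only the layer's own per-expert parameters and skip parameters that live on non-expert submodules such as a gate. All class and attribute names here are illustrative, not the actual vllm implementation.

```python
import torch
from torch import nn


class FusedMoELayerSketch(nn.Module):
    """Illustrative stand-in for a fused MoE layer (names are assumptions)."""

    def __init__(self, local_num_experts: int, hidden: int):
        super().__init__()
        self.local_num_experts = local_num_experts
        # Per-expert weight: leading dim equals the local expert count.
        self.w13_weight = nn.Parameter(
            torch.empty(local_num_experts, 2 * hidden, hidden)
        )
        # Non-expert submodule whose weight must NOT be flattened per expert.
        self.gate = nn.Linear(hidden, local_num_experts, bias=False)

    def get_expert_weights(self):
        # recurse=False restricts iteration to this module's direct
        # parameters, so self.gate.weight is excluded and every remaining
        # tensor can safely be viewed as (local_num_experts, -1).
        return [
            p.view(self.local_num_experts, -1)
            for _, p in self.named_parameters(recurse=False)
        ]


layer = FusedMoELayerSketch(local_num_experts=40, hidden=16)
print([w.shape for w in layer.get_expert_weights()])  # [torch.Size([40, 512])]
```

With the gate's weight excluded, every remaining tensor reshapes cleanly into one row per local expert, which is what the EPLB state-building path expects.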

@DarkLight1337 DarkLight1337 merged commit 0484b64 into main Oct 28, 2025
53 checks passed
@DarkLight1337 DarkLight1337 deleted the wentao-fix-shape-issue-for-eplb-expert branch October 28, 2025 12:44
bhagyashrigai pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Oct 29, 2025
Signed-off-by: yewentao256 <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: Bhagyashri <[email protected]>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ready ONLY add when PR is ready to merge/full CI is needed

4 participants