[EPLB] Add EPLB support for hunyuan_v1 #23078
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review

This pull request adds EPLB support for the hunyuan_v1 model, which is a great enhancement for improving throughput in MoE models. The changes primarily involve adapting the model to the `MixtureOfExperts` interface and handling the complexities of expert weight loading with redundancy. The implementation is mostly solid, but I've identified a couple of issues in the new interface methods that could lead to incorrect behavior, especially in dynamic environments. One is a potential issue with state accumulation in `set_eplb_state`, and the other is a critical assertion in `update_physical_experts_metadata` that would prevent dynamic scaling of workers. Please see my detailed comments for suggestions on how to fix these.
This assertion prevents dynamic scaling of the number of expert parallel workers. The number of local physical experts can change during runtime, for example, when scaling the cluster up or down. This assertion will fail in such scenarios, preventing the model from adapting to the new configuration. This should be removed to allow for dynamic changes in the number of local experts.
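For reference, a minimal sketch of what the method could look like once the assertion is dropped (the field and attribute names here are assumptions modeled on other MoE implementations in this repo, not the exact code in this PR):

```python
def update_physical_experts_metadata(
    self,
    num_physical_experts: int,
    num_local_physical_experts: int,
) -> None:
    # No assertion on the previous num_local_physical_experts: the value
    # may legitimately change when EP workers are scaled up or down.
    self.num_physical_experts = num_physical_experts
    self.num_local_physical_experts = num_local_physical_experts
    self.num_redundant_experts = (
        num_physical_experts - self.num_logical_experts)
    for layer in self.moe_layers:
        # Propagate the new sizes so each MoE layer can rebuild its
        # expert mapping for the updated topology.
        layer.n_physical_experts = num_physical_experts
        layer.n_local_physical_experts = num_local_physical_experts
        layer.n_redundant_experts = self.num_redundant_experts
```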
The `self.expert_weights` list is appended to in this loop without being cleared first. If `set_eplb_state` is called multiple times (e.g., during re-initialization or complex state updates), this will lead to an accumulation of expert weights. This is likely not the intended behavior and can cause issues during expert rebalancing. You should clear the list at the beginning of this method to ensure it only contains the weights from the current state, similar to how it's done in other MoE model implementations in this repository.
Suggested change:

```diff
 def set_eplb_state(
     self,
     expert_load_view: torch.Tensor,
     logical_to_physical_map: torch.Tensor,
     logical_replica_count: torch.Tensor,
 ) -> None:
+    self.expert_weights.clear()
     for layer_idx, layer in enumerate(self.moe_layers):
         self.expert_weights.append(layer.get_expert_weights())
         # Register the expert weights.
         layer.set_eplb_state(
             moe_layer_idx=layer_idx,
             expert_load_view=expert_load_view,
             logical_to_physical_map=logical_to_physical_map,
             logical_replica_count=logical_replica_count,
         )
```
@abmfy This PR adds EPLB support to hunyuan. I used tencent/Hunyuan-A13B-Instruct-FP8 for testing and had to skip `w13_input_scale` and `w2_input_scale` in `get_expert_weights` as they are scalars and cannot be reshaped to `(self.local_num_experts, -1)`. Let me know if you have other suggestions :)
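(As a quick illustration of the reshape failure, using 8 as a stand-in for `self.local_num_experts`:)

```python
import torch

scale = torch.tensor(0.5)   # 0-dim "scalar" input scale, as observed in layer.py
try:
    scale.reshape(8, -1)    # stand-in for (self.local_num_experts, -1)
except RuntimeError as e:
    # A single element cannot be split across 8 experts.
    print(e)
```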
Thanks for the contribution! I haven’t had time to run an accuracy test yet, but you shouldn’t skip the scale tensors -- otherwise, the weights and scales of the physical experts will become misaligned after a rearrangement. If I recall correctly, the scale tensor looks like this:

```python
w13_input_scale = torch.nn.Parameter(
    torch.ones(num_experts, dtype=torch.float32),
    requires_grad=False,
)
```

So it’s still a tensor with a dimension of `num_experts`. Could you attach an accuracy test on GSM8K after fixing this issue?
Thanks for the review @abmfy! I do see that `w13_input_scale` is set to `torch.ones(num_experts, dtype=torch.float32)` in fp8.py. However, when I print out tensor shapes in layer.py, it shows that `w13_input_scale` is a scalar. I will try to debug this issue. Let me know if you have any insights in the meantime :)
Hi @666even666, it looks like in fp8.py#L797-L800 the `input_scale`s are converted into a single-element tensor when the activation scheme is `static`, which is the case for Hunyuan-FP8 models.

In this case, these scales don’t need to be transferred since they’re shared across the whole layer. However, we shouldn’t skip them unconditionally, as that might break models using the `dynamic` activation scheme. I think we should instead add a check to skip only one-element tensors (with shape `[]`). What do you think?
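Something along these lines might work (just a sketch of the idea, not the exact diff; it omits whatever other filtering `get_expert_weights` already does):

```python
def get_expert_weights(self) -> list[torch.Tensor]:
    return [
        weight.view(self.local_num_experts, -1)
        for name, weight in self.named_parameters()
        # Skip 0-dim tensors such as the per-layer input_scale produced
        # by the static activation scheme: they are shared across the
        # whole layer, so they don't need to move with individual experts.
        if weight.dim() > 0
    ]
```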
Thanks a lot for spotting it @abmfy! I have added the scalar check as suggested. BTW, is there a recommended way to run an accuracy test on GSM8K (e.g. an existing test script/process)?
I've been using this one and comparing the numbers before/after enabling EPLB:

```bash
PORT=${PORT:-8080}
MODEL="MODEL_NAME_HERE"

OPENAI_API_KEY=EMPTY lm_eval \
    --model local-completions \
    --model_args "base_url=http://localhost:$PORT/v1/completions,model=$MODEL,num_concurrent=4096,trust_remote_code=True" \
    --tasks gsm8k \
    --output_path results/gsm8k.json
```

You'll need to `uv pip install lm_eval[api]` first.
Thank you so much for the suggestion! I have attached test results below :)
Hi @abmfy, here is the accuracy test result on GSM8K:
LGTM! Thanks for the contribution 🎉

One last thing: have you confirmed that the balancedness score increases with `log_balancedness: true` set in `EplbConfig`? If so, I think we're good to go and can merge once CI passes.

cc @simon-mo Could you please approve to trigger CI? Thank you!
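(For anyone reproducing the check: a rough sketch of wiring this up offline. The argument names and the import path for `EplbConfig` are assumptions and may differ across vLLM versions, so treat this as illustrative rather than the supported API.)

```python
from vllm import LLM
from vllm.config import EplbConfig  # import path is an assumption

llm = LLM(
    model="tencent/Hunyuan-A13B-Instruct-FP8",
    tensor_parallel_size=2,          # matches the EP=2 setup discussed above
    enable_expert_parallel=True,
    enable_eplb=True,
    # log_balancedness logs the balancedness score during serving, so you
    # can check that it trends upward while traffic is flowing.
    eplb_config=EplbConfig(log_balancedness=True),
)
```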
This pull request has merge conflicts that must be resolved before it can be merged.
@abmfy Thanks for the review! From the screenshot I took, I do see the balancedness score increased, but not in a monotonic way. Do you think this is expected? (I am using an EP size of 2)
The figures here aren’t very informative since we only used a single prompt with EP=2. Ideally, we should run benchmarks with larger batches and a larger-scale EP setup; otherwise, the improvements brought by EPLB may not be very apparent. That said, I believe the model adaptation is working well, so we’re good to proceed. Thanks again for the contribution! 🎉
This pull request has merge conflicts that must be resolved before it can be merged.
Ahh I see. That makes a lot of sense. Thanks a lot @abmfy for your help and support!!
Looks like we still need another review to merge the PR. @abmfy, do you know someone who can help review the change? :)
@simon-mo Could you please take a look and help merge this if it doesn’t affect other parts of the codebase? Thanks a lot!
Purpose
This pull request adds EPLB support for the hunyuan_v1 model, which helps to improve overall throughput during LLM serving.
#20468
Test Plan
execute
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.