[Misc] DeepEPHighThroughput - Enable Inductor pass #21311
Conversation
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Code Review
This pull request re-enables the Torch Inductor pass for the DeepEPHighThroughput backend. The feature was previously disabled due to correctness issues with specific models, and this change removes the corresponding workaround. The provided test results for deepseek-v2-lite and Qwen/Qwen3-30B-A3B-FP8 support the claim that the underlying issue has been resolved. The change is straightforward and improves both performance and code maintainability. I see no issues with this change.
Purpose
The DeepEPHighThroughput All2All kernels combined with the Torch Inductor pass previously yielded incorrect results for deepseek-v2-lite models, so the Inductor pass was disabled for that backend. This PR re-enables it, as the issue no longer reproduces.
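For context, the workaround being removed amounted to a backend-specific opt-out of the Inductor pass. The sketch below is purely illustrative (the function name and defaults are hypothetical, not vLLM's actual internals); only the `VLLM_ALL2ALL_BACKEND` environment variable and the `deepep_high_throughput` value come from the commands in this PR:

```python
import os


def inductor_pass_enabled(all2all_backend: str) -> bool:
    """Decide whether the Torch Inductor fusion pass should run.

    Hypothetical sketch of the guard this PR removes: previously the pass
    was force-disabled for the DeepEPHighThroughput all2all backend because
    of the deepseek-v2-lite correctness issue; with this change the backend
    no longer opts out.
    """
    # Old behaviour (removed by this PR): skip the pass for this backend.
    # if all2all_backend == "deepep_high_throughput":
    #     return False
    return True


if __name__ == "__main__":
    # "naive" is a placeholder default for illustration only.
    backend = os.environ.get("VLLM_ALL2ALL_BACKEND", "naive")
    print(f"Inductor pass enabled for backend {backend!r}: "
          f"{inductor_pass_enabled(backend)}")
```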
Test Plan
e2e tests:
Server command:
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 canhazgpu run -g 2 -- vllm serve $MODEL --trust-remote-code --enable-expert-parallel --data-parallel-size 2 --port 9010 --no-enable-prefix-caching
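Once the server is up, a quick sanity check against the OpenAI-compatible completions endpoint can be done with a short script like the one below (a minimal sketch using `requests`; the port and `$MODEL` follow the serve command above, and the prompt is arbitrary):

```python
import os

import requests

# Assumes the `vllm serve` command above is running on port 9010.
MODEL = os.environ["MODEL"]

resp = requests.post(
    "http://127.0.0.1:9010/v1/completions",
    json={
        "model": MODEL,
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```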
lm_eval command:
lm_eval --model local-completions --tasks gsm8k --model_args model=$MODEL,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
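The same evaluation can also be driven from Python through the lm-evaluation-harness API. This is a sketch, not part of the PR's test plan; the arguments mirror the CLI flags above, and the exact keyword names may vary across `lm_eval` versions:

```python
import os

import lm_eval

MODEL = os.environ["MODEL"]

# Mirrors: lm_eval --model local-completions --tasks gsm8k --limit 100 ...
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        f"model={MODEL},"
        "base_url=http://127.0.0.1:9010/v1/completions,"
        "num_concurrent=30,max_retries=3"
    ),
    tasks=["gsm8k"],
    limit=100,
)
print(results["results"]["gsm8k"])
```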
Test Result
Model = deepseek-ai/DeepSeek-V2-Lite
Model = Qwen/Qwen3-30B-A3B-FP8
Results are similar to what we see on main.
(Optional) Documentation Update