refactor: abstract graph mode support into platform interface #25161
Conversation
Code Review
This pull request effectively refactors the platform-specific graph mode support into a unified `support_graph_mode` interface method. The changes in `vllm/config/__init__.py`, `vllm/platforms/cuda.py`, `vllm/platforms/rocm.py`, and `vllm/platforms/interface.py` are clean and improve modularity. However, there is a logical contradiction in the implementation for the XPU platform. I've left a specific comment with a suggestion to resolve it.
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM, thanks for the work!
Please merge from main to resolve the conflicts.
Great cleanup, thanks! Can we just wait for #24281 to land first (it's time sensitive) and then rebase this PR on top of that one?
Introduces a `support_graph_mode` method to the `Platform` interface to centralize the logic for determining if a backend supports graph execution. This change replaces hardcoded checks for CUDA-like or XPU platforms with a single call to the new interface method. This improves modularity and simplifies adding graph mode support for future hardware backends. Signed-off-by: Yizhou Liu <[email protected]>
Renames the platform method to more accurately reflect that it checks for static graph support, such as CUDA graphs. Updates the XPU platform to correctly report that it does not support static graphs. The runtime fallback for `cudagraph_mode` on XPU is also replaced with an assertion to enforce correct configuration. Signed-off-by: Yizhou Liu <[email protected]>
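For illustration, the config-time assertion described in this commit might look roughly like the sketch below; the method name `support_static_graph_mode` and the helper function are assumptions inferred from the commit message, not the merged code.

```python
# Hypothetical sketch: fail fast at config time instead of silently falling
# back at runtime when a graph mode is requested on an unsupported platform.
from vllm.platforms import current_platform


def verify_cudagraph_mode(cudagraph_mode: str) -> None:
    if cudagraph_mode != "NONE":
        # Assumed interface method; the exact name may differ in the repo.
        assert current_platform.support_static_graph_mode(), (
            f"cudagraph_mode={cudagraph_mode!r} requires static graph "
            "capture, which the current platform does not support."
        )
```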
Force-pushed from a9706b5 to 73270a4.
Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as one static graph also mitigates the dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured.

**Known issues:**
1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay; we're working on a fix. There may be other corner cases. This PR is the first in a planned series; we'll continue to iterate and address remaining issues in follow-ups.

This is essentially a port of #1503 and #1677, but includes two major changes:
1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

Signed-off-by: Yizhou Liu <[email protected]>
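As a usage sketch only (the model name below is an arbitrary placeholder, and passing `compilation_config` as a plain dict is assumed to work as in the snippet above), the new mode could be exercised from the offline API like this:

```python
from vllm import LLM

# Hypothetical example: request full-graph capture for decode-only batches
# on a GQA/MHA model (MLA models are excluded by this PR).
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```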
Purpose
Introduces a `support_graph_mode` method to the `Platform` interface to centralize the logic for determining if a backend supports graph execution. This change replaces hardcoded checks for CUDA-like or XPU platforms with a single call to the new interface method. This improves modularity and simplifies adding graph mode support for future hardware backends.
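To make the shape of the change concrete, a minimal sketch is shown below. It uses the name `support_graph_mode` from this description (a later commit renames it to reflect static-graph support); the class layout and signatures are assumptions for illustration, not the merged diff.

```python
# Minimal sketch, assuming the names given in the PR description.
class Platform:
    @classmethod
    def support_graph_mode(cls) -> bool:
        """Whether this backend can capture and replay execution graphs
        (e.g. CUDA graphs). Backends default to False unless they opt in."""
        return False


class CudaPlatform(Platform):
    @classmethod
    def support_graph_mode(cls) -> bool:
        return True  # CUDA graphs are supported


class RocmPlatform(Platform):
    @classmethod
    def support_graph_mode(cls) -> bool:
        return True  # HIP graphs are treated the same way
```

The config layer then asks `current_platform.support_graph_mode()` instead of hardcoding `is_cuda_alike()` or XPU checks, so a new backend only has to override one method on its platform class.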
Test Plan
No further tests needed.
Test Result
None