
Conversation

yiz-liu
Contributor

@yiz-liu yiz-liu commented Sep 18, 2025

Purpose

Introduces a `support_graph_mode` method to the `Platform` interface to centralize the logic for determining whether a backend supports graph execution.

This change replaces hardcoded checks for CUDA-like or XPU platforms with a single call to the new interface method, which improves modularity and simplifies adding graph mode support for future hardware backends.
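
For illustration, a minimal sketch of the idea under simplified, assumed signatures; only the `support_graph_mode` name comes from this PR, while the platform classes and the call site below are placeholders:

```python
# Sketch only: simplified, assumed signatures, not the exact vLLM code.

class Platform:
    """Base interface implemented by each hardware backend."""

    @classmethod
    def support_graph_mode(cls) -> bool:
        # Default: a backend does not support graph capture/replay.
        return False


class CudaPlatform(Platform):
    @classmethod
    def support_graph_mode(cls) -> bool:
        # CUDA-like backends support graph execution (CUDA graphs).
        return True


def resolve_graph_mode(platform: Platform, requested: bool) -> bool:
    # Call site: replaces hardcoded "is this CUDA-like or XPU?" checks
    # with a single query against the platform interface.
    return requested and platform.support_graph_mode()
```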

Test Plan

No further tests needed.

Test Result

None


Essential Elements of an Effective PR Description Checklist
  • [√] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [×] The test plan, such as providing test command.
  • [×] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively refactors the platform-specific graph mode support into a unified support_graph_mode interface method. The changes in vllm/config/__init__.py, vllm/platforms/cuda.py, vllm/platforms/rocm.py, and vllm/platforms/interface.py are clean and improve modularity. However, there is a logical contradiction in the implementation for the XPU platform. I've left a specific comment with a suggestion to resolve it.


mergify bot commented Sep 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yiz-liu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 19, 2025
Member

@yewentao256 yewentao256 left a comment

LGTM, thanks for the work!
Please merge from main to resolve the conflicts.

Collaborator

@ProExpertProg ProExpertProg left a comment

Great cleanup, thanks! Can we just wait for #24281 to land first (it's time sensitive) and then rebase this PR on top of that one?

Introduces a `support_graph_mode` method to the `Platform` interface to centralize the logic for determining if a backend supports graph execution.

This change replaces hardcoded checks for CUDA-like or XPU platforms with a single call to the new interface method. This improves modularity and simplifies adding graph mode support for future hardware backends.

Signed-off-by: Yizhou Liu <[email protected]>
Renames the platform method to more accurately reflect that it checks for static graph support, such as CUDA graphs.

Updates the XPU platform to correctly report that it does not support static graphs. The runtime fallback for `cudagraph_mode` on XPU is also replaced with an assertion to enforce correct configuration.

Signed-off-by: Yizhou Liu <[email protected]>
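
A rough sketch of what this second change describes; the renamed method (`support_static_graph_mode` below) and the `cudagraph_mode` values are assumptions rather than the exact vLLM API:

```python
# Hypothetical sketch: names are assumed, not copied from the real code.

class XpuPlatform:
    @classmethod
    def support_static_graph_mode(cls) -> bool:
        # XPU reports that it does not support static (CUDA-style) graphs.
        return False


def validate_cudagraph_mode(platform: type, cudagraph_mode: str) -> None:
    # Instead of silently falling back at runtime, fail fast when a graph
    # mode is requested on a platform without static graph support.
    assert cudagraph_mode == "NONE" or platform.support_static_graph_mode(), (
        f"cudagraph_mode={cudagraph_mode!r} requires static graph support, "
        f"which {platform.__name__} does not provide"
    )


validate_cudagraph_mode(XpuPlatform, "NONE")  # passes
# validate_cudagraph_mode(XpuPlatform, "FULL_DECODE_ONLY")  # would raise
```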
@yiz-liu yiz-liu force-pushed the support-oot-graphmode branch from a9706b5 to 73270a4 Compare September 22, 2025 03:10
@mergify mergify bot removed the needs-rebase label Sep 22, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 22, 2025 09:05
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 22, 2025
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Sep 22, 2025
Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo (a conceptual sketch follows
this list).
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.
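
A conceptual, fully hypothetical sketch of the dispatcher idea in point 1; none of these names mirror vllm-ascend internals:

```python
# Conceptual sketch only: GraphMode/GraphDispatcher are invented names.
from enum import Enum


class GraphMode(Enum):
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"


class GraphDispatcher:
    """Chooses how a forward pass runs, so the backend no longer
    hard-codes a single graph strategy."""

    def __init__(self, mode: GraphMode) -> None:
        self.mode = mode

    def dispatch(self, is_decode_only: bool) -> str:
        if self.mode is GraphMode.FULL_DECODE_ONLY and is_decode_only:
            # Replay one static graph covering the whole decode step.
            return "full_graph"
        if self.mode is GraphMode.PIECEWISE:
            # Piecewise-captured subgraphs.
            return "piecewise_graph"
        return "eager"
```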

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```
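
For context, a hypothetical end-to-end use of that option through vLLM's offline `LLM` API; the model name is a placeholder, and whether `FULL_DECODE_ONLY` is actually available depends on the backend/plugin (here, vllm-ascend) and version in use:

```python
# Illustrative usage only; model choice and availability of the mode are
# assumptions, not guarantees from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder GQA/MHA model
    compilation_config={
        # Capture the decode pass as one static graph and replay it whole.
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```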

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

---------

Signed-off-by: Yizhou Liu <[email protected]>
@DarkLight1337 DarkLight1337 merged commit b6f01bd into vllm-project:main Sep 22, 2025
51 checks passed
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
@yiz-liu yiz-liu deleted the support-oot-graphmode branch October 13, 2025 11:42

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

None yet

Development


6 participants