refactor: abstract graph mode support into platform interface #25161
Conversation
Code Review
This pull request effectively refactors the platform-specific graph mode support into a unified `support_graph_mode` interface method. The changes in `vllm/config/__init__.py`, `vllm/platforms/cuda.py`, `vllm/platforms/rocm.py`, and `vllm/platforms/interface.py` are clean and improve modularity. However, there is a logical contradiction in the implementation for the XPU platform. I've left a specific comment with a suggestion to resolve it.
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM, thanks for the work!
Please merge from main to resolve the conflicts.
Great cleanup, thanks! Can we just wait for #24281 to land first (it's time sensitive) and then rebase this PR on top of that one?
Introduces a `support_graph_mode` method to the `Platform` interface to centralize the logic for determining if a backend supports graph execution. This change replaces hardcoded checks for CUDA-like or XPU platforms with a single call to the new interface method. This improves modularity and simplifies adding graph mode support for future hardware backends. Signed-off-by: Yizhou Liu <[email protected]>
Renames the platform method to more accurately reflect that it checks for static graph support, such as CUDA graphs. Updates the XPU platform to correctly report that it does not support static graphs. The runtime fallback for `cudagraph_mode` on XPU is also replaced with an assertion to enforce correct configuration. Signed-off-by: Yizhou Liu <[email protected]>
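For illustration, the config-time assertion described in this commit might look roughly like the sketch below; the method name `support_static_graph_mode` and the helper function are assumptions inferred from the commit message, not the merged code.

```python
# Hypothetical sketch: fail fast at config time instead of silently falling
# back at runtime when a graph mode is requested on an unsupported platform.
from vllm.platforms import current_platform


def verify_cudagraph_mode(cudagraph_mode: str) -> None:
    if cudagraph_mode != "NONE":
        # Assumed interface method; the exact name may differ in the repo.
        assert current_platform.support_static_graph_mode(), (
            f"cudagraph_mode={cudagraph_mode!r} requires static graph "
            "capture, which the current platform does not support."
        )
```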
Force-pushed from a9706b5 to 73270a4.
Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as one static graph also mitigates the dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured.

**Known issues:**
1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay; we're working on a fix. There may be other corner cases. This PR is the first in a planned series; we'll continue to iterate and address remaining issues in follow-ups.

This is essentially a port of #1503 and #1677, but includes two major changes:
1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

Signed-off-by: Yizhou Liu <[email protected]>
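As a usage sketch only (the model name below is an arbitrary placeholder, and passing `compilation_config` as a plain dict is assumed to work as in the snippet above), the new mode could be exercised from the offline API like this:

```python
from vllm import LLM

# Hypothetical example: request full-graph capture for decode-only batches
# on a GQA/MHA model (MLA models are excluded by this PR).
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```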
Purpose
Introduces a `support_graph_mode` method to the `Platform` interface to centralize the logic for determining if a backend supports graph execution. This change replaces hardcoded checks for CUDA-like or XPU platforms with a single call to the new interface method. This improves modularity and simplifies adding graph mode support for future hardware backends.
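To make the shape of the change concrete, a minimal sketch is shown below. It uses the name `support_graph_mode` from this description (a later commit renames it to reflect static-graph support); the class layout and signatures are assumptions for illustration, not the merged diff.

```python
# Minimal sketch, assuming the names given in the PR description.
class Platform:
    @classmethod
    def support_graph_mode(cls) -> bool:
        """Whether this backend can capture and replay execution graphs
        (e.g. CUDA graphs). Backends default to False unless they opt in."""
        return False


class CudaPlatform(Platform):
    @classmethod
    def support_graph_mode(cls) -> bool:
        return True  # CUDA graphs are supported


class RocmPlatform(Platform):
    @classmethod
    def support_graph_mode(cls) -> bool:
        return True  # HIP graphs are treated the same way
```

The config layer then asks `current_platform.support_graph_mode()` instead of hardcoding `is_cuda_alike()` or XPU checks, so a new backend only has to override one method on its platform class.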
Test Plan
No further tests needed.
Test Result
None