
Conversation

@yiz-liu yiz-liu commented Jun 28, 2025

What this PR does / why we need it?

This pull request introduces full-graph capture, replacing the previous piecewise-graph approach. Key improvements include:

  • Reduced dispatch latency: By capturing the entire model execution graph at once, we minimize overhead compared to multiple smaller captures.
  • Stabilized multi-GPU performance: Eliminates throughput fluctuations during the MODEL_EXECUTE phase across multiple cards.
  • Stream resource savings: Consolidating graph captures frees up streams, allowing more graphs to be captured concurrently.
Known issues:

  1. Capturing graphs increases GPU memory usage, which can lead to OOM errors or inference hangs.
  2. The new paged-attention implementation relies on the FIA operator, which in certain workloads is slower than the previous approach, resulting in a regression in end-to-end throughput.

There may be other undiscovered corner cases. This PR is the first in a planned series; we will continue to iterate on and address any remaining issues in subsequent submissions.

Does this PR introduce any user-facing change?

```python
compilation_config={
    "full_cuda_graph": True,
},
```
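
For reference, here is a minimal usage sketch of passing this option through vLLM's offline `LLM` API. The model name is a placeholder, and the `cudagraph_capture_sizes` entry is an assumption about vLLM's `CompilationConfig`, shown only as one possible way to limit the extra memory pressure noted in known issue 1; neither comes from this PR.

```python
from vllm import LLM

# Hedged sketch: enable full-graph capture via the compilation config.
# "Qwen/Qwen2.5-7B-Instruct" is a placeholder model. cudagraph_capture_sizes
# is assumed here as a way to cap how many batch-size shapes get captured,
# which may help with the OOM risk mentioned in the known issues.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    compilation_config={
        "full_cuda_graph": True,
        "cudagraph_capture_sizes": [1, 2, 4, 8],
    },
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```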

How was this patch tested?

@yiz-liu yiz-liu force-pushed the feat-full-graph branch 3 times, most recently from 34e1ac7 to 45d59fd on July 4, 2025 02:07
@yiz-liu yiz-liu force-pushed the feat-full-graph branch 2 times, most recently from 5987715 to ffdb493 on July 4, 2025 08:32
@yiz-liu yiz-liu changed the title from "[WIP][Enhancement] Implement primal full graph with limited scenario" to "[Feat] Implement primal full graph with limited scenario" on Jul 4, 2025

github-actions bot commented Jul 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.


huyz-git commented Jul 7, 2025

Should this part also be changed when full graph is enabled? Does it still need to divide by `num_hidden_layers` in full graph mode?

```python
max_num_batch_sizes = math.floor(MAX_CAPTURE_SIZE /
                                 (num_hidden_layers + 1) / parallel_factor)
```
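
For context, a rough sketch of the budget arithmetic behind this question. `MAX_CAPTURE_SIZE`, `num_hidden_layers`, and `parallel_factor` are taken from the snippet above; the concrete values and the full-graph branch are illustrative assumptions, not code from this PR.

```python
import math

# Illustrative values only; not taken from the PR.
MAX_CAPTURE_SIZE = 1024   # assumed total budget for captured graphs/streams
num_hidden_layers = 32
parallel_factor = 2

# Piecewise capture: each layer is captured as its own graph, so the budget
# is shared by roughly (num_hidden_layers + 1) graphs per batch size.
piecewise_batch_sizes = math.floor(
    MAX_CAPTURE_SIZE / (num_hidden_layers + 1) / parallel_factor)

# Full-graph capture: one graph covers the whole model per batch size, so
# dividing by the layer count may no longer apply -- which is the question.
full_graph_batch_sizes = math.floor(MAX_CAPTURE_SIZE / parallel_factor)

print(piecewise_batch_sizes, full_graph_batch_sizes)  # e.g. 15 512
```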

@yiz-liu yiz-liu force-pushed the feat-full-graph branch from 7062bd3 to e4ea639 on July 7, 2025 03:10
@ganyi1996ppo ganyi1996ppo merged commit 04e6169 into vllm-project:v0.9.1-dev Jul 7, 2025
16 checks passed
@yiz-liu yiz-liu changed the title from "[Feat] Implement primal full graph with limited scenario" to "[1/N][Feat] Implement primal full graph with limited scenario" on Jul 7, 2025
@Yikun Yikun added the no-main label Jul 7, 2025
@yiz-liu yiz-liu deleted the feat-full-graph branch July 8, 2025 10:56
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Jul 31, 2025
…t#1503)

This pull request introduces full-graph capture, replacing the previous
piecewise-graph approach. Key improvements include:

* **Reduced dispatch latency:** By capturing the entire model execution
graph at once, we minimize overhead compared to multiple smaller
captures.
* **Stabilized multi-GPU performance:** Eliminates throughput
fluctuations during the `MODEL_EXECUTE` phase across multiple cards.
* **Stream resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured concurrently.
**Known issues:**

1. Capturing larger or more numerous graphs increases GPU memory usage,
which can lead to OOM errors or inference hangs.
2. The new paged-attention implementation relies on the FIA operator,
which in certain workloads is slower than the previous
approach—resulting in a regression in end-to-end throughput.
There may be other undiscovered corner cases. This PR is the first in a
planned series; we will continue to iterate on and address any remaining
issues in subsequent submissions.

```python
compilation_config={
    "full_cuda_graph": True,
},
```

---------

Signed-off-by: Yizhou Liu <[email protected]>
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 1, 2025
…t#1503)

yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 11, 2025
…t#1503)

yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 11, 2025
…t#1503)

yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 12, 2025
…t#1503)

yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 12, 2025
…t#1503)

yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 13, 2025
…t#1503)

yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 15, 2025
…t#1503)

yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Sep 17, 2025
…th the latest design

Revert "[Feat] Implement primal full graph with limited scenario (vllm-project#1503)"

This reverts commit 14660be.

Signed-off-by: Yizhou Liu <[email protected]>
wangxiyuan pushed a commit that referenced this pull request Sep 22, 2025
Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```
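
As a usage note (not part of the quoted commit), a minimal sketch of how this mode might be enabled offline; the model name is a placeholder, and it assumes a vLLM/vllm-ascend build where `compilation_config` accepts `cudagraph_mode` as a string.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: request full-graph capture for decode-only batches.
# "Qwen/Qwen2.5-7B-Instruct" is a placeholder model.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)

params = SamplingParams(max_tokens=32)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```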

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@9607d5e

---------

Signed-off-by: Yizhou Liu <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
…m-project#2128)

Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
…m-project#2128)
