
Conversation

@yiz-liu (Collaborator) commented Jul 8, 2025

What this PR does / why we need it?

Fixes the performance regression in which the FIA (FusedInferAttention) kernel underperformed the PA (PagedAttention) kernel, by enabling dynamic updates of PA parameters during graph replay.
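For readers unfamiliar with why a replayed graph needs explicit parameter updates, here is a minimal, generic sketch of the pattern using PyTorch's CUDA Graphs API purely as an analogy. The buffer names are made up, and the PR's actual mechanism (updating PagedAttention kernel parameters inside a captured NPU graph via torch_npu-specific hooks) is not shown here:

```python
import torch

# A captured graph replays fixed kernel launches, so any value that changes
# per decode step must be written into buffers the captured kernels read
# from, rather than passed in as fresh tensors.
static_input = torch.zeros(8, device="cuda")
static_output = torch.zeros(8, device="cuda")

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    # Stand-in for the attention kernel captured in the graph.
    static_output.copy_(static_input * 2)

for step in range(3):
    # "Dynamic update" before replay: refresh the per-step values in place.
    static_input.fill_(float(step))
    graph.replay()
    print(static_output)  # reflects the values written just before replay
```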

Does this PR introduce any user-facing change?

How was this patch tested?

@yiz-liu changed the title from "[WIP][Feat] Restore paged attention kernel for performance" to "[WIP][Feat] Restore paged attention kernel in Full Graph for performance" on Jul 8, 2025
@wangxiyuan changed the title from "[WIP][Feat] Restore paged attention kernel in Full Graph for performance" to "[0.9.1][WIP][Feat] Restore paged attention kernel in Full Graph for performance" on Jul 10, 2025
@yiz-liu force-pushed the pa branch 3 times, most recently from 054b2db to c40c808 on July 11, 2025 04:12
@yiz-liu changed the title from "[0.9.1][WIP][Feat] Restore paged attention kernel in Full Graph for performance" to "[0.9.1][2/N][Feat] Restore paged attention kernel in Full Graph for performance" on Jul 11, 2025
@ganyi1996ppo merged commit df18f1d into vllm-project:v0.9.1-dev on Jul 11, 2025
16 checks passed
@Yikun added the no-main label on Jul 12, 2025
@yiz-liu deleted the pa branch on July 14, 2025 03:19
@Yikun added the no-test label on Jul 16, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Jul 31, 2025
…erformance (vllm-project#1677)

Rectified the performance regression wherein the FIA kernel
underperformed the PA kernel by enabling dynamic updates of PA
parameters during graph replay.

Signed-off-by: Yizhou Liu <[email protected]>
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 1, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 11, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 11, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 12, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 12, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 13, 2025
yiz-liu added a commit to yiz-liu/vllm-ascend that referenced this pull request Aug 15, 2025
wangxiyuan pushed a commit that referenced this pull request Sep 22, 2025
Note: This depends on [vLLM #25161](vllm-project/vllm#25161) and the torch_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; support for multi-attention models has not been
fully verified yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```
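
For reference, a minimal sketch of how this option could be passed when constructing an engine offline; the model name and prompt are placeholders, and the exact `compilation_config` plumbing may differ across vLLM versions:

```python
from vllm import LLM, SamplingParams

# Hypothetical usage: the model name is only a placeholder; the
# "cudagraph_mode" key selects the full-graph decode capture described above.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```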

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@9607d5e

---------

Signed-off-by: Yizhou Liu <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025