
Conversation

tjtanaa
Contributor

@tjtanaa tjtanaa commented Jan 23, 2025

Description

This PR implements a faster Custom Paged Attention (CPA) kernel based on mfma16x16x16 instructions.
The feature is ported from ROCm/vllm (ROCm#372).
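
As background, the mfma16x16x16 matrix-core primitive lets one 64-lane wavefront compute a full 16x16 fp32 tile of C = A x B (fp16 inputs) per instruction. Below is a minimal, self-contained HIP sketch of that primitive. It is illustrative only (the actual kernel lives in csrc/rocm/attention.cu); the fragment indexing follows the CDNA ISA layout for v_mfma_f32_16x16x16f16, so treat the names and layout details as assumptions.

    #include <hip/hip_runtime.h>

    typedef _Float16 f16x4 __attribute__((ext_vector_type(4)));
    typedef float f32x4 __attribute__((ext_vector_type(4)));

    // One wavefront (64 lanes) computes C = A * B for a single 16x16 tile.
    __global__ void mfma_16x16x16_demo(const _Float16* A,  // 16x16, row-major
                                       const _Float16* B,  // 16x16, row-major
                                       float* C) {         // 16x16, row-major
      const int lane = threadIdx.x & 63;  // lane id within the wavefront
      const int elem = lane % 16;         // row of A / column of B and C
      const int kgrp = lane / 16;         // which group of 4 k-values this lane owns

      f16x4 a, b;
      f32x4 acc = {0.f, 0.f, 0.f, 0.f};
      for (int i = 0; i < 4; ++i) {
        a[i] = A[elem * 16 + (kgrp * 4 + i)];  // A[row][k]
        b[i] = B[(kgrp * 4 + i) * 16 + elem];  // B[k][col]
      }
      // A single matrix-core instruction consumes the whole 16-wide K slice.
      acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, acc, 0, 0, 0);
      for (int i = 0; i < 4; ++i) {
        C[(kgrp * 4 + i) * 16 + elem] = acc[i];  // each lane writes 4 rows of one column
      }
    }

Launched with a single 64-thread block (e.g. hipLaunchKernelGGL(mfma_16x16x16_demo, dim3(1), dim3(64), 0, 0, dA, dB, dC)), one such instruction replaces the 16 scalar FMAs per output element a plain loop would need, which is the kind of throughput the optimized CPA kernel builds on.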

End-to-End Performance gain

Model: Llama-3.1-70B-Instruct
Tensor Parallelism: 1
GPU: MI300X

| CPA version | Input length | Output length | KV cache dtype | Quantization | Num prompts | Req/s | Total tokens/s | Output tokens/s |
|---|---|---|---|---|---|---|---|---|
| before changes | 128 | 128 | fp8_e4m3 | fp8 | 200 | 13.05 | 3340.6 | 1670.3 |
| before changes | 128 | 256 | fp8_e4m3 | fp8 | 200 | 7.56 | 2901.31 | 1934.21 |
| before changes | 128 | 2048 | fp8_e4m3 | fp8 | 200 | 0.78 | 1698.35 | 1598.45 |
| before changes | 512 | 128 | fp8_e4m3 | fp8 | 200 | 6.44 | 4122.57 | 824.51 |
| before changes | 512 | 256 | fp8_e4m3 | fp8 | 200 | 4.48 | 3443.46 | 1147.82 |
| before changes | 512 | 2048 | fp8_e4m3 | fp8 | 200 | 0.66 | 1696.64 | 1357.31 |
| before changes | ShareGPT | — | fp8_e4m3 | fp8 | 1000 | 6.22 | 2574.19 | 1234.64 |
| optimized | 128 | 128 | fp8_e4m3 | fp8 | 200 | 15.11 | 3867.75 | 1933.87 |
| optimized | 128 | 256 | fp8_e4m3 | fp8 | 200 | 9.01 | 3459.98 | 2306.65 |
| optimized | 128 | 2048 | fp8_e4m3 | fp8 | 200 | 1.2 | 2609.04 | 2455.57 |
| optimized | 512 | 128 | fp8_e4m3 | fp8 | 200 | 7.33 | 4694.05 | 938.81 |
| optimized | 512 | 256 | fp8_e4m3 | fp8 | 200 | 5.5 | 4223.29 | 1407.76 |
| optimized | 512 | 2048 | fp8_e4m3 | fp8 | 200 | 1.03 | 2648.55 | 2118.84 |
| optimized | ShareGPT | — | fp8_e4m3 | fp8 | 1000 | 7.45 | 3081.14 | 1477.79 |


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jan 23, 2025
vllmellm and others added 2 commits January 23, 2025 08:50
Ported ROCm/vllm changes to upstream vLLM

This commit manually ports changes from ROCm/vllm (ROCm#372) to upstream vLLM.
The original work was done by sanyalington.

Co-authored-by: sanyalington <[email protected]>

Signed-off-by: vllmellm <[email protected]>
@tjtanaa tjtanaa force-pushed the port-rocm-cpa-credit branch 2 times, most recently from 9be5f70 to f57dcb9 on January 23, 2025 08:57
@tjtanaa tjtanaa force-pushed the port-rocm-cpa-credit branch from f57dcb9 to 4f71b54 on January 23, 2025 09:01
@tjtanaa tjtanaa changed the title [AMD] Faster Custom Paged Attention kernels [ROCm] Faster Custom Paged Attention kernels Jan 23, 2025
@tjtanaa
Contributor Author

tjtanaa commented Jan 23, 2025

Regarding the API changes to paged_attention in csrc/rocm/torch_bindings.cpp: this change only affects the ROCm code path and does not interfere with the code paths of other platforms.

 rocm_ops.def(
      "paged_attention(Tensor! out, Tensor exp_sums,"
      "                Tensor max_logits, Tensor tmp_out,"
      "                Tensor query, Tensor key_cache,"
      "                Tensor value_cache, int num_kv_heads,"
      "                float scale, Tensor block_tables,"
      "                Tensor context_lens, int block_size,"
      "                int max_context_len,"
      "                Tensor? alibi_slopes,"
      "                str kv_cache_dtype,"
      "                float k_scale, float v_scale,"
      "                Tensor? fp8_out_scale,"
      "                int partition_size) -> ()");

We are seeking advice on handling the variables fp8_out_scale and partition_size.

Situation: these two variables have been introduced in the ROCm Custom Paged Attention, but they are not yet used by higher-level abstractions. They are set to fp8_out_scale=None and partition_size=256; the value partition_size=256 was found experimentally to be a good value for MI300.

Option 1:

  • Remove fp8_out_scale from csrc/rocm/attention.cu
  • Hard-code partition_size to 256 in csrc/rocm/attention.cu.
    This avoids changing the paged_attention API in csrc/rocm/torch_bindings.cpp.

Option 2:

  • Keep the variables as is, and mark a TODO: so a future update remembers to introduce an fp8 output-scaling strategy for ROCm.
  • Set fp8_out_scale=None and partition_size=256 when calling ops.paged_attention_rocm in vllm/attention/backends/rocm_flash_attn.py.

We have implemented Option 1; a sketch of what it looks like is below.
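
For reference, here is a minimal sketch of Option 1 as it might look in csrc/rocm/attention.cu; the names (kPartitionSize, maxNumPartitions) are illustrative assumptions, not the PR's exact diff:

    #include <cstdint>

    // partition_size becomes an internal constant, so the torch binding in
    // csrc/rocm/torch_bindings.cpp no longer needs to expose it. 256 was
    // found experimentally to be a good value for MI300.
    constexpr int kPartitionSize = 256;

    // Number of partitions each sequence is split into, derived internally
    // instead of being passed in by the caller.
    // e.g. maxNumPartitions(4096) == 16, maxNumPartitions(4097) == 17.
    inline int maxNumPartitions(int64_t max_context_len) {
      return static_cast<int>(
          (max_context_len + kPartitionSize - 1) / kPartitionSize);
    }

The fp8_out_scale argument is simply dropped from the kernel entry point until an fp8 output-scaling strategy lands.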

@hongxiayang hongxiayang added the rocm (Related to AMD ROCm) label Jan 23, 2025
@hongxiayang
Collaborator

@tjtanaa Please fix the DCO error:

  • Ensure you have a local copy of your branch by checking out the pull request locally via the command line.
  • In your local branch, run: git rebase HEAD~4 --signoff
  • Force-push your changes to overwrite the branch: git push --force-with-lease origin port-rocm-cpa-credit

…iminate the need for additional arguments (partition_size and fp8_output_scale) in its API.

Signed-off-by: vllmellm <[email protected]>

mergify bot commented Jan 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 24, 2025
@mergify mergify bot removed the needs-rebase label Jan 24, 2025
… and code documentation. Updated its unit test to match the correct partition size based on the paged attention version as well as the platform type.

Signed-off-by: vllmellm <[email protected]>
@tjtanaa tjtanaa marked this pull request as ready for review January 27, 2025 12:27
@tjtanaa
Contributor Author

tjtanaa commented Jan 27, 2025

@tjtanaa Please fix the DCO error: Ensure you have a local copy of your branch by checking out the pull request locally via command line. In your local branch, run: git rebase HEAD~4 --signoff Force push your changes to overwrite the branch: git push --force-with-lease origin port-rocm-cpa-credit

@hongxiayang We find that rebasing is hard because we had merged from main. In the process of fixing the DCO we had to resolve merge conflicts twice, and it would require us to test everything again. It seems there are ways to override the DCO check during merge. Could we get more input from the vLLM maintainers on the DCO issue?

@mergify mergify bot removed the needs-rebase label Feb 20, 2025

mergify bot commented Feb 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 20, 2025
@hongxiayang
Collaborator

cc @houseroad This is an important feature that should be merged asap.

Comment on lines 470 to 472
// Use int64_t for arithmetic to prevent overflow
const int64_t vglobal_token_idx =
static_cast<int64_t>(partition_start_token_idx) + vlocal_token_idx;
Member

Small nit: this would be better as the following, per the conversation above.

Suggested change:

    - // Use int64_t for arithmetic to prevent overflow
    - const int64_t vglobal_token_idx =
    -     static_cast<int64_t>(partition_start_token_idx) + vlocal_token_idx;
    + // Safe to use an int32_t here assuming we are working with < 2 billion tokens
    + const int32_t vglobal_token_idx = partition_start_token_idx + vlocal_token_idx;

Contributor Author

@tlrmchlsmth We have updated this back to int32, but we are using the int type to represent int32_t. Is that okay?

Member

Yep, I think that's fine, especially as the rest of the file uses int.

Member

@tlrmchlsmth tlrmchlsmth left a comment

LGTM now, thanks for the contribution! Please merge latest main as there is a conflict.

@mergify mergify bot removed the needs-rebase label Feb 27, 2025
@tjtanaa
Contributor Author

tjtanaa commented Feb 28, 2025

LGTM now, thanks for the contribution! Please merge latest main as there is a conflict.

@tlrmchlsmth We have resolved the merge conflict. Thank you.

@tlrmchlsmth
Member

@tjtanaa could you take a look at the pre-commit? It's failing as well

Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
@vllm-bot vllm-bot merged commit 848a643 into vllm-project:main Mar 3, 2025
57 of 60 checks passed
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
@tjtanaa tjtanaa deleted the port-rocm-cpa-credit branch May 16, 2025 16:27

Labels

ci/build, documentation, frontend, ready, rocm, speculative-decoding
