[ROCm] Faster Custom Paged Attention kernels #12348
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Ported ROCm/vllm changes to upstream vLLM. This commit manually ports changes from ROCm/vllm (ROCm#372) to upstream vLLM. The original work was done by sanyalington. Co-authored-by: sanyalington <[email protected]> Signed-off-by: vllmellm <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Force-pushed from 9be5f70 to f57dcb9.
Signed-off-by: sanyalington <[email protected]>
Force-pushed from f57dcb9 to 4f71b54.
Regarding the API changes: seeking advice on handling the variables.
@tjtanaa Please fix the DCO error.
…eliminate the need for additional arguments (partition_size and fp8_output_scale) in its API. Signed-off-by: vllmellm <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
… and code documentation. Updated its unit test to match the correct partition size based on paged attention versions as well as platform type. Signed-off-by: vllmellm <[email protected]>
@hongxiayang We find that rebasing is hard, as we had merged from
Signed-off-by: poyenc <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: vllmellm <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
cc @houseroad This is an important feature that should be merged asap.
Signed-off-by: tjtanaa <[email protected]>
csrc/rocm/attention.cu
Outdated
// Use int64_t for arithmetic to prevent overflow
const int64_t vglobal_token_idx =
    static_cast<int64_t>(partition_start_token_idx) + vlocal_token_idx;
Small nit: this would be better as the following, per the conversation above.
Suggested change:
- // Use int64_t for arithmetic to prevent overflow
- const int64_t vglobal_token_idx =
-     static_cast<int64_t>(partition_start_token_idx) + vlocal_token_idx;
+ // Safe to use an int32_t here assuming we are working with < 2 billion tokens
+ const int32_t vglobal_token_idx = partition_start_token_idx + vlocal_token_idx;
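For context, here is a minimal standalone C++ sketch (not part of the PR; the values are made up for illustration) of the arithmetic being discussed: plain int addition wraps once the sum exceeds 2^31 - 1, widening one operand to int64_t keeps it exact, and the suggested int32_t form relies on the total token count staying well below ~2 billion.

#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical values chosen so that their sum exceeds INT_MAX (2'147'483'647).
  const int partition_start_token_idx = 2'000'000'000;
  const int vlocal_token_idx = 200'000'000;

  // 32-bit arithmetic would overflow here (signed overflow is undefined behavior):
  // const int bad_idx = partition_start_token_idx + vlocal_token_idx;

  // Widening one operand promotes the addition to 64-bit, which is exact.
  const int64_t vglobal_token_idx =
      static_cast<int64_t>(partition_start_token_idx) + vlocal_token_idx;

  std::printf("vglobal_token_idx = %lld\n",
              static_cast<long long>(vglobal_token_idx));
  return 0;
}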
@tlrmchlsmth We have updated this back to int32, but we are using the int type to represent int32_t. Is that ok?
Yep, I think that's fine, especially as the rest of the file uses int.
LGTM now, thanks for the contribution! Please merge latest main as there is a conflict.
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
@tlrmchlsmth We have resolved the merge conflict. Thank you.
@tjtanaa could you take a look at the pre-commit? It's failing as well.
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
Description
This PR implements a faster Custom Paged Attention (CPA) kernel based on mfma16x16x16 instructions. This feature is from ROCm/vllm (ROCm#372).
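For readers unfamiliar with the instruction family, below is a minimal HIP sketch of a single 16x16x16 fp16 MFMA issue. It is illustrative only: the kernel name, the flat load indexing, and the one-block launch hint are assumptions made for brevity, and the PR's kernel uses its own tiling and fragment layout. It assumes ROCm clang's __builtin_amdgcn_mfma_f32_16x16x16f16 builtin on a CDNA GPU such as MI300X.

#include <hip/hip_runtime.h>

// 4-wide vector types matching the operand shapes of the 16x16x16 fp16 MFMA.
typedef _Float16 halfx4 __attribute__((ext_vector_type(4)));
typedef float floatx4 __attribute__((ext_vector_type(4)));

// One wave64 wavefront issues a single MFMA: each lane supplies 4 fp16
// elements of A and B and accumulates 4 fp32 elements of D, so the 64 lanes
// together cover a full 16x16 tile. The flat lane*4 indexing below is a
// simplified load pattern for illustration, not the hardware's actual
// fragment-to-matrix mapping.
// Launch with a single 64-thread block, e.g. mfma_16x16x16_demo<<<1, 64>>>(a, b, d);
__global__ void mfma_16x16x16_demo(const _Float16* a, const _Float16* b,
                                   float* d) {
  const int lane = threadIdx.x & 63;  // lane id within the wavefront
  halfx4 a_frag, b_frag;
  floatx4 acc = {0.f, 0.f, 0.f, 0.f};
  for (int i = 0; i < 4; ++i) {
    a_frag[i] = a[lane * 4 + i];
    b_frag[i] = b[lane * 4 + i];
  }
  // D = A (16x16 fp16) * B (16x16 fp16) + C (16x16 fp32) in one instruction.
  acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc, 0, 0, 0);
  for (int i = 0; i < 4; ++i) {
    d[lane * 4 + i] = acc[i];
  }
}

One such instruction performs a complete 16x16x16 matrix multiply-accumulate per wavefront on the matrix cores, which is the building block the description above refers to.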
End-to-End Performance gain
Model: Llama-3.1-70B-Instruct
Tensor Parallelism: 1
GPU: MI300X