[ROCm] Auto-Select Attention Backend #21366
base: main
Conversation
Signed-off-by: vllmellm <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
The pull request refactors the ROCm paged-attention code. The changes include adding new parameters to the _IsSupported dataclass and the is_attn_backend_supported function in vllm/attention/selector.py, modifying environment variables in vllm/envs.py, and updating the attention backend selection logic in vllm/platforms/rocm.py. Additionally, new files and modifications are introduced to handle attention backends in vllm/v1/attention/backends/.
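The structured support check described above might look roughly like the sketch below. This is illustrative only: the actual fields added to _IsSupported in this PR are not shown in the review, so the field names, parameters, and supported values here are assumptions.

```python
from dataclasses import dataclass


# Hypothetical sketch of the _IsSupported pattern: a structured support
# report instead of a bare bool, so callers can see *why* a backend
# was rejected. Field names are illustrative assumptions.
@dataclass
class IsSupported:
    can_import: bool   # backend module imports cleanly
    head_size: bool    # requested head size is supported
    dtype: bool        # requested dtype is supported

    def __bool__(self) -> bool:
        return self.can_import and self.head_size and self.dtype


def is_attn_backend_supported(
    head_size: int,
    dtype: str,
    supported_head_sizes=(64, 128, 256),
    supported_dtypes=("float16", "bfloat16"),
) -> IsSupported:
    """Return a structured report of which support checks passed."""
    return IsSupported(
        can_import=True,  # real code would try importing the backend module
        head_size=head_size in supported_head_sizes,
        dtype=dtype in supported_dtypes,
    )
```

A caller can then use the report in a boolean context, or inspect individual fields to produce a precise error message.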
Signed-off-by: vllmellm <[email protected]>
def choose_attention_backend(
This should be inside selector.py IMO. Also, it is rather confusing that not all existing backends are considered in this function.
For now, we are only considering the attention backends used with ROCm on V1. We are considering supporting all attention backends in future PRs; in the meantime, we will add some comments to clarify this for other developers.
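A minimal sketch of what a ROCm-V1-only selection function could look like, assuming a simple priority order. The parameter names and the ordering below are illustrative assumptions, not the PR's actual logic; only the backend names come from this discussion.

```python
def choose_attention_backend(use_aiter: bool, unified_attention: bool) -> str:
    """Illustrative selection among the ROCm V1 backends named in this PR.

    Assumed priority: AITER flash attention first when enabled, then the
    unified Triton kernel, then the split prefill/decode Triton fallback.
    """
    if use_aiter:
        return "AiterFlashAttentionBackend"
    if unified_attention:
        return "TritonUnifiedAttentionBackend"
    return "TritonSplitPrefillDecodeAttentionBackend"
```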
vllm/envs.py
Outdated
# and performance comparisons. Currently only affects attention backends
# that run on ROCm (backends: AiterFlashAttentionBackend,
# TritonSplitPrefillDecodeAttentionBackend, TritonUnifiedAttentionBackend)
"VLLM_DISABLED_BACKENDS":
I think that it is more straightforward to set VLLM_ATTENTION_BACKEND directly. If this is only used to help test the attention selector, we can directly patch global variables.
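Patching the environment in a test, as suggested here, needs no vLLM machinery. The sketch below uses unittest.mock.patch.dict; the parsing helper and the comma-separated semantics of the variable are assumptions based on the env-var comment in the diff above, not the PR's actual implementation.

```python
import os
from unittest import mock


def read_disabled_backends() -> list[str]:
    """Parse a comma-separated backend list from an env var (assumed format)."""
    raw = os.environ.get("VLLM_DISABLED_BACKENDS", "")
    return [b.strip() for b in raw.split(",") if b.strip()]


# In a test, patch the environment directly; patch.dict restores the
# original environment when the context exits.
with mock.patch.dict(
    os.environ, {"VLLM_DISABLED_BACKENDS": "AiterFlashAttentionBackend"}
):
    assert read_disabled_backends() == ["AiterFlashAttentionBackend"]
```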
One of the motivations is to follow the kernels abstraction, where a developer can define which attention backend can run on which hardware based on the dependencies available in the environment. That way, we can always pick the fastest backend for a given piece of hardware instead of always defaulting to the Triton implementation.
The abstraction also makes it clear to developers and users where to find the custom logic that defines the default behavior of vLLM's attention backend selection.
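The "pick the fastest available backend" idea could be expressed as a priority list with availability checks, as sketched below. The backend names come from this PR; the availability predicates and the env dict are stand-ins for real dependency probes (e.g. trying to import a kernel package).

```python
# Priority-ordered (name, availability check) pairs: fastest first,
# with an always-available Triton fallback last.
PRIORITY = [
    ("AiterFlashAttentionBackend",
     lambda env: env.get("aiter_installed", False)),
    ("TritonUnifiedAttentionBackend",
     lambda env: env.get("triton_installed", False)),
    ("TritonSplitPrefillDecodeAttentionBackend",
     lambda env: True),  # fallback
]


def pick_backend(env: dict) -> str:
    """Return the first backend whose availability check passes."""
    for name, available in PRIORITY:
        if available(env):
            return name
    raise RuntimeError("no attention backend available")
```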
…ropriate file Signed-off-by: vllmellm <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: vllmellm <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 1e89f73 to 1ba6982
Signed-off-by: vllmellm <[email protected]>
pass

@classmethod
def validate_device_capabality(cls) -> None:
IMO this part should be handled by the platform
Or at least, it needs to accept the platform being used
Or at least, it needs to accept the platform being used
@DarkLight1337 @vllmellm
I think the second approach is better; it delegates the responsibility to the Attention class itself. All checks for whether a backend is supported should be centralized in the class itself.
platform should just be a place to retrieve platform information; it should not determine whether an attention backend can run or not.
This can improve readability and maintainability.
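A sketch of the delegation being proposed: the support check lives in the backend class, and the platform object only supplies device information. The class names, the MIN_CAPABILITY attribute, and the capability values below are illustrative stand-ins, not vLLM's actual API.

```python
class Platform:
    """Minimal stand-in that only exposes device information."""

    def __init__(self, major: int, minor: int):
        self.major, self.minor = major, minor

    def get_device_capability(self) -> tuple[int, int]:
        return (self.major, self.minor)


class AttentionBackend:
    # Assumed attribute: minimum device capability this backend needs.
    MIN_CAPABILITY = (9, 0)

    @classmethod
    def validate_device_capability(cls, platform: Platform) -> None:
        """The check lives in the backend; the platform only supplies data."""
        if platform.get_device_capability() < cls.MIN_CAPABILITY:
            raise ValueError(
                f"{cls.__name__} requires capability >= {cls.MIN_CAPABILITY}"
            )
```

With this shape, adding a new backend means overriding MIN_CAPABILITY (or the method) in one place, rather than growing a per-backend branch inside the platform code.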
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
The use of environment variables, especially for Aiter kernels on ROCm, has been a pain point for some users, as mentioned in #21138.
This PR introduces:
Additionally, the attention selection logic in this PR maintains the ability to force a backend through the VLLM_ATTENTION_BACKEND variable, allowing users to easily switch backends. Although the selection is implemented for ROCm hardware only, it can be extended to other hardware platforms in future PRs.
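For example, forcing a backend from the shell might look like the sketch below. The backend value shown is illustrative; the exact names accepted by VLLM_ATTENTION_BACKEND depend on the vLLM build.

```shell
# Force the attention backend instead of relying on auto-selection
# (backend name is illustrative; vLLM reads this variable at startup).
export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
python -c 'import os; print(os.environ["VLLM_ATTENTION_BACKEND"])'
```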
Test Plan
Implement a unit test for the backend selection function. To run, use the following command
Test Result