
Conversation

vllmellm
Contributor

@vllmellm vllmellm commented Jul 22, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

The use of environment variables, especially for Aiter kernels on ROCm, has been a pain point for some users, as mentioned in #21138.

This PR introduces:

  • Selection of attention backends for ROCm based on priority (performance) and support, instead of environment variables.
  • Graceful handling of unsupported attention backends.

Additionally, the attention selection logic in this PR maintains the ability to force a backend through the VLLM_ATTENTION_BACKEND environment variable, allowing users to easily switch backends.

Although the selection is implemented for ROCm hardware only, it can be extended to other hardware platforms in future PRs.
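
As a rough illustration of the intended selection flow, a minimal sketch follows; the backend names and the is_supported helper are placeholders for this sketch, not the actual vLLM implementation.

import os

# Illustrative priority order: fastest backend first. These names are
# placeholders, not real vLLM backend identifiers.
ROCM_BACKEND_PRIORITY = ["AITER_FA", "TRITON_SPLIT_PREFILL_DECODE", "TRITON_UNIFIED"]

def is_supported(backend: str) -> bool:
    # Placeholder check; a real check would inspect the GPU architecture
    # and whether the required kernel packages are installed.
    return backend.startswith("TRITON")

def choose_backend() -> str:
    # A backend forced through VLLM_ATTENTION_BACKEND still takes priority.
    forced = os.environ.get("VLLM_ATTENTION_BACKEND")
    if forced:
        if not is_supported(forced):
            # Graceful handling: fail with a clear message instead of
            # crashing deep inside the kernels.
            raise ValueError(f"Forced backend {forced!r} is not supported on this device.")
        return forced
    # Otherwise walk the priority list and pick the first supported backend.
    for backend in ROCM_BACKEND_PRIORITY:
        if is_supported(backend):
            return backend
    raise RuntimeError("No supported attention backend found.")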

Test Plan

Implement unit tests for the backend selection function. To run them, use the following command:

pytest tests/attention/test_attention_selector.py

Test Result

tests/attention/test_attention_selector.py ................                                                               [100%]

====================================================== 16 passed in 2.85s =======================================================

@mergify mergify bot added the rocm (Related to AMD ROCm) and v1 labels Jul 22, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request refactors the ROCm paged attention code. The changes include adding new parameters to the _IsSupported dataclass and the is_attn_backend_supported function in vllm/attention/selector.py, modifying environment variables in vllm/envs.py, and updating the attention backend selection logic in vllm/platforms/rocm.py. Additionally, new files and modifications are introduced to handle attention backends in vllm/v1/attention/backends/.

def choose_attention_backend(
Member


This should be inside selector.py imo. Also it is rather confusing that not all existing backends are considered in this function.

Contributor Author


For now, we are only considering attention backends used with ROCm on V1. We plan to support all attention backends in future PRs; in the meantime, we will add comments to clarify this for other developers.

vllm/envs.py Outdated
# and performance comparisons. Currently only affects attention backends
# that run on ROCm (backends: AiterFlashAttentionBackend,
# TritonSplitPrefillDecodeAttentionBackend, TritonUnifiedAttentionBackend)
"VLLM_DISABLED_BACKENDS":
Member


I think that it is more straightforward to set VLLM_ATTENTION_BACKEND directly. If this is only used to help test the attention selector, we can directly patch global variables.
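
For reference, a minimal sketch of patching the module-level value in a test rather than the process environment, assuming pytest's monkeypatch fixture; whether vllm.envs exposes the value as a plain attribute (it is resolved lazily) and the backend name used here are assumptions of this sketch.

import pytest
import vllm.envs as envs

def test_forced_backend_selection(monkeypatch: pytest.MonkeyPatch):
    # Patch the module-level variable directly instead of os.environ;
    # raising=False because the attribute may not exist until first access.
    monkeypatch.setattr(envs, "VLLM_ATTENTION_BACKEND", "TRITON_ATTN_VLLM_V1", raising=False)
    assert envs.VLLM_ATTENTION_BACKEND == "TRITON_ATTN_VLLM_V1"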

Contributor

@tjtanaa tjtanaa Jul 22, 2025


One of the motivations is to follow the kernels abstraction, where developers can define which attention backend can run on which hardware based on the dependencies available in the environment. This way, we can always pick the fastest backend for a given hardware platform instead of always defaulting to the Triton implementation.

The abstraction also makes it clear to developers and users where to find the custom logic that defines the default behavior of vLLM's attention backend selection.
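
One way to picture that abstraction in code (all class and function names below are purely illustrative, not taken from the PR):

import importlib.util
from dataclasses import dataclass

@dataclass
class BackendSpec:
    # Illustrative record of where a backend can run and what it depends on.
    name: str
    supported_archs: tuple
    required_modules: tuple = ()

    def can_run(self, arch: str) -> bool:
        if arch not in self.supported_archs:
            return False
        # Usable only if every required dependency is importable.
        return all(importlib.util.find_spec(m) is not None for m in self.required_modules)

# Fastest first, so Triton becomes the fallback rather than the unconditional default.
SPECS = [
    BackendSpec("AITER_FA", ("gfx942",), ("aiter",)),
    BackendSpec("TRITON_UNIFIED", ("gfx90a", "gfx942"), ("triton",)),
]

def pick_backend(arch: str) -> str:
    for spec in SPECS:
        if spec.can_run(arch):
            return spec.name
    raise RuntimeError(f"No attention backend available for {arch}")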

@mergify mergify bot added the ci/build label Jul 25, 2025

mergify bot commented Jul 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 25, 2025
@mergify mergify bot removed the needs-rebase label Jul 25, 2025
@vllmellm vllmellm marked this pull request as ready for review July 25, 2025 11:37

mergify bot commented Jul 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 25, 2025
@vllmellm vllmellm marked this pull request as draft July 25, 2025 17:11
@mergify mergify bot added the documentation (Improvements or additions to documentation), deepseek (Related to DeepSeek models), frontend, llama (Related to Llama models), multi-modality (Related to multi-modality (#4194)), new-model (Requests to new models), performance (Performance-related issues), qwen (Related to Qwen models), structured-output, speculative-decoding, and tpu (Related to Google TPUs) labels Aug 4, 2025
@mergify mergify bot added the tool-calling label Aug 4, 2025
@mergify mergify bot removed the tpu Related to Google TPUs label Aug 4, 2025
@mergify mergify bot removed the needs-rebase label Aug 5, 2025
@vllmellm vllmellm marked this pull request as ready for review August 5, 2025 09:00
@vllmellm
Contributor Author

vllmellm commented Aug 5, 2025

pass

@classmethod
def validate_device_capabality(cls) -> None:
Member


IMO this part should be handled by the platform

Member


Or at least, it needs to accept the platform being used

Contributor


Or at least, it needs to accept the platform being used
@DarkLight1337 @vllmellm
I think the second approach is better: it delegates the responsibility to the Attention backend class itself. All checks for whether a backend is supported should be centralized in that class.
The platform should just be a place to retrieve platform information; it should not determine whether an attention backend can run or not.

This can improve readability and maintainability.
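
A rough sketch of that second approach, with the backend's own classmethod accepting the platform as an argument (all names here are illustrative, not the actual vLLM interfaces):

class Platform:
    # Minimal stand-in: the platform only carries device information.
    def __init__(self, device_name: str, capability: tuple):
        self.device_name = device_name
        self.capability = capability

class SomeAttentionBackend:
    MIN_CAPABILITY = (9, 4)  # illustrative requirement

    @classmethod
    def validate_device_capability(cls, platform: Platform) -> None:
        # The backend owns its support rules; the platform just supplies data.
        if platform.capability < cls.MIN_CAPABILITY:
            raise ValueError(
                f"{cls.__name__} requires device capability >= {cls.MIN_CAPABILITY}, "
                f"got {platform.capability} on {platform.device_name}")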


mergify bot commented Aug 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

ci/build, deepseek (Related to DeepSeek models), documentation (Improvements or additions to documentation), frontend, llama (Related to Llama models), multi-modality (Related to multi-modality (#4194)), needs-rebase, new-model (Requests to new models), performance (Performance-related issues), qwen (Related to Qwen models), rocm (Related to AMD ROCm), speculative-decoding, structured-output, tool-calling, v1

Projects

Status: No status


3 participants