
[RFC][Feature]: Unified Auto-Selection Mechanism for Attention Backends #21805

@vllmellm

🚀 The feature, motivation and pitch

Summary

This RFC proposes introducing a robust, unified mechanism for automatic selection of attention backends in vLLM, initially focused on ROCm, with a roadmap for extending support to all other major backends. The primary goal is to simplify user experience and ensure optimal backend selection based on hardware, available dependencies, and configuration, while retaining the ability for manual override when needed.

Motivation

Manually configuring and selecting the most suitable attention backend has been a significant pain point for ROCm users. The best-performing attention backend, AiterFlashAttention, requires explicit activation via the environment variable VLLM_ROCM_USE_AITER (False by default). This was necessary while the AiterFlashAttentionBackend implementation was “experimental”; however, it has since been validated and tested on many models and achieves better performance than the Triton implementation. Separately, the TritonAttentionBackend supports two different modes, a “unified attention” mode and a “split prefill decode” mode, with the former being the default despite the latter being faster. The result is an awkward selection scheme for attention backends on ROCm: vLLM defaults to the slowest backend, and users must opt in to use the fastest one.

# To use the fastest backend
VLLM_ROCM_USE_AITER=1
# To use the second fastest backend (in case Aiter does not support the model configuration)
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1

For CUDA backends, the selection process is somewhat more automated: vLLM attempts to detect available backends and select the optimal one automatically if the user does not manually override the choice. This reduces manual configuration for end-users on CUDA systems and provides a better default experience.
However, from a code architecture perspective, the current implementation is fragmented and messy:

  • There is redundant or inconsistent handling of fallback and error behaviors across backends.
  • Extending or maintaining the current logic for CUDA (and especially for new backends) is difficult and introduces risk of subtle bugs or priority-handling mistakes.
  • The user experience for overrides, error messaging, and fallback is not uniform between CUDA and ROCm.

Therefore, while CUDA users are less affected by manual configuration, the need for a unified, modular, and maintainable selection mechanism is just as strong for CUDA as for ROCm.

Proposal

Recently, PR #20699 introduced improvements for backend management in vLLM. This PR added:

  • An IsSupported class/dataclass, which encapsulates whether a backend is supported for a given configuration, as well as detailed reasons and metadata in unsupported cases.
  • An is_attn_backend_supported method, providing a programmatic, self-documenting, and extensible way to check backend support for arbitrary combinations of parameters on any given backend.
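
As a rough illustration of the shape this interface takes, here is a simplified, hypothetical sketch; the field names, class attributes, and helper signature below are illustrative assumptions, not the exact API introduced in #20699:

from dataclasses import dataclass, field


@dataclass
class IsSupported:
    # Whether the backend can be used for the queried configuration.
    supported: bool
    # Human-readable diagnostics for any unsupported dimensions
    # (head size, dtype, device capability, block size, ...).
    reasons: list[str] = field(default_factory=list)

    def __bool__(self) -> bool:
        return self.supported


def is_attn_backend_supported(backend_cls, head_size: int, dtype) -> IsSupported:
    # Hypothetical check: each backend class advertises its constraints,
    # here via SUPPORTED_HEAD_SIZES / SUPPORTED_DTYPES class attributes.
    reasons = []
    if head_size not in getattr(backend_cls, "SUPPORTED_HEAD_SIZES", []):
        reasons.append(f"head_size={head_size} is not supported")
    if dtype not in getattr(backend_cls, "SUPPORTED_DTYPES", []):
        reasons.append(f"dtype={dtype} is not supported")
    return IsSupported(supported=not reasons, reasons=reasons)

A caller can branch on the boolean result while still surfacing reasons in logs or error messages, which is what makes the diagnostics extensible. Building on this, the proposal is as follows: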

  1. Extending IsSupported & is_attn_backend_supported for Unified Selection:
  • Extend IsSupported to include additional model configuration details, such as device capability, block size, etc.
  • Abstract backend support checks: encourage all backend implementations to supply or register their constraints and capability-check logic via a standard interface returning IsSupported.
  • Extensible diagnostics: continue to expose all “unsupported reasons” to users to ensure issues are easy to debug and PR feedback is actionable.
  2. Unified Auto-Selection Logic (see the sketch after this list):
  • Implement a modular backend selection method in vLLM that:
    • Detects available hardware and dependencies at runtime.
    • Determines the list of valid/supported backends for the current environment using the is_attn_backend_supported method.
    • Selects the highest-priority backend that is compatible with the user and model configuration (similar to the linear kernel selection logic implemented in [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin #7701 and [TPU][Quantization] TPU W8A8 #11785).
    • Provides clear log/warning messages for fallback paths, unsupported cases, and override scenarios.
  3. Consistent Environment Variable Handling:
  • Allow manual override via the environment variable VLLM_ATTENTION_BACKEND, but provide:
    • Immediate validation of the requested backend.
    • Clear error messages if the backend is unavailable or unsupported with the current settings.
    • Automatic fallback to auto-selection if the manual override fails.
  4. Testing and Debugging:
  • Provide unit and integration tests for the selection logic across different platforms and configuration scenarios.
  • Facilitate backend selection debuggability via clear and actionable logging.
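
To make the intended flow concrete, the following is a minimal sketch of how the priority-based auto-selection and the VLLM_ATTENTION_BACKEND override handling could fit together. The backend names, priority order, and the check_support() helper are illustrative assumptions, not the final vLLM implementation:

import logging
import os
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class IsSupported:  # simplified version of the earlier sketch
    supported: bool
    reasons: list[str] = field(default_factory=list)

    def __bool__(self) -> bool:
        return self.supported


# Highest priority first; in practice this list would be built per platform
# (ROCm, CUDA, CPU) from backends registered through the standard interface.
BACKEND_PRIORITY = ["AITER_FA", "TRITON_SPLIT_PD", "TRITON_UNIFIED"]


def check_support(name: str, config: dict) -> IsSupported:
    # Placeholder capability check; a real implementation would call each
    # backend's is_attn_backend_supported()-style hook with the full config.
    if name == "AITER_FA" and config.get("head_size") not in (64, 128):
        return IsSupported(False, [f"head_size={config.get('head_size')} unsupported"])
    return IsSupported(True)


def select_attn_backend(config: dict) -> str:
    # 1. Manual override via VLLM_ATTENTION_BACKEND, validated immediately.
    override = os.environ.get("VLLM_ATTENTION_BACKEND")
    if override is not None:
        result = check_support(override, config)
        if result:
            return override
        logger.warning("Requested backend %s is unsupported (%s); falling back"
                       " to auto-selection.", override, "; ".join(result.reasons))

    # 2. Auto-selection: pick the first supported backend in priority order.
    for name in BACKEND_PRIORITY:
        result = check_support(name, config)
        if result:
            logger.info("Auto-selected attention backend: %s", name)
            return name
        logger.debug("Skipping %s: %s", name, "; ".join(result.reasons))

    raise RuntimeError("No compatible attention backend found for this "
                       "hardware/configuration.")

The key design point is that every rejection carries its reasons, so the fallback path, the override path, and the final error all produce the same kind of actionable message.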

Implementation Plan

  • Merge/improve [ROCm] Auto-Select Attention Backend #21366 as the initial implementation.
  • Implement selection policy for ROCm (MLA), CUDA, CPU, and additional backends.
  • Update documentation to describe the selection mechanism, environment variables, and manual overrides.
  • Solicit community feedback from users and maintainers of vLLM.

Alternatives

  1. Continue with per-backend selection logic. Rejected for maintenance complexity and poor user experience.

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
