
[RFC][Feature]: Unified Auto-Selection Mechanism for Attention Backends #21805

@vllmellm

🚀 The feature, motivation and pitch

Summary

This RFC proposes introducing a robust, unified mechanism for automatic selection of attention backends in vLLM, initially focused on ROCm, with a roadmap for extending support to all other major backends. The primary goal is to simplify user experience and ensure optimal backend selection based on hardware, available dependencies, and configuration, while retaining the ability for manual override when needed.

Motivation

Manually configuring and selecting the most suitable attention backend has been a significant pain point for ROCm users. The best-performing attention backend, AiterFlashAttention, requires explicit activation via the environment variable VLLM_ROCM_USE_AITER (False by default). This was necessary while the AiterFlashAttentionBackend implementation was “experimental”; however, it has since been validated and tested on many models and achieves better performance than the Triton implementation. Separately, the TritonAttentionBackend supports two different modes, a “unified attention” mode and a “split prefill decode” mode, with the former being the default despite the latter being faster. The result is an awkward selection scheme for attention backends on ROCm: vLLM defaults to the slowest backend, and users must opt in to use the fastest one.

# To use the fastest backend
VLLM_ROCM_USE_AITER=1
# To use the second fastest backend (in case Aiter does not support the model configuration)
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1

For CUDA backends, the selection process is somewhat more automated: vLLM attempts to detect available backends and select the optimal one automatically if the user does not manually override the choice. This reduces manual configuration for end-users on CUDA systems and provides a better default experience.
However, from a code architecture perspective, the current implementation is fragmented and messy:

  • There is redundant or inconsistent handling of fallback and error behaviors across backends.
  • Extending or maintaining the current logic for CUDA (and especially for new backends) is difficult and introduces risk of subtle bugs or priority-handling mistakes.
  • The user experience for overrides, error messaging, and fallback is not uniform between CUDA and ROCm.

Therefore, while CUDA users are less affected by manual configuration, the need for a unified, modular, and maintainable selection mechanism is just as strong for CUDA as for ROCm.

Proposal

Recently, PR #20699 introduced improvements for backend management in vLLM. This PR added:

  • An IsSupported class/dataclass, which encapsulates whether a backend is supported for a given configuration, as well as detailed reasons and metadata in unsupported cases.
  • An is_attn_backend_supported method, providing a programmatic, self-documenting, and extensible way to check backend support for arbitrary combinations of parameters on any given backend.
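
As a rough illustration of the shape this interface takes, here is a simplified, hypothetical sketch; the field names, class attributes, and helper signature below are illustrative assumptions, not the exact API introduced in #20699:

from dataclasses import dataclass, field


@dataclass
class IsSupported:
    # Whether the backend can be used for the queried configuration.
    supported: bool
    # Human-readable diagnostics for any unsupported dimensions
    # (head size, dtype, device capability, block size, ...).
    reasons: list[str] = field(default_factory=list)

    def __bool__(self) -> bool:
        return self.supported


def is_attn_backend_supported(backend_cls, head_size: int, dtype) -> IsSupported:
    # Hypothetical check: each backend class advertises its constraints,
    # here via SUPPORTED_HEAD_SIZES / SUPPORTED_DTYPES class attributes.
    reasons = []
    if head_size not in getattr(backend_cls, "SUPPORTED_HEAD_SIZES", []):
        reasons.append(f"head_size={head_size} is not supported")
    if dtype not in getattr(backend_cls, "SUPPORTED_DTYPES", []):
        reasons.append(f"dtype={dtype} is not supported")
    return IsSupported(supported=not reasons, reasons=reasons)

A caller can branch on the boolean result while still surfacing reasons in logs or error messages, which is what makes the diagnostics extensible. Building on this, the proposal is as follows: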

  1. Extending IsSupported & is_attn_backend_supported for Unified Selection:
  • Extend IsSupported to include additional model configuration details, such as device capability, block size, etc.
  • Abstract backend support checks: encourage all backend implementations to supply or register their constraints and capability-check logic via a standard interface returning IsSupported.
  • Extensible diagnostics: continue to expose all “unsupported reasons” to users to ensure issues are easy to debug and PR feedback is actionable.
  2. Unified Auto-Selection Logic (see the sketch after this list):
  • Implement a modular backend selection method in vLLM that:
    • Detects available hardware and dependencies at runtime.
    • Determines the list of valid/supported backends for the current environment using the is_attn_backend_supported method.
    • Selects the highest-priority backend that is compatible with the user and model configuration (similar to the linear kernel selection logic implemented in [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin #7701 and [TPU][Quantization] TPU W8A8 #11785).
    • Provides clear log/warning messages for fallback paths, unsupported cases, and override scenarios.
  3. Consistent Environment Variable Handling:
  • Allow manual override via the environment variable VLLM_ATTENTION_BACKEND, but provide:
    • Immediate validation of the requested backend.
    • Clear error messages if the backend is unavailable or unsupported with the current settings.
    • Automatic fallback to auto-selection if the manual override fails.
  4. Testing and Debugging:
  • Provide unit and integration tests for the selection logic across different platforms and configuration scenarios.
  • Facilitate backend selection debuggability via clear and actionable logging.
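
To make the intended flow concrete, the following is a minimal sketch of how the priority-based auto-selection and the VLLM_ATTENTION_BACKEND override handling could fit together. The backend names, priority order, and the check_support() helper are illustrative assumptions, not the final vLLM implementation:

import logging
import os
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class IsSupported:  # simplified version of the earlier sketch
    supported: bool
    reasons: list[str] = field(default_factory=list)

    def __bool__(self) -> bool:
        return self.supported


# Highest priority first; in practice this list would be built per platform
# (ROCm, CUDA, CPU) from backends registered through the standard interface.
BACKEND_PRIORITY = ["AITER_FA", "TRITON_SPLIT_PD", "TRITON_UNIFIED"]


def check_support(name: str, config: dict) -> IsSupported:
    # Placeholder capability check; a real implementation would call each
    # backend's is_attn_backend_supported()-style hook with the full config.
    if name == "AITER_FA" and config.get("head_size") not in (64, 128):
        return IsSupported(False, [f"head_size={config.get('head_size')} unsupported"])
    return IsSupported(True)


def select_attn_backend(config: dict) -> str:
    # 1. Manual override via VLLM_ATTENTION_BACKEND, validated immediately.
    override = os.environ.get("VLLM_ATTENTION_BACKEND")
    if override is not None:
        result = check_support(override, config)
        if result:
            return override
        logger.warning("Requested backend %s is unsupported (%s); falling back"
                       " to auto-selection.", override, "; ".join(result.reasons))

    # 2. Auto-selection: pick the first supported backend in priority order.
    for name in BACKEND_PRIORITY:
        result = check_support(name, config)
        if result:
            logger.info("Auto-selected attention backend: %s", name)
            return name
        logger.debug("Skipping %s: %s", name, "; ".join(result.reasons))

    raise RuntimeError("No compatible attention backend found for this "
                       "hardware/configuration.")

The key design point is that every rejection carries its reasons, so the fallback path, the override path, and the final error all produce the same kind of actionable message.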

Implementation Plan

  • Merge/improve [ROCm] Auto-Select Attention Backend #21366 as the initial implementation.
  • Implement selection policy for ROCm (MLA), CUDA, CPU, and additional backends.
  • Update documentation to describe the selection mechanism, environment variables, and manual overrides.
  • Solicit community feedback from users and maintainers of vLLM.

Alternatives

  1. Continue with per-backend selection logic. Rejected for maintenance complexity and poor user experience.

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
