[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models #23154
Merged: LucasWilkinson merged 14 commits into vllm-project:main from heheda12345:encoder_refactor on Aug 22, 2025.
Commits (14, all by heheda12345):
- 8d7009b encoder refactor
- c86b4b7 update type hint
- c77560e rename
- 3712114 fix bug
- e806925 Merge branch 'main' of github.com:vllm-project/vllm into encoder_refa…
- 0213df5 use patch build
- 2753684 Merge branch 'main' of github.com:vllm-project/vllm into encoder_refa…
- 43c3557 fix
- 2243510 Merge branch 'main' into encoder_refactor
- f05b2dc fix
- efc68df Merge branch 'encoder_refactor' of github.com:heheda12345/vllm into e…
- 92e26d2 type:ignore
- 0b0d80e fix test
- bb76606 Merge branch 'main' of github.com:vllm-project/vllm into encoder_refa…
New file (86 added lines in the diff):

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import functools
from copy import copy
from typing import Optional

import torch

from vllm import envs
from vllm.attention.backends.abstract import (AttentionBackend,
                                              AttentionMetadata, AttentionType)
from vllm.attention.layer import Attention
from vllm.attention.selector import get_attn_backend
from vllm.config import CacheConfig
from vllm.v1.attention.backends.utils import (CommonAttentionMetadata,
                                              subclass_attention_backend)


@functools.lru_cache
def create_encoder_only_attention_backend(
    underlying_attn_backend: AttentionBackend, ) -> type[AttentionBackend]:
    prefix = "EncoderOnlyAttention_"
    underlying_builder = underlying_attn_backend.get_builder_cls()

    class EncoderOnlyAttentionBuilder(underlying_builder):  # type: ignore

        def build(self,
                  common_prefix_len: int,
                  common_attn_metadata: CommonAttentionMetadata,
                  fast_build: bool = False) -> AttentionMetadata:
            # Encoder-only attention is bidirectional, so reuse the underlying
            # builder but mark the metadata as non-causal.
            new_common_attn_metadata = copy(common_attn_metadata)
            new_common_attn_metadata.causal = False
            return super().build(common_prefix_len, new_common_attn_metadata,
                                 fast_build)

    attn_backend = subclass_attention_backend(
        name_prefix=prefix,
        attention_backend_cls=underlying_attn_backend,
        builder_cls=EncoderOnlyAttentionBuilder)

    return attn_backend


class EncoderOnlyAttention(Attention):
    """
    Encoder attention is a special case that doesn't need a KV Cache.
    """

    def __init__(self,
                 num_heads: int,
                 head_size: int,
                 scale: float,
                 cache_config: Optional[CacheConfig] = None,
                 attn_type: Optional[str] = None,
                 **kwargs):
        dtype = torch.get_default_dtype()

        if cache_config is not None:
            kv_cache_dtype = cache_config.cache_dtype
            block_size = cache_config.block_size
        else:
            kv_cache_dtype = "auto"
            block_size = 16

        if envs.VLLM_USE_V1:
            underlying_attn_backend = get_attn_backend(head_size, dtype,
                                                       kv_cache_dtype,
                                                       block_size)

            attn_backend = create_encoder_only_attention_backend(
                underlying_attn_backend)
        else:
            # In v0, encoder-only attention is handled inside the backends.
            attn_backend = None

        if attn_type is not None:
            assert attn_type == AttentionType.ENCODER_ONLY, \
                "EncoderOnlyAttention only supports AttentionType.ENCODER_ONLY"

        super().__init__(num_heads=num_heads,
                         head_size=head_size,
                         scale=scale,
                         cache_config=cache_config,
                         attn_backend=attn_backend,
                         attn_type=AttentionType.ENCODER_ONLY,
                         **kwargs)
```
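The helper `subclass_attention_backend` comes from `vllm.v1.attention.backends.utils` and is not part of this diff. As a hedged sketch of the behavior the code above relies on (an assumption about the helper, not its actual implementation), it derives a new backend class whose `get_builder_cls()` returns the wrapped builder:

```python
from vllm.attention.backends.abstract import AttentionBackend


def sketch_subclass_attention_backend(
        name_prefix: str,
        attention_backend_cls: type[AttentionBackend],
        builder_cls) -> type[AttentionBackend]:
    """Assumed behavior of subclass_attention_backend: build a dynamic
    subclass of the backend that reports the given metadata builder."""
    name = name_prefix + attention_backend_cls.__name__
    return type(name, (attention_backend_cls, ), {
        "get_builder_cls": staticmethod(lambda: builder_cls),
    })
```

With something like this, `EncoderOnlyAttention_<Backend>` behaves exactly like the underlying backend except that its metadata builder forces `causal = False`.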
Review comments:
As I mentioned earlier, any model that uses a decoder-only LLM can be converted to encoder-only attention using an unsupervised method. (It is very easy to use and the improvement is significant, so over time an increasing number of models will need to add this line of code.)
Do we really need to add EncoderOnlyAttention?
These two aspects may or may not need to be handled by this PR. Sorry for confusing you.
But during serving, should a model always run as either decoder-only or encoder-only? To make a model support both encoder-only mode and decoder mode, you can see what I did for llama and qwen in this PR.
Over time, an increasing number of models will need to add this line of code. Also, if the EncoderOnlyAttention and Attention interfaces are supposed to be exactly the same, why do we need to use EncoderOnlyAttention at all?
(My point is that the EncoderOnlyAttention functionality should become part of Attention, activated by attn_type == AttentionType.ENCODER_ONLY. That way, we only need a single Attention interface.)
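To make the two options concrete, here is a brief sketch of the difference being debated. The argument values are hypothetical, the import path of EncoderOnlyAttention is omitted because it is not shown in this view, and the flag-based call only mirrors how attn_type is typically passed to Attention; it is not code from this PR.

```python
from vllm.attention.backends.abstract import AttentionType
from vllm.attention.layer import Attention

# Single-interface option (argued for in this comment): keep using Attention
# and select encoder-only behavior through the attn_type flag.
attn_a = Attention(num_heads=12, head_size=64, scale=64**-0.5,
                   attn_type=AttentionType.ENCODER_ONLY,
                   prefix="layer_a.attn")

# Subclass option (added by this PR): a dedicated layer whose metadata builder
# forces causal=False on the underlying backend and does not need a KV cache.
attn_b = EncoderOnlyAttention(num_heads=12, head_size=64, scale=64**-0.5,
                              prefix="layer_b.attn")
```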
@noooop Even if we keep the attention interfaces the same, the model definitions would still need to be updated to include:
vllm/vllm/model_executor/models/qwen3.py, lines 184 to 187 in 7be5d11

@noooop The context is that we are overhauling a lot of the different attention layers in vLLM to make them more pluggable and backend-agnostic, and to move away from bloating the Attention class, the attention backends, and/or the gpu-model-runner with all the different schemes (a source of merge conflicts and technical debt). For this reason we are moving to more specific attention subclasses instead of flags on Attention; for example, #21588 moves from a use_irope flag on Attention to a ChunkedLocalAttention layer. With that being said, since we already have 3 models (qwen2, qwen3, and llama) with this dual decoder-only/encoder-only support, and more may come, I could see how in this specific case it could make sense to roll it into the Attention class. I think this would be one of the few exceptions to our general preference for attention layer subclasses, though. @heheda12345 I think this would be OK, but as the author I'll ultimately leave the decision up to you. I agree with you that decoder-only models are the priority for vLLM.
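The referenced qwen3.py snippet (lines 184 to 187 in 7be5d11) does not render in this view. As a hypothetical sketch of the kind of model-definition change being discussed, with a helper name and argument names that are illustrative rather than taken from the actual PR, a layer could pick its attention class based on the requested attention type:

```python
from typing import Optional

from vllm.attention.backends.abstract import AttentionType
from vllm.attention.layer import Attention
from vllm.config import CacheConfig


def make_attention_layer(num_heads: int,
                         head_size: int,
                         scale: float,
                         attn_type: str,
                         cache_config: Optional[CacheConfig] = None,
                         prefix: str = "") -> Attention:
    """Hypothetical helper: the real qwen3.py/llama.py changes inline this
    selection in the model's attention module."""
    # EncoderOnlyAttention is the class from the new file above; its import
    # path is not shown in this diff view.
    attn_cls = (EncoderOnlyAttention
                if attn_type == AttentionType.ENCODER_ONLY else Attention)
    return attn_cls(num_heads=num_heads,
                    head_size=head_size,
                    scale=scale,
                    cache_config=cache_config,
                    prefix=prefix)
```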
After careful consideration, introducing EncoderOnlyAttention does indeed have some advantages, and I am satisfied with this modification.
vLLM has too many jump wires; removing one attn_type jump wire is always good.
Thank you for your refactoring.