[V1] Support cross-layer KV sharing #18212
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
entrypoints test failure is unrelated and failing on trunk (see https://buildkite.com/vllm/fastcheck/builds/24385)
@heheda12345 could you take a look?
Sorry for my late review. Some points that I want to discuss:
1. The name "KV sharing". Do you think "reuse" is a better name? I want to discuss this more because in #17996 I need to let multiple layers share the same memory pool but with different `block_ids`, and I think we need to distinguish between the "sharing" in this PR and that PR. From my understanding, `reuse` is more accurate here because the layers are not equal: the first layer updates the KV cache and the following layers just reuse it. But open to discussion. We need to agree on a name and keep it consistent in this PR.
2. A model with KV sharing should use less memory per block than another model with the same model config but without KV sharing. Where do you implement this logic now?
3. Is KV sharing compatible with KV connectors now?
4. I think we can make KV sharing more implicit. Basically, I think it is possible to avoid changing code inside v1/core & kv_cache_interface.py. kv_cache_manager & kv_cache_utils don't need to know about KV sharing; they can run as if the layers without their own KV cache do not exist. To mimic this, we can return only layers with `kv_sharing_target_layer_idx is None` in GPUModelRunner.get_kv_cache_spec.
5. I prefer `kv_sharing_target_layer_name` over `kv_sharing_target_layer_idx` as it has no ambiguity. For example, in BART we will have both `decoder.layers.1.self_attn` and `decoder.layers.1.encoder_attn`, and both have layer index 1.
6. Add a check that KV sharing is only supported in V1.
This pull request has merge conflicts that must be resolved before it can be merged.
@heheda12345 thanks for taking a look. To answer your questions:
2. I didn't quite understand why it would be "less memory per block". I think we'll just have fewer physical KV blocks being used? Here is where the core memory savings would be coming from: not allocating a KV cache if there is a target layer for KV sharing. I might be missing some other implementation details here, let's chat offline?
3. Not at the moment, I believe.
4. I explored this design, but I remember the complexity was just offloaded to a later stage as we needed to handle KV allocation for layers without a KV cache spec anyway. But I think the APIs around KV cache groups have changed considerably since then, let me take a look again.
1. Sure, let's use "sharing". Pls unify the concept in this PR.
2. We have less physical memory per KV block, thus we can increase `num_gpu_blocks`. Where is this logic?
3. What is the blocker for making it compatible with the KV connector?
4. "as we needed to handle KV allocation for layers without a KV cache spec" - I think it may be possible to add a function in `initialize_kv_cache` to handle all the logic. Basically, that function needs to:
   - point the Attention.kv_cache to the target layer, like https://github.com/vllm-project/vllm/blob/b0d8b5968d6c2646ca9b43cd1a175adf87d39651/vllm/v1/worker/gpu_model_runner.py#L2004
   - add the shared layer to the kv cache group of its target layer to help this loop: https://github.com/vllm-project/vllm/blob/b0d8b5968d6c2646ca9b43cd1a175adf87d39651/vllm/v1/worker/gpu_model_runner.py#L620

   But I'm not sure whether I'm missing any complexity (a sketch of the second bullet appears after this comment).
5 & 6: SG!
BTW, most of the merge conflicts come from a temporary revert (#18459). I think we can just work on the current branch now without rebasing.
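For illustration, the second bullet in point 4 could look roughly like the following sketch. The `KVCacheGroup` stand-in and the function name are hypothetical, not taken from the actual code:

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheGroup:
    """Simplified stand-in for a KV cache group: just the layer names it owns."""
    layer_names: list[str] = field(default_factory=list)


def add_shared_layers_to_groups(
    groups: list[KVCacheGroup],
    shared_kv_cache_layers: dict[str, str],  # sharing layer -> target layer
) -> None:
    """Append each sharing layer to the group containing its target layer so
    that per-layer attention metadata is still built for the sharing layer."""
    for layer_name, target_name in shared_kv_cache_layers.items():
        for group in groups:
            if target_name in group.layer_names:
                group.layer_names.append(layer_name)
                break
```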
Updated to address comments.
Tried and it still works.
Great job! Really appreciate the detailed tests and input verification. I think this PR is good except for some very small items.
This pull request has merge conflicts that must be resolved before it can be merged.
@heheda12345 addressed comments. Could you take another look?
Thanks! Only some small comments.
LGTM! Thanks for your contribution.
Motivation
Some models like Tencent-Hunyuan-Large (#10043) and Hymba-1.5B-Base (#10783) use cross-layer KV sharing (e.g. Cross-Layer Attention). This PR adds the ability for KV caches to be shared between attention layers.
Design
This PR adds a new argument `kv_sharing_target_layer_name: Optional[str] = None` to the `Attention` layer class. This is only supported in V1. To have an `Attention` layer not allocate its own KV cache and instead share the KV cache with another layer (referred to as the target layer), you can pass in the fully-qualified name of the `Attention` layer in the target layer, e.g. `model.layers.0.attn`. The arg `kv_sharing_target_layer_name` is only valid if a) it refers to an `Attention` layer, b) that layer has the same attention type (e.g. decoder) as the current layer, and c) it comes before the current layer. It is referred to as the target layer because during attention, the current layer will use its own queries and perform the attention op with the key and value tensors from the KV cache of the target layer.
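For illustration, here is a minimal sketch of how a model definition could opt into sharing. The decoder-layer class, config fields, and layer naming are hypothetical; only the `kv_sharing_target_layer_name` kwarg is the API added by this PR, and the remaining `Attention` kwargs are abbreviated:

```python
import torch.nn as nn

from vllm.attention import Attention


class DecoderLayer(nn.Module):
    """Hypothetical decoder layer: the second half of the stack reuses the
    KV cache of the last layer in the first half."""

    def __init__(self, config, layer_idx: int, prefix: str):
        super().__init__()
        first_half = config.num_hidden_layers // 2
        # Layers first_half..N-1 point at the Attention module of layer
        # first_half - 1 (e.g. layers 18..35 reuse layer 17's KV cache).
        target = (f"model.layers.{first_half - 1}.attn"
                  if layer_idx >= first_half else None)
        self.attn = Attention(
            num_heads=config.num_attention_heads,
            head_size=config.head_dim,
            scale=config.head_dim**-0.5,
            num_kv_heads=config.num_key_value_heads,
            prefix=f"{prefix}.attn",
            kv_sharing_target_layer_name=target,
        )
```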
If an `Attention` layer has a valid `kv_sharing_target_layer_name` defined, then we skip creating a `KVCacheSpec` for it, while recording the mapping in `self.shared_kv_cache_layers`:
:https://github.com/vllm-project/vllm/blob/89450fc323e9eee05cbba76fb5b9a0d29f7038d8/vllm/v1/worker/gpu_model_runner.py#L2142-L2152
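The linked logic is roughly equivalent to the following stand-in. This is a sketch only: `AttnLayerInfo` and the dict-valued specs are simplifications, not the actual vLLM `KVCacheSpec` types:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttnLayerInfo:
    """Simplified stand-in for an Attention module's KV-cache-relevant fields."""
    num_kv_heads: int
    head_size: int
    kv_sharing_target_layer_name: Optional[str] = None


def build_kv_cache_specs(attn_layers: dict[str, AttnLayerInfo]):
    """Layers that declare a sharing target get no spec of their own (so no
    blocks are allocated for them); the mapping is recorded so their caches
    can later be bound to the target layer's cache."""
    specs: dict[str, dict] = {}
    shared_kv_cache_layers: dict[str, str] = {}
    for name, info in attn_layers.items():
        if info.kv_sharing_target_layer_name is not None:
            shared_kv_cache_layers[name] = info.kv_sharing_target_layer_name
            continue
        specs[name] = {"num_kv_heads": info.num_kv_heads,
                       "head_size": info.head_size}
    return specs, shared_kv_cache_layers
```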
During KV cache initialization, KV cache management logic will continue as if this layer did not exist and will not allocate a KV cache for the layer. The KV cache for these layers will instead be a reference to the allocated KV caches of the matching target layers, which enables the memory savings of cross-layer KV sharing.
https://github.com/vllm-project/vllm/blob/89450fc323e9eee05cbba76fb5b9a0d29f7038d8/vllm/v1/worker/utils.py#L107-L110
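Conceptually, the binding step in the linked helper amounts to the following (a stand-in sketch, not the verbatim code):

```python
import torch


def bind_shared_kv_caches(
    kv_caches: dict[str, torch.Tensor],
    shared_kv_cache_layers: dict[str, str],
) -> None:
    """Point each sharing layer at the tensor already allocated for its
    target layer; no additional memory is allocated for sharing layers."""
    for layer_name, target_name in shared_kv_cache_layers.items():
        kv_caches[layer_name] = kv_caches[target_name]


# Example: model.layers.1.attn reuses the cache allocated for model.layers.0.attn.
kv_caches = {"model.layers.0.attn": torch.zeros(2, 4, 16, 8, 64)}
bind_shared_kv_caches(kv_caches, {"model.layers.1.attn": "model.layers.0.attn"})
assert kv_caches["model.layers.1.attn"] is kv_caches["model.layers.0.attn"]
```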
We also add these layers to the list of layer names kept by each KV cache group, as this ensures that each layer is assigned its own attention metadata. From the perspective of the `Attention` layer, it does not know where the key and value caches come from.
The memory savings of cross-layer KV sharing allow a given amount of memory to accommodate longer context lengths or enable more requests to be processed in parallel.
Testing
Sanity Check
As a sanity check that the implementation is working, I made all layers after the 18th layer in Qwen/Qwen3-8B (36 layers total) share the KV cache of layer 18, and printed out the id() of the KV cache used in the attention forward pass:
As expected, layers 19 to 36 are re-using the KV cache allocated by layer 18.
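A check along these lines can be done with a small helper that records the `id()` of each attention layer's cache; for example (the module-walking below is an illustration, not the exact script used):

```python
import torch.nn as nn


def report_kv_cache_ids(model: nn.Module) -> None:
    """Print which attention layers point at the same KV cache object;
    layers that share a cache print identical ids."""
    for name, module in model.named_modules():
        kv_cache = getattr(module, "kv_cache", None)
        if kv_cache is not None:
            print(f"{name}: id(kv_cache) = {id(kv_cache)}")
```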
Unit Tests
All newly added unit tests pass:
Evals
Checked the gsm8k score before and after my PR on Qwen/Qwen3-8B:
Before PR:
After PR: