[V1][Core] Support offloading KV cache to CPU. #13377
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Is V0 supported?
Hi @mengzhu28, thanks for submitting the great PR! I will reach out to you offline.
This pull request has merge conflicts that must be resolved before it can be merged.
@mengzhu28 Thanks for this great work on V1. Offloading the KV cache to CPU can improve TTFT and throughput. Thinking about the next step based on this PR: maybe vLLM could support offloading the KV cache to disk as follow-up work?
I left a comment inline about abstraction, please take a look, thanks.
# The following swap maps are accumulated over a scheduling step.
# Then they are "flushed" as part of the scheduler output.
# GPU block ID -> CPU block ID
self.step_d2h_swap_map: Dict[int, int] = {}
Could you please add an abstraction to support offloading to disk in the future? With such an abstraction, the data structure could be [ (src_device, dst_device) -> swap_map[src_block_id -> dst_block_id] ]. Any thoughts?
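A minimal sketch of the suggested generalization (the `Device` enum, `SwapMaps` alias, and `record_swap` helper are hypothetical names for illustration, not part of this PR):

```python
from enum import Enum
from typing import Dict, Tuple


class Device(Enum):
    GPU = "gpu"
    CPU = "cpu"
    DISK = "disk"


# (src_device, dst_device) -> {src_block_id: dst_block_id}
SwapMaps = Dict[Tuple[Device, Device], Dict[int, int]]


def record_swap(maps: SwapMaps, src: Device, dst: Device,
                src_block: int, dst_block: int) -> None:
    """Accumulate one block copy for the current scheduling step."""
    maps.setdefault((src, dst), {})[src_block] = dst_block


# The existing GPU -> CPU map becomes the (GPU, CPU) entry; a future disk
# tier would only add (CPU, DISK) or (GPU, DISK) entries.
step_swap_maps: SwapMaps = {}
record_swap(step_swap_maps, Device.GPU, Device.CPU, src_block=7, dst_block=3)
```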
kv_caches: Dict[str, torch.Tensor],
forward_context: Dict[str, "Attention"],
runner_kv_caches: List[torch.Tensor],
forward_context: Dict[str, "Attention"],
Changing the order of these parameters would make more sense, but on the other hand it introduces more code changes.
@mengzhu28 Could you please rebase the PR?
@WoosukKwon as discussed offline, created RFC #16144.
Would it be better to abstract the CPU offloading related functions into a new class and add a parameter to enable it?
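One possible shape for such an abstraction (a sketch only; the class name `CPUOffloader`, the `enable_cpu_offload` flag, and the method signatures are assumptions, not taken from the PR):

```python
from typing import Optional


class CPUOffloader:
    """Hypothetical container for all CPU-offload state and logic, keeping
    the core KV cache manager unchanged when offloading is disabled."""

    def __init__(self, num_cpu_blocks: int) -> None:
        self.num_cpu_blocks = num_cpu_blocks

    def lookup(self, block_hash: int) -> Optional[int]:
        """Return the CPU block ID caching this block hash, if any."""
        raise NotImplementedError

    def offload(self, block_hash: int, gpu_block_id: int) -> None:
        """Schedule a GPU -> CPU copy for an evicted cached block."""
        raise NotImplementedError

    def end_schedule_step(self) -> None:
        """Reset per-step swap maps after the scheduler output is built."""
        raise NotImplementedError


class KVCacheManager:
    def __init__(self, enable_cpu_offload: bool = False,
                 num_cpu_blocks: int = 0) -> None:
        # Offloading is opt-in; with the flag off the manager behaves as today.
        self.offloader: Optional[CPUOffloader] = (
            CPUOffloader(num_cpu_blocks) if enable_cpu_offload else None)
```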
num_computed_tokens -= self.block_size
num_new_tokens = self.block_size
computed_blocks.pop()
if computed_blocks:
The GPU hit must come before the CPU hit, so here we should first try to pop() from computed_cpu_blocks.
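A small illustration of the suggested order (`computed_blocks` and `computed_cpu_blocks` follow the names in the diff and comment; the helper itself is hypothetical):

```python
from typing import List


def drop_last_computed_block(computed_blocks: List[int],
                             computed_cpu_blocks: List[int]) -> None:
    """GPU-hit blocks form the prefix of the cached blocks and CPU-hit blocks
    the suffix, so the trailing block should be dropped from the CPU list
    first, and from the GPU list only when no CPU-hit blocks remain."""
    if computed_cpu_blocks:
        computed_cpu_blocks.pop()
    else:
        computed_blocks.pop()
```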
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Hello, can this support a 1-CPU/n-GPU single-host situation?
TL;DR
In V1, swap GPU KV cache blocks to CPU upon eviction and swap them back if there's a cache hit.
Swap Strategy
CPU → GPU swap-in happens naturally when requests hit the cache (unless we do prefetching).
GPU → CPU swap-out can be handled in two ways: (1) eagerly offload blocks ahead of eviction, or (2) offload a block only when it is evicted from the GPU.
This PR adopts (2) to minimize unnecessary swaps. However, the downside is that the swap-out overhead might be exposed.
Ideally, an optimal approach would asynchronously offload X cache blocks at a certain cadence (e.g., hidden behind the main CUDA graph) while maintaining free GPU block headroom. This would add complexity and is left for future work.
Implementation
This PR builds on the excellent V1 KV cache manager, blending in with the existing interface.
Newly introduced metadata states: cpu_block_pool and cached_block_hash_to_cpu_block mirror their GPU counterparts.
High-Level Flow:
For simplicity, we avoid threading the scheduler output through multiple KV cache manager calls. Instead, swap-related data is accumulated in step_* fields (e.g., step_h2d_swap_map). A new end_schedule_step callback resets them at the end of each scheduling iteration. (Open to alternative designs.)
CPU Cache Eviction Policy
We currently adopt a simple round-robin strategy for CPU cache eviction. LRU will be added in a follow-up PR.
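For illustration, round-robin eviction over the CPU pool can be as simple as a moving cursor (a sketch; the class name and the `cpu_block_to_hash` reverse map are made up):

```python
from typing import Dict


class RoundRobinCPUEvictor:
    """Evict CPU cache blocks in a fixed circular order."""

    def __init__(self, num_cpu_blocks: int) -> None:
        self.num_cpu_blocks = num_cpu_blocks
        self.cursor = 0

    def evict_one(self, cpu_block_to_hash: Dict[int, int],
                  cached_block_hash_to_cpu_block: Dict[int, int]) -> int:
        """Free the next CPU block in round-robin order, dropping its cache
        entry if the block currently holds a cached KV block."""
        victim = self.cursor
        self.cursor = (self.cursor + 1) % self.num_cpu_blocks
        block_hash = cpu_block_to_hash.pop(victim, None)
        if block_hash is not None:
            cached_block_hash_to_cpu_block.pop(block_hash, None)
        return victim
```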
User Configuration:
We reuse the existing --swap-space flag (previously unused in V1) to control the number of CPU blocks. Whether to change the default (currently 4 GB) remains up for discussion.
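A back-of-the-envelope example of how a --swap-space budget maps to CPU blocks (the model shape and the exact sizing formula here are illustrative assumptions, not taken from the PR):

```python
# Per-block KV bytes: 2 (K and V) * layers * block_size * kv_heads * head_dim * dtype bytes.
num_layers, block_size, num_kv_heads, head_dim, dtype_bytes = 32, 16, 8, 128, 2
bytes_per_block = 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes

swap_space_gib = 4  # the current --swap-space default
num_cpu_blocks = swap_space_gib * 1024**3 // bytes_per_block
print(bytes_per_block, num_cpu_blocks)  # 2097152 bytes (2 MiB) per block -> 2048 CPU blocks
```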
Benchmark
TBA
TODO