Conversation


@mengzhu28 mengzhu28 commented Feb 17, 2025

TL;DR

In V1, swap GPU KV cache blocks to CPU upon eviction and swap them back if there's a cache hit.

Swap Strategy

CPU → GPU swap-in happens naturally when requests hit the cache (unless we do prefetching).
GPU → CPU swap-out can be handled in two ways:

  1. Eagerly: Immediately after a request completes and its blocks are freed.
  2. Lazily: When evicting a GPU block while scheduling new requests.

This PR adopts (2) to minimize unnecessary swaps. The downside is that the swap-out latency may be exposed on the critical path.

Ideally, we would asynchronously offload X cache blocks at a regular cadence (e.g., hidden behind the main CUDA graph) while maintaining free GPU block headroom. This would add complexity and is left for future work.
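A minimal sketch of the lazy swap-out path adopted here; only the step_d2h_swap_map and cached_block_hash_to_cpu_block names come from this PR, and the helper function and its parameters are hypothetical:

```python
from typing import Dict, List

def lazy_swap_out(
    gpu_block_id: int,
    block_hash: int,
    free_cpu_block_ids: List[int],        # free slots in the CPU pool
    step_d2h_swap_map: Dict[int, int],    # accumulated GPU -> CPU copies
    cached_block_hash_to_cpu_block: Dict[int, int],
) -> None:
    """Option (2): offload a GPU block only when it is being evicted."""
    # Take a free CPU slot; a full pool would first trigger CPU-side
    # eviction (round-robin in this PR).
    cpu_block_id = free_cpu_block_ids.pop()
    # Record the pending device-to-host copy; the model runner later
    # issues one aggregated swap call before model execution.
    step_d2h_swap_map[gpu_block_id] = cpu_block_id
    # Make the block discoverable for future CPU cache hits.
    cached_block_hash_to_cpu_block[block_hash] = cpu_block_id
```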

Implementation

This PR builds on the excellent V1 KV cache manager and blends in with the existing interface.
Newly introduced metadata states:

  • cpu_block_pool and cached_block_hash_to_cpu_block mirror their GPU counterparts.

High-Level Flow:

  • The KV cache manager accumulates swap-in/out decisions during each scheduling cycle.
  • These swap decisions are then "flushed" to the scheduler output, allowing model runners to issue aggregated swap calls before model execution, minimizing dispatch overhead.

For simplicity, we avoid threading the scheduler output through multiple KV cache manager calls. Instead, swap-related data is accumulated in step_* fields (e.g., step_h2d_swap_map).
A new end_schedule_step callback resets them at the end of each scheduling iteration. (Open to alternative designs.)
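A minimal sketch of this accumulate-then-flush flow; the step_* field and end_schedule_step names come from this PR, while the SchedulerOutput fields and the flush method are hypothetical:

```python
from typing import Dict

class KVCacheManager:
    def __init__(self) -> None:
        # Accumulated over one scheduling step, then flushed.
        self.step_d2h_swap_map: Dict[int, int] = {}  # GPU block -> CPU block
        self.step_h2d_swap_map: Dict[int, int] = {}  # CPU block -> GPU block

    def flush_swaps(self, scheduler_output) -> None:
        # Hand the aggregated maps to the scheduler output so model
        # runners can issue batched swap calls before model execution.
        scheduler_output.d2h_swap_map = dict(self.step_d2h_swap_map)
        scheduler_output.h2d_swap_map = dict(self.step_h2d_swap_map)

    def end_schedule_step(self) -> None:
        # Reset per-step swap state at the end of each scheduling iteration.
        self.step_d2h_swap_map.clear()
        self.step_h2d_swap_map.clear()
```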

CPU Cache Eviction Policy

We currently adopt a simple round-robin strategy for CPU cache eviction; LRU will be added in a follow-up PR.
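A minimal sketch of round-robin eviction over a fixed-size CPU pool; the cpu_block_pool concept comes from this PR, but the class, cursor, and (omitted) in-use handling are simplified assumptions:

```python
from typing import Optional, Tuple

class CpuBlockPool:
    def __init__(self, num_cpu_blocks: int) -> None:
        self.blocks: list = [None] * num_cpu_blocks  # cached block metadata
        self.cursor = 0  # round-robin eviction pointer

    def evict(self) -> Tuple[int, Optional[object]]:
        # Evict whichever slot the cursor points at, regardless of
        # recency; an LRU policy is left to a follow-up PR.
        victim_id = self.cursor
        self.cursor = (self.cursor + 1) % len(self.blocks)
        evicted = self.blocks[victim_id]
        self.blocks[victim_id] = None
        return victim_id, evicted
```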

User Configuration:

We reuse the existing --swap-space flag (previously unused in V1) to control the number of CPU blocks.
Whether to change the default (currently 4GB) remains up for discussion.
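For reference, the block count derives from the swap space and the per-block KV footprint. Illustrative arithmetic only; the model-config values below (32 layers, block size 16, 8 KV heads, head size 128, fp16) are example assumptions:

```python
swap_space_bytes = 4 * 1024**3  # --swap-space 4 (GiB), the current default
# 2 (K and V) * layers * block_size * kv_heads * head_size * 2 bytes (fp16)
bytes_per_block = 2 * 32 * 16 * 8 * 128 * 2
num_cpu_blocks = swap_space_bytes // bytes_per_block  # -> 2048 here
```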

Benchmark

TBA

TODO

  • write tests
  • benchmarks and profiling
  • docs


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


mergify bot commented Feb 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mengzhu28.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@BigCousin-z

Is V0 supported?

@WoosukKwon WoosukKwon self-assigned this Feb 18, 2025
@mengzhu28 mengzhu28 force-pushed the mzhu/cpu_offload branch 2 times, most recently from 8d7835f to cc4a3e2 on February 19, 2025 at 01:16
@mergify mergify bot removed the needs-rebase label Feb 19, 2025
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 19, 2025
@ywang96 ywang96 marked this pull request as ready for review February 19, 2025 20:30
@WoosukKwon
Collaborator

Hi @mengzhu28, thanks for submitting the great PR! I will reach out to you offline.


mergify bot commented Mar 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mengzhu28.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 15, 2025
Contributor

@maobaolong maobaolong left a comment


@mengzhu28 Thanks for this great work on V1! Offloading the KV cache to CPU can improve TTFT and throughput. Thinking about the next step based on this PR: maybe vLLM could support offloading the KV cache to disk as follow-up work?

I left a comment inline about abstraction, please take a look, thanks.

# The following swap maps are accumulated over a scheduling step.
# Then they are "flushed" as part of the scheduler output.
# GPU block ID -> CPU block ID
self.step_d2h_swap_map: Dict[int, int] = {}
Contributor


Could you please add an abstraction to support offloading to disk in the future? With that abstraction, the data structure could be [ [src_device, dst_device] -> swap_map[src_block_id -> dst_block_id] ]. Any thoughts?
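A minimal sketch of the proposed device-pair abstraction; the Device enum and map layout are illustrative assumptions, not code from this PR:

```python
from enum import Enum
from typing import Dict, Tuple

class Device(Enum):
    GPU = "gpu"
    CPU = "cpu"
    DISK = "disk"

# (src_device, dst_device) -> {src_block_id -> dst_block_id}
SwapMaps = Dict[Tuple[Device, Device], Dict[int, int]]

swap_maps: SwapMaps = {
    (Device.GPU, Device.CPU): {},   # today's d2h swap-out path
    (Device.CPU, Device.GPU): {},   # today's h2d swap-in path
    (Device.CPU, Device.DISK): {},  # possible future offload tier
}
```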

kv_caches: Dict[str, torch.Tensor],
forward_context: Dict[str, "Attention"],
runner_kv_caches: List[torch.Tensor],
forward_context: Dict[str, "Attention"],
Contributor


Changing the order of these parameters would make more sense, but on the other hand it introduces more code changes.

@WoosukKwon
Collaborator

@mengzhu28 Could you please rebase the PR?

@mengzhu28
Author

@WoosukKwon as discussed offline, created RFC #16144

@chunxiaozheng
Contributor

Would it be better to abstract the CPU-offloading-related functions into a new class and add a parameter to enable it?

num_computed_tokens -= self.block_size
num_new_tokens = self.block_size
computed_blocks.pop()
if computed_blocks:
Contributor


The GPU hit must come before the CPU hit, so here we should first try to pop() from computed_cpu_blocks.
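To illustrate the suggestion (names follow the quoted diff; the exact control flow is an assumption, not the PR's code):

```python
# Back off one block: GPU-cached blocks are matched before CPU-cached
# ones, so the trailing computed block lives in computed_cpu_blocks
# whenever that list is non-empty.
num_computed_tokens -= self.block_size
num_new_tokens = self.block_size
if computed_cpu_blocks:
    computed_cpu_blocks.pop()
else:
    computed_blocks.pop()
```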

@josephrocca josephrocca mentioned this pull request Jun 15, 2025
66 tasks
@orozery orozery mentioned this pull request Jun 19, 2025
1 task

github-actions bot commented Jul 9, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Jul 9, 2025
@jiawei-liang

Hello, can this support a single-host setup with 1 CPU and N GPUs?

@github-actions github-actions bot added unstale Received activity after being labelled stale and removed stale Over 90 days of inactivity labels Jul 11, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added stale Over 90 days of inactivity and removed unstale Received activity after being labelled stale labels Oct 10, 2025