Motivation.
Offloading the device KV cache to the CPU can be worthwhile when the transfer overhead is lower than the cost of recomputation, saving precious GPU cycles. This is especially useful in cases such as long, multi-turn conversations. Additionally, hardware improvements such as Nvidia C2C greatly accelerate CPU-GPU communication, making offloading even more compelling.
Proposed Change.
Design
The design space for KV cache offloading is broad. As an initial goal, we propose focusing primarily on offloading to the CPU. While we aim to keep the interface and implementation extensible—enabling future support for offloading to other mediums such as disk or remote storage—these are out of scope for this RFC.
A key design consideration is determining when to swap KV cache blocks out to the CPU and when to swap them back into the device.
For swap-out, the earliest opportunity is immediately after a KV cache block is generated, while the latest is just before it is evicted from the device. For swap-in, the earliest timing can be guided by a prefetching policy, while the latest is just before the next forward() call.
In this RFC, we propose a lazy swap-in/swap-out approach that runs after each scheduling step. Optimizations such as eager eviction, prefetching, or even layer-wise transfers can be added independently in the future to improve performance.
Specifically, during each scheduling step, the KV cache manager will accumulate swap-in and swap-out decisions for each request, and generate a swap plan at the end of the step. This swap plan becomes part of the scheduler output and is executed by the model runner prior to the model forward.
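The per-step flow above could be sketched roughly as follows. Note that `SwapPlan`, `BlockSwap`, and the `record_*` method names are illustrative placeholders, not the actual vLLM classes:

```python
from dataclasses import dataclass, field


@dataclass
class BlockSwap:
    src_block_id: int  # block index in the source tier
    dst_block_id: int  # block index in the destination tier


@dataclass
class SwapPlan:
    swap_in: list[BlockSwap] = field(default_factory=list)   # CPU -> GPU
    swap_out: list[BlockSwap] = field(default_factory=list)  # GPU -> CPU


class KVCacheManager:
    """Hypothetical sketch: accumulates swap decisions during a step."""

    def __init__(self) -> None:
        self._plan = SwapPlan()

    def record_swap_in(self, cpu_block: int, gpu_block: int) -> None:
        self._plan.swap_in.append(BlockSwap(cpu_block, gpu_block))

    def record_swap_out(self, gpu_block: int, cpu_block: int) -> None:
        self._plan.swap_out.append(BlockSwap(gpu_block, cpu_block))

    def end_schedule_step(self) -> SwapPlan:
        # Hand the accumulated plan over (to be attached to the scheduler
        # output) and reset for the next step; the model runner would
        # execute the plan before the model forward.
        plan, self._plan = self._plan, SwapPlan()
        return plan
```

The key point is that swap decisions are only accumulated during scheduling; the actual data movement is deferred to the model runner, which sees the full plan at once.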
Interface
The KV cache manager will continue to manage all the metadata with roughly the same API with minor changes:
- `get_computed_blocks`: Additionally returns the set of KV blocks currently cached on the CPU.
- `allocate_slots`: Additionally allocates new device blocks to host the CPU blocks scheduled for swap-in (i.e., the cache-hit CPU blocks returned by `get_computed_blocks`).
- `end_schedule_step`: A new hook called at the end of a scheduler step; it saves the full "swap plan" to the scheduler output. This simplifies the code by avoiding the need to thread scheduler state through the KV cache manager internals.
BlockPool
Inside the KV cache manager, we refactor `BlockPool` into an abstract base class and derive from it, allowing for tier-specific implementations.
Abstract base class `BlockPool` has the following methods:
- `get_num_free_blocks()`
- `get_usage()`
- `get_cached_block(self, block_hash: BlockHashType) -> Optional[KVCacheBlock]`
- `get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]`
- `_maybe_evict_cached_block(self, block: KVCacheBlock) -> bool`: to support eviction to next-tier storage, this can take in another block pool to shelter the evicted block
- (new) `_maybe_shelter_evicted_block`: optionally used by a lower-tier block pool to shelter blocks evicted from an upper tier.

`GpuBlockPool(BlockPool)`: Contains the current logic used for managing GPU memory.
`CpuBlockPool(BlockPool)`: A new implementation to manage CPU-side KV cache blocks.
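As a rough illustration of the proposed hierarchy, with `KVCacheBlock` and the CPU pool internals simplified to placeholders (the real vLLM types carry more metadata):

```python
from abc import ABC, abstractmethod
from typing import Optional


class KVCacheBlock:
    """Stand-in for vLLM's block metadata type."""

    def __init__(self, block_id: int, block_hash: Optional[str] = None):
        self.block_id = block_id
        self.block_hash = block_hash


class BlockPool(ABC):
    @abstractmethod
    def get_num_free_blocks(self) -> int: ...

    @abstractmethod
    def get_usage(self) -> float: ...

    @abstractmethod
    def get_cached_block(self, block_hash: str) -> Optional[KVCacheBlock]: ...

    @abstractmethod
    def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]: ...

    def _maybe_evict_cached_block(
        self, block: KVCacheBlock, next_tier: Optional["BlockPool"] = None
    ) -> bool:
        # When a next-tier pool is given, offer it the evicted block.
        if next_tier is not None:
            next_tier._maybe_shelter_evicted_block(block)
        block.block_hash = None
        return True

    def _maybe_shelter_evicted_block(self, block: KVCacheBlock) -> bool:
        # (new) A lower tier may override this to keep blocks evicted
        # from the tier above; by default, decline.
        return False


class CpuBlockPool(BlockPool):
    """Toy CPU-tier pool: tracks free blocks and a hash -> block map."""

    def __init__(self, num_blocks: int):
        self._free = [KVCacheBlock(i) for i in range(num_blocks)]
        self._cached: dict[str, KVCacheBlock] = {}
        self._total = num_blocks

    def get_num_free_blocks(self) -> int:
        return len(self._free)

    def get_usage(self) -> float:
        return 1.0 - len(self._free) / self._total

    def get_cached_block(self, block_hash: str) -> Optional[KVCacheBlock]:
        return self._cached.get(block_hash)

    def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]:
        assert num_blocks <= len(self._free)
        taken, self._free = self._free[:num_blocks], self._free[num_blocks:]
        return taken

    def _maybe_shelter_evicted_block(self, block: KVCacheBlock) -> bool:
        if block.block_hash is not None and self._free:
            dst = self._free.pop()
            # The actual GPU -> CPU memcpy would be scheduled here.
            self._cached[block.block_hash] = dst
            return True
        return False
```

The design choice here is that eviction in one tier can hand the block to the next tier via `_maybe_shelter_evicted_block`, keeping each pool unaware of the other tiers' internals.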
User-Facing Configuration
We can repurpose the existing --swap-space flag (previously unused in V1) to control the number of CPU cache blocks. However, the current default of 4GB may need to be re-evaluated.
Performance
In the initial version, we will try to hide the transfer latency with asynchronous transfers (e.g., pinned memory, `cudaMemcpyAsync`, and/or separate CUDA streams). For the CPU eviction policy, we will start with round-robin for simplicity; LRU will be added next.
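The LRU policy could look roughly like the following sketch (the class name and interface are hypothetical; the real implementation would operate on vLLM's block metadata rather than plain hashes):

```python
from collections import OrderedDict
from typing import Optional


class LRUEvictionPolicy:
    """Toy LRU policy over CPU cache blocks keyed by content hash.

    Round-robin, the simpler first version, would instead just cycle
    through block ids without tracking recency.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        # Insertion order doubles as recency order: oldest entry first.
        self._blocks: "OrderedDict[str, int]" = OrderedDict()

    def touch(self, block_hash: str) -> None:
        # Mark a block as most recently used.
        self._blocks.move_to_end(block_hash)

    def insert(self, block_hash: str, block_id: int) -> Optional[int]:
        # Returns the block id evicted to make room, if any.
        evicted = None
        if block_hash not in self._blocks and len(self._blocks) >= self.capacity:
            _, evicted = self._blocks.popitem(last=False)  # least recently used
        self._blocks[block_hash] = block_id
        self._blocks.move_to_end(block_hash)
        return evicted
```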
In the future, we could employ more sophisticated techniques such as prefetching, eager swap-out, and layer-wise transfer (whose implementation can likely be shared with disaggregated prefill) to further hide the transfer latency.
Plan
- Refactor `BlockPool`
- Add `CpuBlockPool` and the rest of the KV cache manager logic to generate a "swap plan" for each scheduler step
- Add the logic that consumes the "swap plan" and executes the swaps
- Benchmark and apply low-hanging-fruit optimizations
Feedback Period.
one week
CC List.
@comaniac @WoosukKwon @zhuohan123 @simon-mo
Any Other Things.
An initial prototype is implemented in #13377