Motivation.
Nowadays, many new applications, including multi-turn conversation, multi-modality, and multi-agent workloads, require a significant amount of KV cache. Such applications generally share a common prompt across multiple requests, and recomputing it each time can take significant prefill time. Suppose the length of the shared prompt is L tokens: every request that shares it repeats a prefill whose cost grows quadratically in L.
However, current vLLM rarely uses the secondary storage tier (CPU DRAM). It only swaps out running sequences when they fail to allocate blocks for newly generated tokens, which rarely happens in the workloads I tested (ShareGPT, UltraChat, LooGLE, ToolBench). When vLLM allocates blocks for prefill sequences, it simply discards the contents of the GPU blocks held in the evictor. Instead, it could evict those blocks to CPU DRAM so that a new request sharing the same prompt can load them back without recomputation.
I have some motivation numbers to support this RFC. The first is the cost of recomputation versus that of swap-in plus compute. Here I use tp=1, pp=1 on one A100; the model is togethercomputer/Llama-2-7B-32K-Instruct. The x-axis is the token length and the y-axis is the TTFT for a single sequence (no queueing). In the figure, "prefill w/o cache" means recomputation; "prefill w cache" means the KV cache is in HBM and we compute only the very next token; "prefill w cache + swapin" means the KV cache is in DRAM, so we first swap it from DRAM to HBM and then compute the very next token; "swap-in" is just the swap-in time of the previous case. I did not enable any overlapping between swap-in and execution.
As the figure shows, recomputation time increases quadratically, and its latency exceeds that of "prefill w cache + swapin" from 1024 tokens onwards.
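For intuition on why swap-in wins for long prompts: the KV-cache volume, and therefore the swap-in time, grows only linearly with prompt length, while recomputation grows quadratically as shown above. Below is a rough back-of-the-envelope estimate under my own assumptions (fp16 KV cache, ~20 GB/s effective host-to-device bandwidth); these are not figures from the experiment above.

```python
# Back-of-the-envelope KV-cache transfer estimate (illustrative assumptions only).
# Llama-2-7B: 32 layers, 32 KV heads, head_dim 128, fp16 K and V.
BYTES_PER_TOKEN = 32 * 32 * 128 * 2 * 2      # ~0.5 MiB of KV cache per token
H2D_BANDWIDTH = 20e9                          # assumed effective PCIe bandwidth, bytes/s

for num_tokens in (1024, 4096, 16384):
    kv_bytes = num_tokens * BYTES_PER_TOKEN
    print(f"{num_tokens:6d} tokens: {kv_bytes / 2**30:5.2f} GiB KV cache, "
          f"swap-in ~ {kv_bytes / H2D_BANDWIDTH * 1e3:6.1f} ms")
```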
I have also implemented a prototype of this RFC that evicts blocks from HBM to DRAM when prefix caching is enabled. In the next figure, I tested longdep_qa.json from the LooGLE benchmark [3]. The dataset's average prompt length is about 15k tokens. I used the same setup as the previous experiment and set the request rate to 1000. Here, the x-axis is the DRAM size, and the y-axis is either the mean TTFT or the mean TPOT.
There is still some variance in the benchmark, which I am looking into. Another issue is that performance appears to degrade once the DRAM size exceeds a certain threshold; I am still investigating the cause. The performance dip at small DRAM sizes is expected, since it reflects the pure cost of data movement.
References
[1] Gao, Bin, et al. "Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention." 2024 USENIX Annual Technical Conference (USENIX ATC 24), 2024.
[2] Sheng, Ying, et al. "FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU." International Conference on Machine Learning (ICML 2023), 2023.
[3] Li, Jiaqi, et al. "LooGLE: Can Long-Context Language Models Understand Long Contexts?" arXiv preprint arXiv:2311.04939, 2023.
Proposed Change.
In my mind, this change takes three stages:
- Implement the basic functionality to enable eviction and promotion to and from the DRAM
- Enable the overlapping of KV cache transfer with the model execution (instead of finishing the transmission and then starting execution, transmission and execution are done layer-by-layer)
- Enable selective eviction to DRAM based on prompt length.
I have already built a prototype for stage 1, which does not require too many changes. The necessary changes include (see the sketch after this list):
- Change `block_manager_v[1,2].py` to allocate CPU blocks on eviction and to query the `cpu_allocator` for a block hash
- Change `scheduler.py` to add `blocks_to_swap_in` and `blocks_to_swap_out` for `_scheduler_prefill`
- Change the order of `swap_out` and `swap_in` in `worker.py`
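Here is a minimal sketch of the stage-1 flow. The class and method names below (`HierarchicalBlockPool`, `on_gpu_eviction`, `lookup_for_prefill`) are hypothetical and only illustrate the intent; the real changes live in the block manager and scheduler as listed above.

```python
# Sketch only: keep evicted GPU blocks alive in DRAM, keyed by content hash,
# so a later prefill that shares the prompt can swap them back in.
from typing import Dict, List, Tuple

class HierarchicalBlockPool:
    """Hypothetical helper tracking prefix-cache blocks that were spilled to DRAM."""

    def __init__(self, num_cpu_blocks: int) -> None:
        self.cpu_blocks: Dict[int, int] = {}          # content hash -> CPU block id
        self.free_cpu_ids: List[int] = list(range(num_cpu_blocks))

    def on_gpu_eviction(self, block_hash: int, gpu_block_id: int,
                        blocks_to_swap_out: List[Tuple[int, int]]) -> None:
        """Instead of discarding an evicted GPU block, schedule a GPU->CPU copy."""
        if block_hash in self.cpu_blocks or not self.free_cpu_ids:
            return                                     # already cached or DRAM full: drop as today
        cpu_block_id = self.free_cpu_ids.pop()
        self.cpu_blocks[block_hash] = cpu_block_id
        blocks_to_swap_out.append((gpu_block_id, cpu_block_id))

    def lookup_for_prefill(self, block_hash: int, gpu_block_id: int,
                           blocks_to_swap_in: List[Tuple[int, int]]) -> bool:
        """If a prompt block is cached in DRAM, schedule a CPU->GPU copy for it."""
        cpu_block_id = self.cpu_blocks.get(block_hash)
        if cpu_block_id is None:
            return False                               # miss: this block is recomputed in prefill
        blocks_to_swap_in.append((cpu_block_id, gpu_block_id))
        return True
```

The scheduler would then attach the `blocks_to_swap_in` / `blocks_to_swap_out` lists to the prefill scheduling output, and the worker must issue the swap-out copies before the swap-in copies, which is why the ordering in `worker.py` changes.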
The following are the things left for stage 1:
- Extensive testing. Right now I have only written tests for a very small case and checked that it works on a real benchmark (matching input and output).
- Add a free timestamp for CPU blocks, because `access_blocks_for_seqs` is never invoked for CPU blocks; this leads to some undesired behavior in the eviction policy.
- Decide what to do when a block is dropped from both CPU and GPU memory in the middle of a prompt that has cached tokens both before and after it (the oooooxxooooxx pattern). We probably need to change the kernel to support this case; I have seen it happen in testing. A possible fallback is sketched after this list.
- Support encoder-decoder models
- Support `block_manager_v2`
- Check what happens when chunked prefill is enabled
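For the oooooxxooooxx case above, one possible fallback (my suggestion, not part of the prototype) is to reuse only the longest contiguous cached prefix and recompute everything after the first missing block, which avoids kernel changes at the cost of some recomputation:

```python
def usable_cached_prefix(block_cached: list[bool]) -> int:
    """Return how many leading prompt blocks can be reused as-is.

    block_cached[i] is True if prompt block i is present in HBM or DRAM.
    For a pattern like o o o o o x x o o o o x x we stop at the first hole,
    so only the first 5 blocks are reused and the rest are recomputed.
    """
    n = 0
    for cached in block_cached:
        if not cached:
            break
        n += 1
    return n

# Example: the "oooooxxooooxx" pattern from the list above.
pattern = [c == "o" for c in "oooooxxooooxx"]
print(usable_cached_prefix(pattern))  # -> 5
```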
Some changes I deemed necessary for stage 2:
- Change the model to support layer-by-layer transmission (see the sketch below)
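A minimal sketch of the layer-by-layer overlap, assuming plain PyTorch streams and events; tensor names and sizes are placeholders, not vLLM's actual KV-cache layout:

```python
import torch

# Illustrative sizes only; in vLLM these would be the per-layer KV-cache blocks
# being swapped in for a sequence.
num_layers = 4
block_numel = 1 << 20
cpu_kv = [torch.zeros(block_numel, dtype=torch.float16, pin_memory=True)
          for _ in range(num_layers)]
gpu_kv = [torch.empty(block_numel, dtype=torch.float16, device="cuda")
          for _ in range(num_layers)]
x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

copy_stream = torch.cuda.Stream()
copied = [torch.cuda.Event() for _ in range(num_layers)]

# Enqueue all per-layer H2D copies on a side stream.
with torch.cuda.stream(copy_stream):
    for i in range(num_layers):
        gpu_kv[i].copy_(cpu_kv[i], non_blocking=True)  # pinned memory -> async copy
        copied[i].record(copy_stream)

# On the compute stream, layer i waits only for its own copy, so the copy for
# layer i overlaps with the execution of layers 0..i-1.
for i in range(num_layers):
    torch.cuda.current_stream().wait_event(copied[i])
    x = x @ x  # stand-in for the model's per-layer forward pass using gpu_kv[i]
torch.cuda.synchronize()
```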
Some changes I deemed necessary for stage 3:
- Add a dry run to find the tipping point (see the sketch below).
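One possible shape for this policy (a hypothetical helper, not in the prototype): measure prefill latency at a few prompt lengths during a dry run at startup, then evict a sequence's KV cache to DRAM only if reloading it is estimated to be cheaper than recomputing it.

```python
# Hypothetical stage-3 policy sketch: compare estimated swap-in time against
# estimated recompute time for a given prompt length.
from typing import Dict

def should_evict_to_dram(num_tokens: int,
                         bytes_per_token: int,
                         h2d_bandwidth: float,
                         prefill_latency: Dict[int, float]) -> bool:
    swap_in = num_tokens * bytes_per_token / h2d_bandwidth
    # Scale the nearest dry-run measurement; quadratic scaling matches the
    # trend observed in the motivation figure.
    nearest = min(prefill_latency, key=lambda n: abs(n - num_tokens))
    recompute = prefill_latency[nearest] * (num_tokens / nearest) ** 2
    return swap_in < recompute

# Example with made-up dry-run measurements (seconds).
profile = {1024: 0.09, 4096: 0.9, 16384: 12.0}
print(should_evict_to_dram(8192, 512 * 1024, 20e9, profile))  # True if swap-in wins
```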
Feedback Period.
2-3 weeks
CC List.
@zhuohan123 @robertgshaw2-neuralmagic @zcnrex @sh1ng @SageMoore @comaniac @youkaichao @andoorve
Any Other Things.
No response