[Feature]: Elastic KV memory management based on usage

### 🚀 The feature, motivation and pitch

Hello there, I'm wondering whether the KV cache's memory footprint could be adjusted dynamically at runtime, based on current system conditions and workload demand. Here are two example scenarios:

1) When users do not need to run inference for a while and GPU/CPU memory usage is low, the serving framework could free up memory for other tasks (e.g., other AI workloads).

2) When users need to handle many concurrent requests at runtime, the serving framework could increase its GPU memory utilization without a restart.

The problem is that mainstream serving frameworks, such as vLLM, SGLang, and TGI, typically pre-allocate a fixed fraction of GPU memory at server startup (in vLLM, this is set by `gpu_memory_utilization`). As a result, the pre-allocated GPU memory cannot be adjusted without a restart.

I believe this is a real challenge for vLLM, which depends on PagedAttention. As far as I understand, sequences are mapped onto paged KV memory blocks. In a low-usage scenario (like case 1), freeing memory would require identifying which KV memory blocks to evict and choosing an appropriate eviction policy or ratio. However, it's not clear how to accurately correlate "current demand" with the "actual KV memory blocks needed", particularly in PagedAttention-based frameworks.
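To make the bookkeeping side of this concrete, here is a minimal sketch of a block pool that can shrink by giving up only free blocks, so no eviction policy is needed in the simplest case. `ElasticBlockPool` and all of its methods are hypothetical names for illustration, not vLLM's actual block manager. It also only models block IDs: actually returning memory to the device would additionally require resizable backing storage, which is an assumption here.

```python
class ElasticBlockPool:
    """Toy pool of KV block IDs (hypothetical; not vLLM's block manager)."""

    def __init__(self, num_blocks: int) -> None:
        self.free: list[int] = list(range(num_blocks))
        self.owner: dict[int, str] = {}  # block_id -> seq_id
        self._next_id = num_blocks       # next unused block ID for grow()

    @property
    def total_blocks(self) -> int:
        return len(self.free) + len(self.owner)

    def allocate(self, seq_id: str, n: int) -> list[int]:
        """Assign n free blocks to a sequence."""
        if n > len(self.free):
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free.pop() for _ in range(n)]
        for b in blocks:
            self.owner[b] = seq_id
        return blocks

    def release_seq(self, seq_id: str) -> int:
        """Return all blocks owned by a finished sequence to the free list."""
        freed = [b for b, s in self.owner.items() if s == seq_id]
        for b in freed:
            del self.owner[b]
        self.free.extend(freed)
        return len(freed)

    def shrink(self, n: int) -> list[int]:
        """Give up at most n free blocks; blocks in use are never touched."""
        k = min(n, len(self.free))
        return [self.free.pop() for _ in range(k)]

    def grow(self, n: int) -> list[int]:
        """Add n new blocks to the pool (assumes backing memory is available)."""
        new_ids = list(range(self._next_id, self._next_id + n))
        self._next_id += n
        self.free.extend(new_ids)
        return new_ids
```

In this sketch, `shrink` is safe to call at any time because it never touches in-use blocks. Shrinking below the number of blocks in use is exactly where an eviction policy would have to come in.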

In a toy continuous batching framework I built on a simple StaticCache mechanism (adapted to support continuous batching), I implemented a rudimentary dynamic memory adjustment strategy. While it's far from production-ready, it suggests that the approach has potential. I'm wondering whether there are heuristics that could correlate "current demand" with the "actual KV memory blocks needed".
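One possible heuristic, sketched below under stated assumptions, is to estimate the worst-case number of blocks each active sequence could need (tokens already cached plus its remaining generation budget), add some headroom, and clamp the result to configured bounds. `SeqDemand`, `blocks_needed`, and `resize_target` are hypothetical names, and the block size of 16 tokens and 20% headroom are illustrative values, not taken from my toy framework or from vLLM.

```python
import math
from dataclasses import dataclass


@dataclass
class SeqDemand:
    """Hypothetical per-sequence bookkeeping; not a vLLM type."""
    tokens_so_far: int   # prompt + generated tokens already in the KV cache
    max_new_tokens: int  # remaining generation budget


def blocks_needed(seqs, block_size=16, headroom=0.2):
    """Worst-case KV blocks to finish all sequences, plus fractional headroom."""
    raw = sum(
        math.ceil((s.tokens_so_far + s.max_new_tokens) / block_size)
        for s in seqs
    )
    return math.ceil(raw * (1 + headroom))


def resize_target(seqs, min_blocks, max_blocks, block_size=16, headroom=0.2):
    """Clamp the demand estimate to the allowed pool size range."""
    want = blocks_needed(seqs, block_size, headroom)
    return max(min_blocks, min(max_blocks, want))


# Example: two running sequences, pool allowed to range from 4 to 1024 blocks.
seqs = [SeqDemand(tokens_so_far=100, max_new_tokens=28),
        SeqDemand(tokens_so_far=10, max_new_tokens=6)]
target = resize_target(seqs, min_blocks=4, max_blocks=1024)
```

This estimate is deliberately pessimistic, since most sequences stop before exhausting their budget. A smoothed estimate over recent traffic might be a better fit for the idle case in scenario 1.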

Please let me know your thoughts, or feel free to correct me if I’ve misunderstood any internals. I'm happy to contribute or collaborate on any discussion around this idea.

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
