[Feature]: Elastic KV memory management based on usage #18125

@Spycsh

🚀 The feature, motivation and pitch

Hello there, I'm wondering whether it is possible to adjust KV cache memory usage dynamically at runtime, based on current system conditions and workload demand. Here are two scenarios:

  1. When users do not need to run inference for a while and GPU/CPU memory usage is low, the serving framework could release memory for other tasks (e.g., other AI workloads).

  2. When users need to handle many concurrent requests at runtime, the serving framework could increase its GPU memory utilization without a restart.

The problem is that mainstream serving frameworks such as vLLM, SGLang, and TGI typically pre-allocate a fixed fraction of GPU memory at server startup (in vLLM this is controlled by gpu_memory_utilization). As a result, the pre-allocated GPU memory cannot be adjusted without a restart.
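To make the effect of a fixed fraction concrete, here is a rough back-of-the-envelope sketch of how a startup memory budget turns into a fixed number of KV blocks. It is a simplification and not vLLM's actual profiling logic: the helper names, the `non_kv_bytes` figure (weights plus everything else), and the model shape are assumptions chosen for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values are both stored, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def num_kv_blocks(total_gpu_bytes, gpu_memory_utilization, non_kv_bytes,
                  bytes_per_token, block_size=16):
    """Blocks that fit in the budget left after non-KV memory is subtracted.

    Hypothetical helper: a real server measures non-KV usage by profiling.
    """
    budget = int(total_gpu_bytes * gpu_memory_utilization) - non_kv_bytes
    return max(budget, 0) // (bytes_per_token * block_size)


GIB = 2**30

# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Assumed 80 GiB device with 14 GiB of non-KV memory.
for util in (0.5, 0.9):
    print(util, num_kv_blocks(80 * GIB, util, 14 * GIB, per_token))
```

Once the server is running, the block count computed this way is fixed, which is why changing the fraction currently requires a restart.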

I suspect this is particularly challenging for vLLM, which depends on PagedAttention. As far as I understand, sequences are mapped onto paged KV memory blocks. In a low-usage scenario (like case 1), freeing memory would require identifying which KV memory blocks to evict and choosing an appropriate eviction policy or ratio. However, it’s not clear how to accurately correlate "current demand" with the "actual KV memory blocks needed", particularly in PagedAttention-based frameworks.
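The difficulty can be illustrated with a toy block pool. This is a bookkeeping-only sketch under my own assumptions: `ToyBlockPool` and its methods are hypothetical, it is not vLLM's block manager, and it does not return any real GPU memory. It only shows that blocks owned by live sequences cannot be given back without evicting those sequences.

```python
class ToyBlockPool:
    """Hypothetical paged-KV block pool that tracks ownership only."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.owner = {}                      # block id -> sequence id

    def allocate(self, seq_id, n):
        """Hand n free blocks to a sequence."""
        if n > len(self.free):
            raise MemoryError("not enough free blocks")
        blocks = [self.free.pop() for _ in range(n)]
        for b in blocks:
            self.owner[b] = seq_id
        return blocks

    def release(self, seq_id):
        """Return all blocks of a finished (or evicted) sequence to the free list."""
        freed = [b for b, s in self.owner.items() if s == seq_id]
        for b in freed:
            del self.owner[b]
        self.free.extend(freed)

    def shrink(self, n):
        """Drop up to n free blocks from the pool; return how many were dropped."""
        removed = min(n, len(self.free))
        del self.free[len(self.free) - removed:]
        return removed


pool = ToyBlockPool(8)
pool.allocate("seq-a", 3)
print(pool.shrink(10))   # only the 5 unowned blocks can go
pool.release("seq-a")
print(pool.shrink(10))   # the remaining 3 become reclaimable after release
```

In this sketch, shrinking below the number of blocks in use is only possible after `release`, which is where an eviction policy would have to decide which sequences to preempt.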

In a toy continuous batching framework I built on a simple StaticCache mechanism (adapted to support continuous batching), I implemented a rudimentary dynamic memory adjustment strategy. While it's far from production-ready, it suggests that the approach has potential. I'm wondering whether there are heuristics for correlating "current demand" with the "actual KV memory blocks needed".
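As a starting point for discussion, one possible heuristic is sketched below. It is not the strategy from my toy framework and not an existing vLLM feature. It estimates demand from the lengths of the running sequences plus a guessed number of tokens still to be generated, adds percentage headroom, and applies a deadband so the pool does not resize on every small fluctuation. All names and default values are hypothetical.

```python
import math


def blocks_needed(seq_lens, expected_new_tokens, block_size=16, headroom_pct=20):
    """Estimate the KV blocks needed for the currently running sequences.

    expected_new_tokens is an assumed per-sequence guess of remaining output.
    """
    demand = sum(math.ceil((n + expected_new_tokens) / block_size)
                 for n in seq_lens)
    # Integer ceiling of demand * (1 + headroom_pct / 100), avoiding float error.
    return (demand * (100 + headroom_pct) + 99) // 100


def resize_decision(current_blocks, target_blocks, deadband_pct=10):
    """Grow immediately; shrink only if the target is clearly below current."""
    if target_blocks > current_blocks:
        return target_blocks
    if target_blocks * 100 < current_blocks * (100 - deadband_pct):
        return target_blocks
    return current_blocks


target = blocks_needed([100, 500, 30], expected_new_tokens=128)
print(target)                        # 78
print(resize_decision(100, target))  # 78: well below current, so shrink
print(resize_decision(80, target))   # 80: inside the deadband, keep as is
print(resize_decision(60, target))   # 78: demand exceeds the pool, so grow
```

Growing eagerly and shrinking lazily is a deliberate choice in this sketch: under-provisioning stalls requests, while over-provisioning only delays giving memory back.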

Please let me know your thoughts, or feel free to correct me if I’ve misunderstood any internals. I'm happy to contribute or collaborate on any discussion around this idea.

Alternatives

No response

Additional context

No response

Labels: feature request, stale