🚀 The feature, motivation and pitch
Hello there, I'm wondering whether it is possible to dynamically adjust KV cache memory usage at runtime based on current system conditions and workload demand. For example, here are two scenarios:
- When users do not need to perform inference for a while and GPU/CPU memory usage is low, the serving framework could free up memory for other tasks (e.g., other AI workloads).
- When users need to handle many concurrent requests at runtime, the serving framework could increase GPU memory utilization without a restart.
The problem is that mainstream serving frameworks, such as vLLM, SGLang, and TGI, typically pre-allocate a fixed percentage of GPU memory at server startup (in vLLM this is controlled by gpu_memory_utilization). That means the pre-allocated GPU memory cannot be adjusted without a restart.
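To make the problem concrete, the fixed budget can be thought of roughly like this. This is a simplified sketch, not vLLM's actual accounting; the function name, the formula, and all parameters are illustrative assumptions:

```python
def kv_block_budget(total_gpu_bytes: int,
                    gpu_memory_utilization: float,
                    model_and_activation_bytes: int,
                    bytes_per_block: int) -> int:
    """Illustrative only: roughly how many KV cache blocks a fixed
    startup pre-allocation yields. Real accounting is more involved."""
    reserved = int(total_gpu_bytes * gpu_memory_utilization)
    kv_bytes = max(0, reserved - model_and_activation_bytes)
    return kv_bytes // bytes_per_block

# e.g. 80 GB GPU, 90% utilization, 40 GB for weights/activations, 2 MiB blocks
budget = kv_block_budget(80 * 1024**3, 0.9, 40 * 1024**3, 2 * 1024**2)
```

Because this budget is computed once at startup, changing it later would require re-carving the block pool, which is exactly what today's frameworks do not support without a restart.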
I believe this is a particular challenge for vLLM, which depends on PagedAttention. As far as I understand, sequences are mapped into paged KV memory blocks. In a low-usage scenario (like case 1 above), freeing up memory would require identifying which KV memory blocks to evict and choosing an appropriate eviction policy or ratio. However, it's not clear how to accurately correlate "current demand" with the "actual KV memory blocks needed", particularly in PagedAttention-based frameworks.
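One naive way to frame the demand-to-blocks correlation is to estimate blocks from the active sequence lengths plus some headroom. This is purely a sketch of the idea; the block size and the fixed-headroom policy are my own assumptions, not vLLM internals:

```python
import math

def blocks_needed(active_seq_lens, block_size=16, headroom_blocks=64):
    """Estimate the KV blocks demanded by the currently active
    sequences, plus fixed headroom so they can keep generating."""
    needed = sum(math.ceil(n / block_size) for n in active_seq_lens)
    return needed + headroom_blocks

def evictable_blocks(allocated_blocks, active_seq_lens,
                     block_size=16, headroom_blocks=64):
    """Blocks allocated beyond the estimated demand are candidates
    for release back to the system."""
    return max(0, allocated_blocks
               - blocks_needed(active_seq_lens, block_size, headroom_blocks))
```

The hard part, of course, is that real demand is bursty, so a static headroom is a poor predictor; any practical heuristic would need to account for prefix caching, preemption, and expected request arrival rates.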
In a toy continuous batching framework I built on a simple StaticCache (adapted to support continuous batching), I implemented a rudimentary dynamic memory adjustment strategy. While it's far from production-ready, it suggests this approach has potential. I'm wondering whether there are heuristic ways to correlate "current demand" with the "actual KV memory blocks needed".
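The kind of adjustment loop I have in mind looks roughly like the following. This is a simplified sketch of one possible heuristic, not the exact code from my framework; the thresholds, decay rate, and hysteresis are arbitrary choices, not tuned values:

```python
class DynamicKVPool:
    """Toy controller: grow the pool when utilization is high, shrink it
    when a decaying peak of recent demand falls well below capacity.
    The decaying peak acts as hysteresis to avoid thrashing."""

    def __init__(self, capacity, min_blocks=128,
                 grow_factor=1.5, shrink_trigger=0.3):
        self.capacity = capacity          # current pool size, in blocks
        self.min_blocks = min_blocks      # never shrink below this
        self.grow_factor = grow_factor
        self.shrink_trigger = shrink_trigger
        self.peak_demand = 0

    def step(self, blocks_in_use):
        # Track a slowly decaying peak of recent demand.
        self.peak_demand = max(blocks_in_use, int(self.peak_demand * 0.95))
        if blocks_in_use >= self.capacity * 0.9:
            # High pressure: grow (unbounded in this toy version).
            self.capacity = int(self.capacity * self.grow_factor)
        elif self.peak_demand < self.capacity * self.shrink_trigger:
            # Sustained low demand: release memory back to the system.
            self.capacity = max(self.min_blocks, self.peak_demand * 2)
        return self.capacity
```

For example, a pool of 1000 blocks seeing 950 in use grows immediately, while a long idle stretch gradually walks capacity back down to the floor. The open question is what should actually drive `blocks_in_use` forecasts, which is where I'd love input from people familiar with the block allocator.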
Please let me know your thoughts, or feel free to correct me if I’ve misunderstood any internals. I'm happy to contribute or collaborate on any discussion around this idea.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.