# Tutorial: KV cache reuse with llama-server #13606
Replies: 4 comments 11 replies
- This is great! I have one concern however: the Warning […]
- If you provide repro steps for this problem, we can take a look and fix it.
- Can we cache context in the HTTP client? The tutorial for cache prompt […]
- My issue with this feature is that if you have 10 slots and 11 different prompts cycling through them, it will never reuse the cache, if I understood how it works correctly.
---
This tutorial demonstrates how to use the slots management feature in `llama-server` to optimize repeated prompt processing through KV cache reuse.

## Server Setup

Start the server with the desired number of slots:
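A minimal sketch of the launch command (the model path and port are illustrative placeholders):

```bash
# 1024-token context split across 2 parallel slots (512 tokens each)
llama-server -m ./models/model.gguf -c 1024 -np 2 --port 8080
```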
Key parameters:

- `-c 1024 -np 2`: creates 2 processing slots with (1024 / 2) = 512 context tokens each

Upon startup, the logs should indicate the initialization of the two slots.
By default, `llama-server` attempts to assign a slot to a new request based on prompt similarity. The `-sps` parameter controls this behavior: a value of 0.5 (the default) means a slot is considered a match if at least 50% of the prompt context matches.

Automatic slot selection can be disabled by setting `-sps 0.0`, as in the sketch below. The slot to use can then be specified explicitly via the `"id_slot"` parameter in the request body.
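An illustrative launch line with automatic selection disabled (same placeholder model path as before):

```bash
# Disable similarity-based slot selection; slots are then picked via "id_slot"
llama-server -m ./models/model.gguf -c 1024 -np 2 --port 8080 -sps 0.0
```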
## Prompt caching

### Basic Request Example

Here's a sample curl request:
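The snippet below is a sketch; the prompt text, port, and `n_predict` value are placeholders:

```bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "You are a helpful assistant.\n\nUser: Explain KV cache reuse in one sentence.\nAssistant:",
    "n_predict": 64,
    "cache_prompt": true
  }'
```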
The parameter `cache_prompt: true` (the default) instructs the server to cache the prompt. While redundant in this case, it's good practice to be explicit.

### Server Logs & Slot Usage

When the request is processed, the server logs show details about the slot used, the context tokens matched, and any KV cache eviction. Execution logs for the first query confirm full prompt processing.
This indicates that slot 0 was used, 0 tokens were matched (`n_past - n_tokens`, or `1 - progress` as a fraction), and the slot's KV cache was fully cleared before processing the prompt.

Subsequent identical requests reuse the cached context on slot 0 (with the default value of 0.5 for the `-sps` argument). This should result in significantly faster processing times, as the KV cache is reused.
## Manual Slot Assignment

Force a specific slot by including `id_slot` in the request, as in the sketch below. In this scenario, the prompt is processed from scratch in slot 1.
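A sketch reusing the request above with an explicit slot:

```bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "You are a helpful assistant.\n\nUser: Explain KV cache reuse in one sentence.\nAssistant:",
    "n_predict": 64,
    "cache_prompt": true,
    "id_slot": 1
  }'
```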
## Partial Context Reuse

If you submit a slightly modified request (e.g. keeping the system prompt the same but changing the user query in the example above), the server will leverage the existing cache in slot 1 for the common parts of the prompt and only process the new tokens.
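Continuing the sketch, only the user turn changes, so the shared prefix can be served from slot 1's cache:

```bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "You are a helpful assistant.\n\nUser: Why does prompt order affect cache reuse?\nAssistant:",
    "n_predict": 64,
    "cache_prompt": true,
    "id_slot": 1
  }'
```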
## Slot Persistence

This feature depends on the `--slots` argument to the server, which enables the `/slots` API endpoint.

> [!WARNING]
> As per the server docs, this endpoint may change in the future and could pose a security risk on production systems. It is advised to consider this only if necessary, and in secure or air-gapped setups.
Key parameters:

- `--slots`: enables the `/slots` API endpoint
- `--slot-save-path`: directory for persistent slot storage

### Save slot 0's KV cache
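A sketch of the save call (the filename is a placeholder, resolved relative to `--slot-save-path`):

```bash
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "slot0.bin"}'
```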
### Restore to slot 1
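And the corresponding restore call (same placeholder filename):

```bash
curl -X POST "http://localhost:8080/slots/1?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "slot0.bin"}'
```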
Now slot 1 will contain the pre-computed KV cache, leading to faster response times. Subsequent requests will use the restored cache until the context changes significantly or the cache is explicitly cleared. You’ll see log output similar to that of a reused slot.
## Implementation Considerations

### Use Cases

### Optimization Guidelines

**Slot Management:**

**Performance:**
## Notes

- The `-sps` parameter doesn't always behave predictably, with inconsistent and unexpected slot switching observed. Further investigation is needed.

Would welcome discussion from devs/maintainers about best practices and potential improvements.