# Tutorial: KV cache reuse with llama-server #13606
Replies: 4 comments 11 replies
- This is great! I have one concern however: the Warning […]
- If you provide repro steps for this problem, we can take a look and fix it.
- Can we cache context in the HTTP client? The tutorial for cache prompt […]
- My issue with this feature is that if you have 10 slots and 11 different prompts cycling through them, it will never reuse the cache, if I understood how it works correctly.
---
This tutorial demonstrates how to use the slots management feature in `llama-server` to optimize repeated prompt processing through KV cache reuse.

## Server Setup

Start the server with the desired number of slots:
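A minimal sketch of the launch command (the model path and port are illustrative placeholders):

```bash
# 1024-token context split across 2 parallel slots (512 tokens each)
llama-server -m ./models/model.gguf -c 1024 -np 2 --port 8080
```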
Key parameters:

- `-c 1024 -np 2`: creates 2 processing slots with (1024 / 2) = 512 context tokens each

Upon startup, the logs should indicate the initialization of the two slots.
By default, `llama-server` attempts to assign a slot to a new request based on prompt similarity. The `-sps` parameter controls this behavior: a value of 0.5 (the default) means a slot is considered a match if at least 50% of the prompt context matches.

Automatic slot selection can be disabled by setting `-sps 0.0`, as in the sketch below. The slot to use can then be specified explicitly via the `"id_slot"` parameter in the request body.
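An illustrative launch line with automatic selection disabled (same placeholder model path as before):

```bash
# Disable similarity-based slot selection; slots are then picked via "id_slot"
llama-server -m ./models/model.gguf -c 1024 -np 2 --port 8080 -sps 0.0
```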
## Prompt caching

### Basic Request Example

Here's a sample curl request:
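The snippet below is a sketch; the prompt text, port, and `n_predict` value are placeholders:

```bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "You are a helpful assistant.\n\nUser: Explain KV cache reuse in one sentence.\nAssistant:",
    "n_predict": 64,
    "cache_prompt": true
  }'
```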
The parameter `cache_prompt: true` (the default) instructs the server to cache the prompt. While redundant in this case, it's good practice to be explicit.

### Server Logs & Slot Usage

When the request is processed, the server logs show details about the slot used, the context tokens matched, and any KV cache eviction. Execution logs for the first query confirm full prompt processing.
This indicates that slot 0 was used, 0 tokens were matched (`n_past - n_tokens`, or `1 - progress` as a fraction), and the slot's KV cache was fully cleared before processing the prompt.

Subsequent identical requests reuse the cached context on slot 0 (with the default value of 0.5 for the `-sps` argument). This should result in significantly faster processing times, as the KV cache is reused.
## Manual Slot Assignment

Force a specific slot by including `id_slot` in the request, as in the sketch below. In this scenario, the prompt is processed from scratch in slot 1.
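A sketch reusing the request above with an explicit slot:

```bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "You are a helpful assistant.\n\nUser: Explain KV cache reuse in one sentence.\nAssistant:",
    "n_predict": 64,
    "cache_prompt": true,
    "id_slot": 1
  }'
```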
## Partial Context Reuse

If you submit a slightly modified request (e.g. keeping the system prompt the same but changing the user query in the example above), the server will leverage the existing cache in slot 1 for the common parts of the prompt and only process the new tokens.
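Continuing the sketch, only the user turn changes, so the shared prefix can be served from slot 1's cache:

```bash
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "You are a helpful assistant.\n\nUser: Why does prompt order affect cache reuse?\nAssistant:",
    "n_predict": 64,
    "cache_prompt": true,
    "id_slot": 1
  }'
```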
## Slot Persistence

This feature depends on the `--slots` argument to the server, which enables the `/slots` API endpoint.

> [!WARNING]
> As per the server docs, this endpoint may change in the future and could pose a security risk on production systems. It is advised to consider this only if necessary, and in secure or air-gapped setups.
Key parameters:

- `--slots`: enables the `/slots` API endpoint
- `--slot-save-path`: directory for persistent slot storage

### Save slot 0's KV cache
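A sketch of the save call (the filename is a placeholder, resolved relative to `--slot-save-path`):

```bash
curl -X POST "http://localhost:8080/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "slot0.bin"}'
```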
### Restore to slot 1
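And the corresponding restore call (same placeholder filename):

```bash
curl -X POST "http://localhost:8080/slots/1?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "slot0.bin"}'
```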
Now slot 1 will contain the pre-computed KV cache, leading to faster response times. Subsequent requests will use the restored cache until the context changes significantly or the cache is explicitly cleared. You’ll see log output similar to that of a reused slot.
## Implementation Considerations

### Use Cases

### Optimization Guidelines

**Slot Management:**

**Performance:**
## Notes

- The `-sps` parameter doesn't always behave predictably, with inconsistent and unexpected slot switching observed. Further investigation is needed.

Would welcome discussion from devs/maintainers about best practices and potential improvements.