
Cache misses previous generation #1369

Open
@ultoris

Description

Expected Behavior

The server should cache both the previous prompt and the last generation.

Current Behavior

The cache misses at the end of the previous prompt, forcing the server to re-evaluate the previous answer in full.
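
For reference, my mental model of the prefix cache is that it compares the cached token sequence with the newly tokenized prompt and only evaluates the tokens after the longest common prefix. A rough sketch of that idea (not the actual llama-cpp-python code, just an illustration with made-up token values):

    def longest_common_prefix(cached_tokens, new_tokens):
        # Count how many leading tokens the cached context and the new prompt share.
        n = 0
        for a, b in zip(cached_tokens, new_tokens):
            if a != b:
                break
            n += 1
        return n

    # A single differing token early in the sequence discards everything after it,
    # which is what seems to happen here at the end of the previous prompt.
    cached = [128000, 128006, 9125, 128007, 271, 9906]
    new    = [128000, 128006, 9125, 128007, 198, 198, 9906]
    n = longest_common_prefix(cached, new)
    print(n, new[n:])  # tokens from index n onward must be evaluated again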

Environment and Context

I'm interfacing with the llama-cpp-python server through a SillyTavern instance running in OpenAI-compatible mode.
System is Linux orangepi5 6.1.43-rockchip-rk3588 #1.1.8 SMP Fri Feb 2 21:16:10 CST 2024 aarch64 GNU/Linux with
Python 3.11.7,
GNU Make 4.3 and
g++ (Debian 12.2.0-14) 12.2.0

llama-cpp-python: commit 0281214
fastapi 0.110.2
numpy 1.26.4
sse-starlette 2.1.0
uvicorn 0.29.0
vendor/llama.cpp: commit 0e4802b2ecbaab04b4f829fde4a3096ca19c84b5
Author: loonerin [email protected]
Date: Fri Apr 19 13:03:35 2024 -0400

Model metadata as reported by llama-cpp-python:

Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'llama.context_length': '8192', 'general.name': 'llama-3-8b-Instruct', 'llama.vocab_size': '128256', 'general.file_type': '15', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
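
To see exactly what prompt text this template produces for a given message list, it can be rendered by hand. A minimal sketch (it assumes jinja2, which I believe llama-cpp-python also uses internally, and copies the template string above verbatim):

    from jinja2 import Template

    chat_template = Template(
        "{% set loop_messages = messages %}{% for message in loop_messages %}"
        "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
        "+ message['content'] | trim + '<|eot_id|>' %}"
        "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
        "{{ content }}{% endfor %}"
        "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
    )

    messages = [
        {"role": "system", "content": "[Start a new Chat]"},
        {"role": "assistant", "content": "Hello"},
        {"role": "user", "content": "Write one word"},
    ]
    # repr() makes the '\n\n' after each header visible.
    print(repr(chat_template.render(messages=messages, bos_token="<|begin_of_text|>")))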

Failure Information (for bugs)

It seems that the last token of the previously cached prompt is replaced by two different tokens when the next request is tokenized.
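
One way to check this directly is to tokenize the suspect boundary text with the same model and see how it splits. A sketch of that check (I'm assuming vocab_only=True is enough for tokenize/detokenize; otherwise load the model as usual):

    from llama_cpp import Llama

    llm = Llama(
        model_path="../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf",
        vocab_only=True,   # only the tokenizer is needed for this check
        verbose=False,
    )

    # The cache ends right after the assistant header of the previous turn; the next
    # request re-tokenizes that boundary together with the generated answer.
    boundary = b"<|start_header_id|>assistant<|end_header_id|>\n\nHello"
    toks = llm.tokenize(boundary, add_bos=False, special=True)
    print(toks)
    for t in toks:
        print(t, repr(llm.detokenize([t])))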

Steps to Reproduce

  1. Run the server with: python3 -m llama_cpp.server --cache true --n_ctx 8192 --seed 0 --n_threads 4 --n_threads_batch 4 --model ../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf --port 8080 --verbose true --cache_type ram --use_mlock true
  2. Send the message "Write one word". SillyTavern sends the following request (captured through tcpdump):
    {"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
    The last streamed chunks are:

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": ":\n\n"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "logprobs": null, "finish_reason": "stop"}]}

  3. Send the message "Hi". SillyTavern sends the following request (captured through tcpdump):
    {"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"},{"role":"assistant","content":"Hello! You said \"one word\", so I'll respond with:\n\nHello"},{"role":"user","content":"Hi"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
  4. I added the following debug prints to generate() in llama.py:
                if self.verbose:
                    # prefix-match hit (longest_prefix / new prompt length / cached tokens)
                    print("Llama.generate: prefix-match hit ({pre}/{prompt}/{total}).".format(
                        pre=longest_prefix, total=self.n_tokens, prompt=len(tokens)))
                    # Full new prompt, the part that matched the cache, and the part that missed.
                    print("Llama.generate: prefix-prompt ", repr(self.detokenize(tokens)), file=sys.stderr)
                    print("Llama.generate: prefix-match: ", repr(self.detokenize(self._input_ids[:longest_prefix])), file=sys.stderr)
                    print("Llama.generate: prefix-miss: ", repr(self.detokenize(self._input_ids[longest_prefix:])), file=sys.stderr)

                    # Side-by-side dump: cached token ids vs. the newly tokenized prompt.
                    for i, p in enumerate(zip(self._input_ids, tokens)):
                        print("{idx: <8}{a: <8}{b: <8}".format(idx=i, a=p[0], b=p[1]), file=sys.stderr)

These prints give the following output:

Llama._create_completion: cache saved
Llama.generate: prefix-match hit (31/61/47).
Llama.generate: prefix-prompt  b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant\n\nHello! You said "one word", so I\'ll respond with:\n\nHellouser\n\nHiassistant\n\n'
Llama.generate: prefix-match:  b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant'
Llama.generate: prefix-miss:  b'\n\nHello! You said "one word", so I\'ll respond with:\n\nHello'
0       128000  128000
1       128006  128006
2       9125    9125
3       128007  128007
4       271     271
5       58      58
6       3563    3563
7       264     264
8       502     502
9       13149   13149
10      60      60
11      128009  128009
12      128006  128006
13      78191   78191
14      128007  128007
15      198     198
16      198     198
17      9906    9906
18      128009  128009
19      128006  128006
20      882     882
21      128007  128007
22      198     198
23      198     198
24      8144    8144
25      832     832
26      3492    3492
27      128009  128009
28      128006  128006
29      78191   78191
30      128007  128007
31      271     198
32      9906    198
33      0       9906
34      1472    0
35      1071    1472
36      330     1071
37      606     330
38      3492    606
39      498     3492
40      779     498
41      358     779
42      3358    358
43      6013    3358
44      449     6013
45      1473    449
46      9906    512

As can be seen, the cached token at position 31 (271, i.e. '\n\n') is replaced by two 198 tokens (two single '\n' tokens) in the re-tokenized prompt, so the cache stops matching the prompt at that point and everything after it is evaluated again. However, I cannot see any difference in the messages themselves nor in the detokenized strings.
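
To confirm that reading of the table, the individual tokens can be detokenized; a small sketch (same vocab-only load as in the earlier sketch, and the same caveat applies):

    from llama_cpp import Llama

    llm = Llama(
        model_path="../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf",
        vocab_only=True,
        verbose=False,
    )
    # If the reading above is right, both lines print the same bytes (b'\n\n'),
    # i.e. the text is identical and only the token split differs.
    print(repr(llm.detokenize([271])))
    print(repr(llm.detokenize([198, 198])))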
