Description
Expected Behavior
The server should cache both the previous prompt and the last generation, so that the next request only needs to evaluate the newly appended tokens.
Current Behavior
The cache misses at the end of the previous prompt, forcing the server to re-evaluate the previous answer in full.
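My understanding of the reuse logic (a minimal illustrative sketch with a hypothetical helper, not the actual llama-cpp-python code) is that the cached state can only be reused up to the longest common token prefix between the cached tokens and the freshly tokenized prompt, so a single differing token right before the previous answer throws the whole answer away:

def longest_common_prefix(cached_tokens, prompt_tokens):
    # Count how many leading tokens agree; only this many can be served from the cache.
    n = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n += 1
    return n

# Token IDs taken from the dump further below (positions 28 onward).
cached = [128006, 78191, 128007, 271, 9906]       # ..., <|end_header_id|>, "\n\n", "Hello"
prompt = [128006, 78191, 128007, 198, 198, 9906]  # ..., <|end_header_id|>, "\n", "\n", "Hello"
print(longest_common_prefix(cached, prompt))      # 3 -> everything from the "\n\n" on is re-evaluated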
Environment and Context
I'm interfacing with the llama-cpp-python server through a SillyTavern instance running in OpenAI-compatible mode.
- System: Linux orangepi5 6.1.43-rockchip-rk3588 #1.1.8 SMP Fri Feb 2 21:16:10 CST 2024 aarch64 GNU/Linux
- Toolchain: Python 3.11.7, GNU Make 4.3, g++ (Debian 12.2.0-14) 12.2.0
- llama-cpp-python: commit 0281214
- Python packages: fastapi 0.110.2, numpy 1.26.4, sse-starlette 2.1.0, uvicorn 0.29.0
- vendor/llama.cpp: commit 0e4802b2ecbaab04b4f829fde4a3096ca19c84b5 (Author: loonerin [email protected], Date: Fri Apr 19 13:03:35 2024 -0400)
Model metadata as reported by llama-cpp-python:
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'llama.context_length': '8192', 'general.name': 'llama-3-8b-Instruct', 'llama.vocab_size': '128256', 'general.file_type': '15', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>
' }}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
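For reference, rendering the reported chat template directly (a sketch using a plain jinja2 Template rather than the server's internal chat handler) shows that, at the text level, the second request's prompt is an exact extension of the first one plus the previous answer, which is why I would expect a long prefix-cache hit:

from jinja2 import Template

# The template string exactly as reported in the model metadata above.
chat_template = (
    "{% set loop_messages = messages %}{% for message in loop_messages %}"
    "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
    "+ message['content'] | trim + '<|eot_id|>' %}"
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
    "{{ content }}{% endfor %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
)

first_request = [
    {"role": "system", "content": "[Start a new Chat]"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Write one word"},
]
second_request = first_request + [
    {"role": "assistant", "content": 'Hello! You said "one word", so I\'ll respond with:\n\nHello'},
    {"role": "user", "content": "Hi"},
]

for messages in (first_request, second_request):
    print(repr(Template(chat_template).render(messages=messages, bos_token="<|begin_of_text|>")))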
Failure Information (for bugs)
It seems that the last token of the previous prompt (the "\n\n" after the assistant header, id 271) is tokenized as two different tokens (198 198) in the next request, so the cached prefix stops matching right before the previous answer.
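A standalone way to check this hypothesis (sketch only; it assumes Llama.tokenize with special=True behaves like the tokenization the server performs on the rendered prompt) is to tokenize the assistant header with and without the previous answer appended and compare the trailing tokens:

from llama_cpp import Llama

# Load only the vocabulary to compare tokenizations without the full weights.
llm = Llama(model_path="../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf", vocab_only=True)

header = b"<|start_header_id|>assistant<|end_header_id|>\n\n"
answer = b'Hello! You said "one word", so I\'ll respond with:\n\nHello'

# The first request's prompt ends with the bare header;
# the second request contains the header followed by the previous answer.
print(llm.tokenize(header, add_bos=False, special=True))
print(llm.tokenize(header + answer, add_bos=False, special=True))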
Steps to Reproduce
- Run the server with: python3 -m llama_cpp.server --cache true --n_ctx 8192 --seed 0 --n_threads 4 --n_threads_batch 4 --model ../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf --port 8080 --verbose true --cache_type ram --use_mlock true
- Send the message "Write one word". SillyTavern sends the following request (captured through tcpdump; an equivalent standalone replay script is sketched at the end of this report):
{messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
The last streamed messages are:
data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": ":\n\n"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "logprobs": null, "finish_reason": "stop"}]}
- Send the message "Hi". SillyTavern sends the following request (captured through tcpdump):
{"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"},{"role":"assistant","content":"Hello! You said \"one word\", so I'll respond with:\n\nHello"},{"role":"user","content":"Hi"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
- I've added the following debug prints to Llama.generate() in llama.py:
if self.verbose:
    # Dump the prefix-match statistics and the cached tokens next to the freshly tokenized prompt.
    print("Llama.generate: prefix-match hit ({pre}/{prompt}/{total}).".format(
        pre=longest_prefix, prompt=len(tokens), total=self.n_tokens), file=sys.stderr)
    print("Llama.generate: prefix-prompt ", repr(self.detokenize(tokens)), file=sys.stderr)
    print("Llama.generate: prefix-match: ", repr(self.detokenize(self._input_ids[:longest_prefix])), file=sys.stderr)
    print("Llama.generate: prefix-miss: ", repr(self.detokenize(self._input_ids[longest_prefix:])), file=sys.stderr)
    for i, p in enumerate(zip(self._input_ids, tokens)):
        print("{idx: <8}{a: <8}{b: <8}".format(idx=i, a=p[0], b=p[1]), file=sys.stderr)
These give the following output:
Llama._create_completion: cache saved
Llama.generate: prefix-match hit (31/61/47).
Llama.generate: prefix-prompt b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant\n\nHello! You said "one word", so I\'ll respond with:\n\nHellouser\n\nHiassistant\n\n'
Llama.generate: prefix-match: b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant'
Llama.generate: prefix-miss: b'\n\nHello! You said "one word", so I\'ll respond with:\n\nHello'
0 128000 128000
1 128006 128006
2 9125 9125
3 128007 128007
4 271 271
5 58 58
6 3563 3563
7 264 264
8 502 502
9 13149 13149
10 60 60
11 128009 128009
12 128006 128006
13 78191 78191
14 128007 128007
15 198 198
16 198 198
17 9906 9906
18 128009 128009
19 128006 128006
20 882 882
21 128007 128007
22 198 198
23 198 198
24 8144 8144
25 832 832
26 3492 3492
27 128009 128009
28 128006 128006
29 78191 78191
30 128007 128007
31 271 198
32 9906 198
33 0 9906
34 1472 0
35 1071 1472
36 330 1071
37 606 330
38 3492 606
39 498 3492
40 779 498
41 358 779
42 3358 358
43 6013 3358
44 449 6013
45 1473 449
46 9906 512
As can be seen, at position 31 the single cached token 271 ("\n\n") is replaced by two 198 tokens ("\n" "\n"), so the cached prefix stops matching exactly where the previous answer begins. However, I cannot see any difference in the messages or in the detokenized strings.
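For convenience, the two captured requests can be replayed without SillyTavern with a small client like the one below (a sketch only; streaming is disabled for brevity, and the payload fields are taken from the tcpdump captures above):

import requests

URL = "http://localhost:8080/v1/chat/completions"
MODEL = "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true"

def chat(messages):
    # Same payload as the captured requests, but with streaming disabled.
    r = requests.post(URL, json={
        "messages": messages, "model": MODEL, "temperature": 1, "max_tokens": 2048,
        "stream": False, "presence_penalty": 0, "frequency_penalty": 0,
        "top_p": 1, "logit_bias": {}, "seed": 0,
    })
    return r.json()["choices"][0]["message"]["content"]

history = [
    {"role": "system", "content": "[Start a new Chat]"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Write one word"},
]
first = chat(history)                        # the server saves prompt + generation to the cache here
history += [{"role": "assistant", "content": first},
            {"role": "user", "content": "Hi"}]
second = chat(history)                       # the partial prefix-match hit (31/61/47) should show up in the server log here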