Cache misses previous generation #1369

Open
ultoris opened this issue Apr 21, 2024 · 7 comments

Comments

@ultoris

ultoris commented Apr 21, 2024

Expected Behavior

The server should cache both the previous prompt and the last generation.

Current Behavior

The cache misses at the end of the previous prompt, forcing the server to re-evaluate the previous answer in full.
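
For context, here is a minimal illustrative sketch of the prefix reuse being described (not the actual llama-cpp-python code; the token IDs are made up): the next request's prompt should begin with the cached tokens, so only the suffix after the longest shared prefix needs to be evaluated.

from typing import List

def longest_common_prefix(cached: List[int], prompt: List[int]) -> int:
    """Count how many leading tokens the cache shares with the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs: previous prompt plus previous generation, then the next prompt.
cached_tokens = [128000, 9906, 271, 15339]
next_prompt   = [128000, 9906, 271, 15339, 13347]

n_reused = longest_common_prefix(cached_tokens, next_prompt)
print(f"reuse {n_reused} cached tokens, evaluate {len(next_prompt) - n_reused} new ones")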

Environment and Context

I'm interfacing with the llama-cpp-python server through a SillyTavern instance running in OpenAI-compatible mode.
System is Linux orangepi5 6.1.43-rockchip-rk3588 #1.1.8 SMP Fri Feb 2 21:16:10 CST 2024 aarch64 GNU/Linux with
Python 3.11.7,
GNU Make 4.3 and
g++ (Debian 12.2.0-14) 12.2.0

llama-cpp-python: commit 0281214
fastapi 0.110.2
numpy 1.26.4
sse-starlette 2.1.0
uvicorn 0.29.0
vendor/llama.cpp: commit 0e4802b2ecbaab04b4f829fde4a3096ca19c84b5
Author: loonerin [email protected]
Date: Fri Apr 19 13:03:35 2024 -0400

Model metadata as reported by llama-cpp-python:

Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'llama.context_length': '8192', 'general.name': 'llama-3-8b-Instruct', 'llama.vocab_size': '128256', 'general.file_type': '15', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
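
For reference, a minimal sketch of how this gguf chat template turns the request messages into the prompt text, using jinja2 directly (llama-cpp-python renders it through its own chat formatter, so treat this only as an approximation of the string that gets tokenized and cached):

from jinja2 import Template

# Template text copied from the gguf metadata above; the \n\n become real newlines.
chat_template = (
    "{% set loop_messages = messages %}{% for message in loop_messages %}"
    "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
    "+ message['content'] | trim + '<|eot_id|>' %}"
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
    "{{ content }}{% endfor %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
)

# Messages taken from the reproduction steps below.
messages = [
    {"role": "system", "content": "[Start a new Chat]"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Write one word"},
]

prompt = Template(chat_template).render(messages=messages, bos_token="<|begin_of_text|>")
print(prompt)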

Failure Information (for bugs)

It seems that the last cached token is replaced by two different tokens in the next request.

Steps to Reproduce

  1. Run the server with: python3 -m llama_cpp.server --cache true --n_ctx 8192 --seed 0 --n_threads 4 --n_threads_batch 4 --model ../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf --port 8080 --verbose true --cache_type ram --use_mlock true
  2. Send the message "Write one word". SillyTavern sends the following request (captured through tcpdump):
    {"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
    the last streamed messages are:

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": ":\n\n"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "logprobs": null, "finish_reason": "stop"}]}

  3. Send the message "Hi". SillyTavern sends the following request (captured through tcpdump):
    {"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"},{"role":"assistant","content":"Hello! You said \"one word\", so I'll respond with:\n\nHello"},{"role":"user","content":"Hi"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
  4. I've added the following prints to generate() in llama.py:
                if self.verbose:
                    # prefix length / new prompt length / tokens already in the context
                    print("Llama.generate: prefix-match hit ({pre}/{prompt}/{total}).".format(
                        pre=longest_prefix, prompt=len(tokens), total=self.n_tokens))
                    print("Llama.generate: prefix-prompt ", repr(self.detokenize(tokens)), file=sys.stderr)
                    print("Llama.generate: prefix-match: ", repr(self.detokenize(self._input_ids[:longest_prefix])), file=sys.stderr)
                    print("Llama.generate: prefix-miss: ", repr(self.detokenize(self._input_ids[longest_prefix:])), file=sys.stderr)

                    # side-by-side dump: cached token IDs vs. the new prompt's token IDs
                    for i, (a, b) in enumerate(zip(self._input_ids, tokens)):
                        print("{idx: <8}{a: <8}{b: <8}".format(idx=i, a=a, b=b), file=sys.stderr)

This gives the following output:

Llama._create_completion: cache saved
Llama.generate: prefix-match hit (31/61/47).
Llama.generate: prefix-prompt  b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant\n\nHello! You said "one word", so I\'ll respond with:\n\nHellouser\n\nHiassistant\n\n'
Llama.generate: prefix-match:  b'ystem\n\n[Start a new Chat]assistant\n\nHellouser\n\nWrite one wordassistant'
Llama.generate: prefix-miss:  b'\n\nHello! You said "one word", so I\'ll respond with:\n\nHello'
0       128000  128000
1       128006  128006
2       9125    9125
3       128007  128007
4       271     271
5       58      58
6       3563    3563
7       264     264
8       502     502
9       13149   13149
10      60      60
11      128009  128009
12      128006  128006
13      78191   78191
14      128007  128007
15      198     198
16      198     198
17      9906    9906
18      128009  128009
19      128006  128006
20      882     882
21      128007  128007
22      198     198
23      198     198
24      8144    8144
25      832     832
26      3492    3492
27      128009  128009
28      128006  128006
29      78191   78191
30      128007  128007
31      271     198
32      9906    198
33      0       9906
34      1472    0
35      1071    1472
36      330     1071
37      606     330
38      3492    606
39      498     3492
40      779     498
41      358     779
42      3358    358
43      6013    3358
44      449     6013
45      1473    449
46      9906    512

As can be seen, cached token 31 (271) is replaced by two tokens (198, 198) in the new prompt, so the cache does not match the prompt. However, I cannot see any difference in the messages or in the detokenized strings.
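
This looks like a re-tokenization boundary issue rather than an actual prompt difference: judging by the table, position 31 holds the single-token spelling of "\n\n" (271) in the cache but the two-token spelling ("\n" + "\n", i.e. 198, 198) in the re-tokenized prompt, and both spellings detokenize to the same bytes, which is why the messages and detokenized strings look identical. A small check along these lines (a sketch only; the model path is the one from the reproduction and the token IDs are the ones from the table):

from llama_cpp import Llama

llm = Llama(model_path="../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf")

cached_span = [271]       # cache contents at position 31, per the table
prompt_span = [198, 198]  # new prompt's tokens at positions 31-32, per the table

print(llm.detokenize(cached_span))   # expected: b'\n\n'
print(llm.detokenize(prompt_span))   # expected: the same b'\n\n'

# The IDs differ even though the bytes are the same, so the longest-prefix
# match stops at index 31 and the whole previous answer is evaluated again.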

@Liquorice10113

Same here, and not just Llama 3; all other models have this issue after a recent update of llama.cpp.

@woheller69

I also notice that the second prompt is slower than it is in, e.g., gpt4all with an otherwise identical setup.

@Vaskivo

Vaskivo commented Jun 21, 2024

I can confirm this. I suspect that the issue started with version 0.2.77.

I've booted the server and ran this curl multiple times:

curl --location 'http://127.0.0.1:8080/v1/completions' \
--header 'Content-Type: application/json' \
--data '{
  "prompt": "<BIG PROMPT BUT PROPERLY FORMATTED>",
  "max_tokens": 2048,
  "temperature": 0,
  "stream": false
}'

(And I don't know if this is normal, but I notice that the very first call produces a different output from the subsequent ones. With temperature=0 I was expecting them all to be identical.)

[EDIT] Ignore my comment. I was looking at raw times instead of token/second.

@woheller69

Maybe the \n(s) are stripped off.
See Maximilian-Winter/llama-cpp-agent#73
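
One way to test that hypothesis (a rough sketch; the model path and the answer text are taken from the reproduction above): tokenize the assistant header together with the previous answer, and compare against the header and the answer tokenized separately. If the tokens around the "\n\n" boundary differ, the cache miss comes from re-tokenization at that boundary rather than from stripped newlines.

from llama_cpp import Llama

llm = Llama(model_path="../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf")

header = "<|start_header_id|>assistant<|end_header_id|>"
answer = '\n\nHello! You said "one word", so I\'ll respond with:\n\nHello'

# Tokenized as one string (as the next request's prompt is)...
joined = llm.tokenize((header + answer).encode("utf-8"), add_bos=False, special=True)
# ...versus header and answer tokenized separately (roughly how the cache ends up,
# since the answer's tokens come from generation, not from re-tokenizing text).
separate = (llm.tokenize(header.encode("utf-8"), add_bos=False, special=True)
            + llm.tokenize(answer.encode("utf-8"), add_bos=False, special=True))

print(joined[:5])
print(separate[:5])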

@futurisold

+1, I have the same issue with version 0.2.82.

@woheller69

> +1, I have the same issue with version 0.2.82.

Check your prompt template.

@futurisold

> > +1, I have the same issue with version 0.2.82.
>
> Check your prompt template.

I suppose that the llama-cpp-python server should mirror the llama.cpp server, which works with the prompt I'm currently using, so it's clearly not the prompt.
