Cache misses previous generation #1369
Comments
Same here, and not just Llama 3; all other models have this issue after a recent update of llama.cpp.
I also notice that the second prompt is slower than, e.g., in gpt4all with an otherwise identical setup.
(And I don't know if this is normal, but I notice that the very first call has a different output from the subsequent ones. With temperature=0 I was expecting them all to be identical.) [EDIT] Ignore my comment; I was looking at raw times instead of tokens/second.
Maybe the \n(s) are stripped off.
+1, I have the same issue with version 0.2.82.
Check your prompt template.
I suppose that the llama-cpp-python server should mirror the llama.cpp server, which works with the prompt I'm currently using, so it's clearly not the prompt.
Expected Behavior
The server should cache both the previous prompt and the last generation.
Current Behavior
The cache misses at the end of the previous prompt, forcing the server to re-evaluate the previous answer in full.
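As far as I understand it, the prompt cache keeps the token sequence of the previous prompt plus generation and reuses the longest matching token prefix of the new prompt; everything after the first mismatch has to be evaluated again. A minimal sketch of that matching logic (the token values below are only illustrative, not taken from the actual cache):

```python
# Sketch of prefix-cache matching; token values are made up for illustration.
def longest_common_prefix(cached: list[int], new_prompt: list[int]) -> int:
    """Return how many leading tokens of the new prompt can be reused from the cache."""
    n = 0
    for a, b in zip(cached, new_prompt):
        if a != b:
            break
        n += 1
    return n

# Previous prompt + generation ended in a single "\n\n"-style token...
cached_tokens = [128000, 9906, 271]
# ...but the re-tokenized history encodes the same text as two single-newline tokens,
# so the match stops right there and everything after it is evaluated again.
new_prompt_tokens = [128000, 9906, 198, 198, 13347]

print(longest_common_prefix(cached_tokens, new_prompt_tokens))  # 2
```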
Environment and Context
I'm interfacing with the llama-cpp-python server through a SillyTavern instance running in OpenAI-compatible mode.
System is Linux orangepi5 6.1.43-rockchip-rk3588 #1.1.8 SMP Fri Feb 2 21:16:10 CST 2024 aarch64 GNU/Linux with
Python 3.11.7,
GNU Make 4.3 and
g++ (Debian 12.2.0-14) 12.2.0
llama-cpp-python: commit 0281214
fastapi 0.110.2
numpy 1.26.4
sse-starlette 2.1.0
uvicorn 0.29.0
vendor/llama.cpp: commit 0e4802b2ecbaab04b4f829fde4a3096ca19c84b5
Author: loonerin [email protected]
Date: Fri Apr 19 13:03:35 2024 -0400
Model metadata as reported by llama-cpp-python:
Failure Information (for bugs)
It seems that the last cached token is replaced by two different tokens in the next request.
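A quick way to check this is to inspect the tokenizer directly. The sketch below is only a hypothetical check, assuming 271 and 198 are the double- and single-newline tokens of the Llama 3 vocabulary; `Llama.tokenize()` / `Llama.detokenize()` are the standard llama-cpp-python helpers:

```python
# Sketch: verify that one cached token and two re-tokenized tokens decode to the same text.
from llama_cpp import Llama

llm = Llama(
    model_path="../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf",
    vocab_only=True,  # load only the tokenizer, no weights needed for this check
)

print(llm.tokenize(b"\n\n", add_bos=False))  # how "\n\n" tokenizes on its own
print(llm.detokenize([271]))                 # expected: b'\n\n' (the cached token)
print(llm.detokenize([198, 198]))            # expected: b'\n\n' (two single-newline tokens)
```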
Steps to Reproduce
{messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
The last streamed messages are:
data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": ":\n\n"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}, "logprobs": null, "finish_reason": null}]}
data: {"id": "chatcmpl-47757f4e-8c62-42e5-8169-147e86c17762", "model": "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true", "created": 1713664603, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {}, "logprobs": null, "finish_reason": "stop"}]}
{"messages":[{"role":"system","content":"[Start a new Chat]"},{"role":"assistant","content":"Hello"},{"role":"user","content":"Write one word"},{"role":"assistant","content":"Hello! You said \"one word\", so I'll respond with:\n\nHello"},{"role":"user","content":"Hi"}],"model":"../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true","temperature":1,"max_tokens":2048,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":1,"logit_bias":{},"seed":0}
This gives the following output:
As can be seen, token 31 (271) is replaced by two tokens (198, 198), so the cache does not match the prompt. However, I cannot see any difference in the messages nor in the detokenized strings.
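For completeness, a small script to replay the two requests above against the server, assuming it is running on the default http://localhost:8000 with the OpenAI-compatible chat route; the cache miss shows up as the second request taking roughly as long as re-evaluating the whole previous answer:

```python
# Sketch: replay the two requests from "Steps to Reproduce" and time them.
# Assumes the llama-cpp-python server is running on its default port 8000.
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "../llama.cpp/models/llama-3-8b-Instruct.Q4_K_M.gguf?download=true"

first = [
    {"role": "system", "content": "[Start a new Chat]"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Write one word"},
]
second = first + [
    {"role": "assistant", "content": "Hello! You said \"one word\", so I'll respond with:\n\nHello"},
    {"role": "user", "content": "Hi"},
]

for messages in (first, second):
    payload = {
        "messages": messages,
        "model": MODEL,
        "temperature": 1,
        "max_tokens": 2048,
        "stream": True,
        "seed": 0,
    }
    start = time.time()
    with requests.post(URL, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if line:
                print(line.decode())
    # With a working cache the second request should only need to evaluate the new
    # user turn; a cache miss re-evaluates the whole previous answer instead.
    print(f"request took {time.time() - start:.1f}s")
```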