Bug: Gemma 2 incoherent output when using quantized k cache without Flash Attention #8853
Comments
I can't reproduce with an RTX 2060:

GGML_CUDA=1 make -j && ./llama-server -m models/gemma-2-9b-it/ggml-model-q4_k_s.gguf -t 6 -c 8192 -ngl 31 -ctk q4_0 --host 127.0.0.1 --port 8080

I ccache found, compilation results will be cached. Disable with GGML_NO_CCACHE.
/usr/bin/ccache c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS examples/deprecation-warning/deprecation-warning.o -o main -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/local/cuda/lib64/stubs -L/usr/lib/wsl/lib

curl -s --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "hello, how can", "n_predict": 16}' | jq
{
"content": " I find a good mechanic in my area?\n\nIt's tough to find",
"id_slot": 0,
"stop": true,
"model": "models/gemma-2-9b-it/ggml-model-q4_k_s.gguf",
"tokens_predicted": 16,
"tokens_evaluated": 5,
"generation_settings": {
"n_ctx": 8192,
"n_predict": -1,
"model": "models/gemma-2-9b-it/ggml-model-q4_k_s.gguf",
"seed": 4294967295,
"temperature": 0.800000011920929,
"dynatemp_range": 0,
"dynatemp_exponent": 1,
"top_k": 40,
"top_p": 0.949999988079071,
"min_p": 0.05000000074505806,
"tfs_z": 1,
"typical_p": 1,
"repeat_last_n": 64,
"repeat_penalty": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"penalty_prompt_tokens": [],
"use_penalty_prompt_tokens": false,
"mirostat": 0,
"mirostat_tau": 5,
"mirostat_eta": 0.10000000149011612,
"penalize_nl": false,
"stop": [],
"n_keep": 0,
"n_discard": 0,
"ignore_eos": false,
"stream": false,
"logit_bias": [],
"n_probs": 0,
"min_keep": 0,
"grammar": "",
"samplers": [
"top_k",
"tfs_z",
"typical_p",
"top_p",
"min_p",
"temperature"
]
},
"prompt": "hello, how can",
"truncated": false,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": true,
"stopping_word": "",
"tokens_cached": 20,
"timings": {
"prompt_n": 5,
"prompt_ms": 126.053,
"prompt_per_token_ms": 25.2106,
"prompt_per_second": 39.66585483883763,
"predicted_n": 16,
"predicted_ms": 1317.154,
"predicted_per_token_ms": 82.322125,
"predicted_per_second": 12.147402657548016
}
}
I've just tested it and yes, you are right: out of the box with the llama.cpp server it works. However, as soon as you use the chat template, it stops working properly. Here's how to reproduce: open the llama.cpp server GUI and click on "New UI". Choose ChatML without a system prompt, edit the ChatML tokens to match the correct Gemma 2 prompt template, then chat with the model below. Does it still work then?
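For reference, here is a minimal sketch of the same check done without the GUI, sending a Gemma-2-templated prompt directly to the /completion endpoint. It assumes the standard Gemma 2 turn markers (<start_of_turn> / <end_of_turn>) and the same server settings as above; the prompt text and n_predict value are only illustrative:

curl -s --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "<start_of_turn>user\nhello, how can I fix my bike?<end_of_turn>\n<start_of_turn>model\n", "n_predict": 32}' | jq

If the problem is tied to the prompt template rather than the frontend, this request should already produce the garbled output.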
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
Output like "Mh giàu され rodas reliablyacheteurδε Są" occurs when using a quantized K cache with CUDA and Gemma 2. Here's how to reproduce:
./llama-server -m "Gemma-2-9B-It-SPPO-Iter3-Q4_K_S.gguf" -t 6 -c 8192 -ngl 31 -ctk q4_0 --host 127.0.0.1 --port 8080
Then connect a frontend like SillyTavern to it and chat; the output comes out incoherent. Strangely, this only happens with llama-server, not with main/llama-cli.
Note: I can't say whether this issue also occurs with full offloading, since I only have 6 GB of VRAM.
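For comparison with the CLI path, a rough equivalent llama-cli run with the same cache settings might look like the sketch below. The prompt text is illustrative and assumes the standard Gemma 2 turn markers; -e enables escape processing so the \n sequences become real newlines:

./llama-cli -m "Gemma-2-9B-It-SPPO-Iter3-Q4_K_S.gguf" -t 6 -c 8192 -ngl 31 -ctk q4_0 -e -p "<start_of_turn>user\nhello, how can<end_of_turn>\n<start_of_turn>model\n" -n 32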
Name and Version
./llama-cli --version
version: 3506 (76614f3)
built with MSVC 19.29.30154.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
No response