Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
version: 5269 (1d36b36)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 5060 Ti 16GB + RTX 4060 Ti 16GB
Models
Qwen_Qwen3-30B-A3B-Q6_K.gguf by bartowski.
sha256sum: d511d02955714b08ff1b4354d6eae8ea513179a83fa5498466db2731528074dd
Problem description & steps to reproduce
I'm using a grammar to simulate Qwen's no-think prompt format. Sometimes the output is generated correctly; sometimes the model outputs the wrong token while still conforming to the grammar.
The command I'm using to test:
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
"prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
"grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
"temperature": 0.001,
"n_predict": 6,
"seed": 42
}'
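For convenience, here is a minimal repro loop (a sketch, not part of the original report): it assumes llama-server is listening on localhost:8080 and that jq is installed, and simply repeats the request above while counting how often the full <think>\n\n</think> prefix shows up. Since n_predict is 6, the wrong path gets cut off before the closing tag, so a plain string check distinguishes the two cases.

#!/usr/bin/env bash
# Repro loop sketch: repeat the request above and count how often the closing
# </think> tag is generated. Assumes llama-server on localhost:8080 and jq.
ok=0; bad=0
for i in $(seq 1 20); do
  content=$(curl -s http://localhost:8080/completion -H "Content-Type: application/json" -d '{
    "prompt": "<|im_start|>system\n<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n",
    "grammar": "root ::= \"<think>\\n\\n</think>\\n\\n\" .*",
    "temperature": 0.001,
    "n_predict": 6,
    "seed": 42
  }' | jq -r '.content')
  case "$content" in
    "<think>"$'\n\n'"</think>"*) ok=$((ok+1)) ;;
    *) bad=$((bad+1)); printf 'run %d: truncated tag: %q\n' "$i" "$content" ;;
  esac
done
echo "correct: $ok  wrong: $bad"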
Correct output: [151667 198 198 151668 ...] <think>\n\n</think>...
Wrong output: [151667 198 198 27 14 ...] <think>\n\n</...
Sometimes the model produces the correct output; other times it produces the wrong output, and the subsequent generation breaks because the model never sees the </think> token. I'm not restarting llama-server between tests and I'm not changing the seed. I expect the model to always output token 151668.
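As a sanity check (a sketch, assuming this build exposes the server's /detokenize endpoint), the token ids from the two runs can be mapped back to text to confirm that 151668 is the special </think> token, while 27 and 14 are presumably the plain-text "<" and "/" pieces:

curl http://localhost:8080/detokenize -H "Content-Type: application/json" -d '{"tokens": [151667, 198, 198, 151668]}'
curl http://localhost:8080/detokenize -H "Content-Type: application/json" -d '{"tokens": [151667, 198, 198, 27, 14]}'

The first request should come back as the full <think>\n\n</think> prefix; the second should show the same tag being spelled out character by character with ordinary tokens instead of the special </think> id.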
Command line used to launch llama-server: /llama-server -ngl 175 -t 6 -c 32768 --host 0.0.0.0 -fa -ctk q8_0 -ctv q8_0 --slots -a current --temp 0.6
First Bad Commit
No response
Relevant log output
`{"index":0,"content":"<think>\n\n</think"...`