Name and Version
```
$ /mnt/nvme/llama-server/llama-server-be0e35 --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4187 (be0e350c)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
```
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
There is a fairly consistent ~16% drop in tokens/second when using `--cache-type-k q8_0 --cache-type-v q8_0` together with a draft model. The drop does not happen if I don't use a draft model.
| model | python (t/s) | typescript (t/s) | swift (t/s) |
|---|---|---|---|
| qwen-coder-32b-q4 | 79.90 | 54.48 | 46.67 |
| qwen-coder-32b-q4-kv | 66.60 (-16.6%) | 45.27 (-16%) | 39.24 (-15.9%) |
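The percentages are the relative drop in generation (eval) tokens/second between the two configs; for example, for the python column:

```bash
# relative drop for the python column: 79.90 t/s -> 66.60 t/s
awk 'BEGIN { printf "%.1f%%\n", (79.90 - 66.60) / 79.90 * 100 }'   # prints 16.6%
```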
The test I am using prompts each model for a snake game written in python, typescript, and swift:

```bash
for model in "qwen-coder-32b-q4" "qwen-coder-32b-q4-kv"; do
  for lang in "python" "typescript" "swift"; do
    echo "Generating Snake Game in $lang using $model"
    curl -s --url http://localhost:8080/v1/chat/completions -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}" > /dev/null
  done
done
```
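The tokens/second numbers are taken from the per-request timing lines that llama-server prints (see the raw test data below). Something along these lines pulls the generation speed out of the log; `server.log` here is just a placeholder for wherever the output goes:

```bash
# grab the generation speed (tokens per second), skipping the
# "prompt eval time" lines; server.log is a placeholder path
grep 'eval time' server.log | grep -v 'prompt eval' \
  | sed -E 's/.*, *([0-9.]+) tokens per second.*/\1/'
```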
Here is my llama-swap configuration for the models above. The only difference is the addition of the cache-type flags:
```yaml
models:
  "qwen-coder-32b-q4":
    # main model on 3090, draft on P40 #1
    #
    # gist results: python: 79.97 tps, typescript: 54.48 tps, swift: 46.67 tps
    cmd: >
      /mnt/nvme/llama-server/llama-server-be0e35
      --host 127.0.0.1 --port 9503
      --flash-attn --metrics
      --slots
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --ctx-size 19000
      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device CUDA0
      --device-draft CUDA1
    proxy: "http://127.0.0.1:9503"

  "qwen-coder-32b-q4-kv":
    # main model on 3090, draft on P40 #1
    #
    # gist results: python: 66.60, typescript 45.27, swift 39.24
    cmd: >
      /mnt/nvme/llama-server/llama-server-be0e35
      --host 127.0.0.1 --port 9503
      --flash-attn --metrics
      --slots
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --ctx-size 19000
      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device CUDA0
      --device-draft CUDA1
      --cache-type-k q8_0 --cache-type-v q8_0   <-- ADDED THESE FLAGS
    proxy: "http://127.0.0.1:9503"
```
Raw test data:
```
# qwen-coder-32b-q4 (python, typescript, swift)
prompt eval time =    64.50 ms /    6 tokens ( 10.75 ms per token,  93.02 tokens per second)
       eval time = 11264.22 ms /  900 tokens ( 12.52 ms per token,  79.90 tokens per second)
prompt eval time =    68.14 ms /    7 tokens (  9.73 ms per token, 102.73 tokens per second)
       eval time = 15766.30 ms /  859 tokens ( 18.35 ms per token,  54.48 tokens per second)
prompt eval time =    61.53 ms /    6 tokens ( 10.25 ms per token,  97.51 tokens per second)
       eval time = 19349.37 ms /  903 tokens ( 21.43 ms per token,  46.67 tokens per second)

# qwen-coder-32b-q4-kv (python, typescript, swift)
prompt eval time =    52.95 ms /   23 tokens (  2.30 ms per token, 434.37 tokens per second)
       eval time = 13513.06 ms /  900 tokens ( 15.01 ms per token,  66.60 tokens per second)
prompt eval time =    69.98 ms /    7 tokens ( 10.00 ms per token, 100.03 tokens per second)
       eval time = 19462.99 ms /  881 tokens ( 22.09 ms per token,  45.27 tokens per second)
prompt eval time =    63.20 ms /    6 tokens ( 10.53 ms per token,  94.94 tokens per second)
       eval time = 27041.86 ms / 1061 tokens ( 25.49 ms per token,  39.24 tokens per second)
```
First Bad Commit
No response
Relevant log output
No response