Misc. bug: [server] Using q8_0 for KV cache reduces performance when also using a draft model #10552

Closed · reported by @mostlygeek

Description

Name and Version

```
$ /mnt/nvme/llama-server/llama-server-be0e35 --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4187 (be0e350c)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
```

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

There is a pretty consistent 16% drop in tokens/second when using `--cache-type-k q8_0 --cache-type-v q8_0` together with a draft model. It doesn't happen if I don't use a draft model.

| model | python (tok/s) | typescript (tok/s) | swift (tok/s) |
| --- | --- | --- | --- |
| qwen-coder-32b-q4 | 79.90 | 54.48 | 46.67 |
| qwen-coder-32b-q4-kv | 66.60 (-16.6%) | 45.27 (-16.9%) | 39.24 (-15.9%) |

The test I am using prompts for a snake game to be written in Python, TypeScript, and Swift:

```sh
for model in "qwen-coder-32b-q4" "qwen-coder-32b-q4-kv"; do
    for lang in "python" "typescript" "swift"; do
        echo "Generating Snake Game in $lang using $model"
        curl -s --url http://localhost:8080/v1/chat/completions -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}" > /dev/null
    done
done
```
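
If client-side numbers are wanted as well, the same loop can report throughput directly. A minimal sketch, assuming `jq` is installed and that the response includes the standard OpenAI-style `usage` object; these are wall-clock timings, so they will read slightly lower than the server-side eval figures:

```sh
for model in "qwen-coder-32b-q4" "qwen-coder-32b-q4-kv"; do
    for lang in "python" "typescript" "swift"; do
        start=$(date +%s.%N)
        # completion token count from the standard OpenAI-style usage object
        tokens=$(curl -s --url http://localhost:8080/v1/chat/completions \
            -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}" \
            | jq '.usage.completion_tokens')
        end=$(date +%s.%N)
        awk -v t="$tokens" -v s="$start" -v e="$end" -v m="$model/$lang" \
            'BEGIN { printf "%s: %d tokens, %.2f tok/s\n", m, t, t / (e - s) }'
    done
done
```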

Here is my llama-swap configuration for the models above. The only difference is the addition of the cache-type flags:

```yaml
models:
  "qwen-coder-32b-q4":
    # main model on 3090, draft on P40 #1
    #
    # gist results: python: 79.97 tps, typescript: 54.48 tps, swift: 46.67 tps
    cmd: >
      /mnt/nvme/llama-server/llama-server-be0e35
      --host 127.0.0.1 --port 9503
      --flash-attn --metrics
      --slots
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --ctx-size 19000
      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device CUDA0
      --device-draft CUDA1
    proxy: "http://127.0.0.1:9503"

  "qwen-coder-32b-q4-kv":
    # main model on 3090, draft on P40 #1
    #
    # gist results: python: 66.60, typescript 45.27, swift 39.24
    cmd: >
      /mnt/nvme/llama-server/llama-server-be0e35
      --host 127.0.0.1 --port 9503
      --flash-attn --metrics
      --slots
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --ctx-size 19000
      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device CUDA0
      --device-draft CUDA1
      --cache-type-k q8_0 --cache-type-v q8_0        <-- ADDED THESE FLAGS
    proxy: "http://127.0.0.1:9503"
```
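
For anyone reproducing without llama-swap, the same A/B comparison can be run by launching llama-server directly and toggling only the two cache-type flags. A sketch using the paths and settings from the config above (note the test loop targets the llama-swap proxy on 8080; a direct run would be queried on 9503):

```sh
# Baseline run (default f16 KV cache). For the slower B run, append:
#   --cache-type-k q8_0 --cache-type-v q8_0
/mnt/nvme/llama-server/llama-server-be0e35 \
    --host 127.0.0.1 --port 9503 \
    --flash-attn --metrics --slots \
    --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf -ngl 99 \
    --ctx-size 19000 \
    --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf -ngld 99 \
    --draft-max 16 --draft-min 4 --draft-p-min 0.4 \
    --device CUDA0 --device-draft CUDA1
```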

Raw test data:

```
# qwen-coder-32b-q4 (python, typescript, swift)
prompt eval time =      64.50 ms /     6 tokens (   10.75 ms per token,    93.02 tokens per second)
       eval time =   11264.22 ms /   900 tokens (   12.52 ms per token,    79.90 tokens per second)       
prompt eval time =      68.14 ms /     7 tokens (    9.73 ms per token,   102.73 tokens per second)
       eval time =   15766.30 ms /   859 tokens (   18.35 ms per token,    54.48 tokens per second)       
prompt eval time =      61.53 ms /     6 tokens (   10.25 ms per token,    97.51 tokens per second)
       eval time =   19349.37 ms /   903 tokens (   21.43 ms per token,    46.67 tokens per second)


# qwen-coder-32b-q4-kv (python, typescript, swift)
prompt eval time =      52.95 ms /    23 tokens (    2.30 ms per token,   434.37 tokens per second)
       eval time =   13513.06 ms /   900 tokens (   15.01 ms per token,    66.60 tokens per second)       
prompt eval time =      69.98 ms /     7 tokens (   10.00 ms per token,   100.03 tokens per second)
       eval time =   19462.99 ms /   881 tokens (   22.09 ms per token,    45.27 tokens per second)       
prompt eval time =      63.20 ms /     6 tokens (   10.53 ms per token,    94.94 tokens per second)
       eval time =   27041.86 ms /  1061 tokens (   25.49 ms per token,    39.24 tokens per second)
```
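
The slowdown can be recomputed from the raw lines above. A small sketch, assuming the output has been saved to a hypothetical `results.log` and relying on the log format shown (the generation tokens/second figure sits at the end of each `eval time` line):

```sh
# First three "eval time" lines are the baseline runs, the last three
# the q8_0 KV runs; print each pair and the relative change.
grep -E '^ +eval time' results.log \
  | awk -F'[ ,]+' '{ tps[NR] = $(NF-3) }
      END { for (i = 1; i <= 3; i++)
              printf "run %d: %.2f -> %.2f tok/s (%+.1f%%)\n",
                     i, tps[i], tps[i+3], (tps[i+3] - tps[i]) / tps[i] * 100 }'
```

This reproduces the -16.6% / -16.9% / -15.9% deltas in the table above.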

First Bad Commit

No response

Relevant log output

No response
