
Misc. bug: Obvious performance gap between the Vulkan and CUDA backends #16230

@MaoJianwei

Description


Name and Version

PS F:\llm\llama-b6568-bin-win-vulkan-x64> .\llama-cli.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6568-bin-win-vulkan-x64\ggml-rpc.dll
[2025-09-25 00:14:31.678][info][16388] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-cli.exe, pid:5032
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) UHD Graphics 630 (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6568-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6568-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6568 (f2a789e3)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6568-bin-win-cuda-12.4-x64> .\llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Quadro P620, compute capability 6.1, VMM: yes
load_backend: loaded CUDA backend from F:\llm\llama-b6568-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from F:\llm\llama-b6568-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from F:\llm\llama-b6568-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
version: 6568 (f2a789e3)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

Run llama-server with each backend and compare the prompt eval (prefill) and eval (decode) speeds between the Vulkan and CUDA builds (see the llama-bench sketch after the commands below).

Vulkan:
84.29 tokens per second for prefill phase
34.19 tokens per second for decode phase

CUDA:
524.58 tokens per second for prefill phase
17.96 tokens per second for decode phase

Why does this performance difference exist?
Could you please improve Vulkan's prefill speed and CUDA's decode speed?

@JohannesGaessler @slaren @0cc4m

PS F:\llm\llama-b6568-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf  --no-mmap --jinja --verbose-prompt    --host 0.0.0.0  --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics
PS F:\llm\llama-b6568-bin-win-cuda-12.4-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf  --no-mmap --jinja --verbose-prompt    --host 0.0.0.0  --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics
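For a more controlled side-by-side measurement, llama-bench from the same release packages could be run against both backends. A minimal sketch, assuming llama-bench.exe ships alongside llama-server.exe in these builds and using the same model; flash-attention and KV-cache settings would need to be matched to the server flags where llama-bench supports them:

PS F:\llm\llama-b6568-bin-win-vulkan-x64> .\llama-bench.exe -m ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf -ngl 100
PS F:\llm\llama-b6568-bin-win-cuda-12.4-x64> .\llama-bench.exe -m ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf -ngl 100

By default llama-bench reports prompt-processing and token-generation throughput separately, which makes the prefill/decode comparison between the two builds direct.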

First Bad Commit

No response

Relevant log output

PS F:\llm\llama-b6568-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf  --no-mmap --jinja --verbose-prompt    --host 0.0.0.0  --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics


main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 766
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 766, n_tokens = 766, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 766, n_tokens = 766
slot      release: id  0 | task 0 | stop processing: n_past = 1459, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    9087.16 ms /   766 tokens (   11.86 ms per token,    84.29 tokens per second)
       eval time =   20297.20 ms /   694 tokens (   29.25 ms per token,    34.19 tokens per second)
      total time =   29384.35 ms /  1460 tokens
srv  update_slots: all slots are idle
PS F:\llm\llama-b6568-bin-win-cuda-12.4-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf  --no-mmap --jinja --verbose-prompt    --host 0.0.0.0  --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics


main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 766
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 766, n_tokens = 766, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 766, n_tokens = 766
slot      release: id  0 | task 0 | stop processing: n_past = 1352, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    1460.21 ms /   766 tokens (    1.91 ms per token,   524.58 tokens per second)
       eval time =   32675.95 ms /   587 tokens (   55.67 ms per token,    17.96 tokens per second)
      total time =   34136.16 ms /  1353 tokens
srv  update_slots: all slots are idle
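
As a sanity check, the reported rates follow directly from the timing lines above:

Vulkan: 766 tokens / 9.08716 s ≈ 84.3 t/s prefill; 694 tokens / 20.29720 s ≈ 34.2 t/s decode
CUDA:   766 tokens / 1.46021 s ≈ 524.6 t/s prefill; 587 tokens / 32.67595 s ≈ 18.0 t/s decode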
