Name and Version
PS F:\llm\llama-b6568-bin-win-vulkan-x64> .\llama-cli.exe --version
load_backend: loaded RPC backend from F:\llm\llama-b6568-bin-win-vulkan-x64\ggml-rpc.dll
[2025-09-25 00:14:31.678][info][16388] [huya-helper.cpp:378#init_log] graphic-hook 64bit log init suceed.
exe:llama-cli.exe, pid:5032
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Quadro P620 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = Intel(R) UHD Graphics 630 (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from F:\llm\llama-b6568-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from F:\llm\llama-b6568-bin-win-vulkan-x64\ggml-cpu-haswell.dll
version: 6568 (f2a789e3)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
PS F:\llm\llama-b6568-bin-win-cuda-12.4-x64> .\llama-cli.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro P620, compute capability 6.1, VMM: yes
load_backend: loaded CUDA backend from F:\llm\llama-b6568-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from F:\llm\llama-b6568-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from F:\llm\llama-b6568-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
version: 6568 (f2a789e3)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
Run llama-server with the Vulkan and CUDA backends (commands below) and compare the prompt eval (prefill) and eval (decode) speeds.
| Backend | Prefill (tokens/s) | Decode (tokens/s) |
|---|---|---|
| Vulkan | 84.29 | 34.19 |
| CUDA | 524.58 | 17.96 |
Why does this performance difference exist?
Could you please improve Vulkan's prefill speed and CUDA's decode speed?
@JohannesGaessler @slaren @0cc4m
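For reference, a minimal way to reproduce the measurement against either server is to send a single completion request and read back the timings it reports. The sketch below is not from the original report; it assumes the server is listening on 127.0.0.1:8080 (as started with the commands that follow) and that the `/completion` response carries a `timings` object with `prompt_per_second` and `predicted_per_second` fields, as recent builds report. The prompt text and `n_predict` value are arbitrary.

```python
# Illustrative sketch: query a running llama-server and print its reported
# prefill/decode throughput. Assumes the /completion response includes a
# "timings" object (prompt_per_second, predicted_per_second, ...).
import json
import urllib.request

payload = {
    "prompt": "Explain the difference between prefill and decode in one paragraph.",
    "n_predict": 256,      # force a reasonably long decode phase
    "temperature": 0.0,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

t = result.get("timings", {})
print(f"prefill: {t.get('prompt_per_second', 'n/a')} tok/s "
      f"({t.get('prompt_n', '?')} tokens)")
print(f"decode : {t.get('predicted_per_second', 'n/a')} tok/s "
      f"({t.get('predicted_n', '?')} tokens)")
```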
PS F:\llm\llama-b6568-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf --no-mmap --jinja --verbose-prompt --host 0.0.0.0 --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics
PS F:\llm\llama-b6568-bin-win-cuda-12.4-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf --no-mmap --jinja --verbose-prompt --host 0.0.0.0 --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics
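The tokens-per-second figures quoted above come straight from the `prompt eval time` and `eval time` lines in the logs below; they are simply token counts divided by elapsed time. A quick illustrative check:

```python
# Arithmetic check (illustrative only): tokens per second = tokens / elapsed seconds,
# using the token counts and millisecond timings from the server logs below.
def tok_per_sec(n_tokens: int, elapsed_ms: float) -> float:
    return n_tokens / (elapsed_ms / 1000.0)

# Vulkan run
print(tok_per_sec(766, 9087.16))    # ~84.29  prompt eval (prefill)
print(tok_per_sec(694, 20297.20))   # ~34.19  eval (decode)

# CUDA run
print(tok_per_sec(766, 1460.21))    # ~524.58 prompt eval (prefill)
print(tok_per_sec(587, 32675.95))   # ~17.96  eval (decode)
```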
First Bad Commit
No response
Relevant log output
PS F:\llm\llama-b6568-bin-win-vulkan-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf --no-mmap --jinja --verbose-prompt --host 0.0.0.0 --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 766
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 766, n_tokens = 766, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 766, n_tokens = 766
slot release: id 0 | task 0 | stop processing: n_past = 1459, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 9087.16 ms / 766 tokens ( 11.86 ms per token, 84.29 tokens per second)
eval time = 20297.20 ms / 694 tokens ( 29.25 ms per token, 34.19 tokens per second)
total time = 29384.35 ms / 1460 tokens
srv update_slots: all slots are idle
PS F:\llm\llama-b6568-bin-win-cuda-12.4-x64> .\llama-server.exe --model ..\Qwen2.5-1.5B-Instruct-Q4_K_M.gguf --no-mmap --jinja --verbose-prompt --host 0.0.0.0 --flash-attn on --cache-type-k f32 --cache-type-v f32 -ngl 100 --metrics
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 766
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 766, n_tokens = 766, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 766, n_tokens = 766
slot release: id 0 | task 0 | stop processing: n_past = 1352, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 1460.21 ms / 766 tokens ( 1.91 ms per token, 524.58 tokens per second)
eval time = 32675.95 ms / 587 tokens ( 55.67 ms per token, 17.96 tokens per second)
total time = 34136.16 ms / 1353 tokens
srv update_slots: all slots are idle