Misc. bug: The inference speed of llama-server is one-third of that of llama-cli #12171
Comments
Can you post the speed numbers by adding -t 1?
Of course! Your suggestion is indeed an excellent debugging method! Here are the commands I reran:
llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 --host 0.0.0.0 -t 1
llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 --host 0.0.0.0 -t 1 -nkvo
llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -t 1
llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -nkvo -t 1
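One way to read the speed numbers straight from the server side is to query the /completion endpoint directly; a minimal sketch, assuming the default port 8080, a local request, and that this build reports a timings block in the response (jq is only used to pull out one field):
# ask for a short completion and print tokens-per-second as reported by the server
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "You are a helpful assistant.", "n_predict": 128}' \
  | jq '.timings.predicted_per_second'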
I discovered another key point for reproducing the bug: it depends on how I compiled the program. The GGML_OPENMP=OFF flag does not affect the performance of llama-cli, but it severely impacts the performance of llama-server.
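For clarity, a minimal sketch of the two build configurations being compared, reusing the flags from the cmake command quoted further down in this thread (the build directory names are placeholders):
# OpenMP disabled: the configuration that reproduces the slowdown
cmake -B build-no-omp -DGGML_CUDA=ON -DGGML_OPENMP=OFF
cmake --build build-no-omp --config Release -j
# OpenMP enabled (the default), for comparison
cmake -B build-omp -DGGML_CUDA=ON -DGGML_OPENMP=ON
cmake --build build-omp --config Release -j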
I didn't get it: does it impact performance in a negative or in a positive way when you use GGML_OPENMP=OFF?
Let me summarize all the test data (note the two lines marked with asterisks*).
Additional tests with GGML_OPENMP=ON (24-thread only):
Observations: llama-cli and llama-server appear to use thread pools differently.
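One way to probe the thread-pool suspicion is to watch per-thread CPU usage of the server process while a request is generating; a rough sketch, assuming a Linux host with the usual procps tools and a single running llama-server instance:
# live per-thread view of the server process during generation
top -H -p "$(pgrep -f llama-server)"
# or a one-shot snapshot of each thread's CPU share
ps -T -p "$(pgrep -f llama-server)" -o spid,pcpu,comm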
While conducting inference with DeepSeek R1 671B Q4 (#11397 (comment)), I discovered anomalous behavior in llama-server. I initially achieved up to 15 t/s inference speed using llama-cli, but the speed dropped to about 0.1 t/s when working with the API through llama-server. I subsequently replicated this performance degradation with the -nkvo parameter in multiple computing environments (compiled with GGML_OPENMP=OFF on all test systems). To demonstrate the issue more efficiently, I'm using the Qwen2.5 14B model as the example here.
Could it be related to the -DGGML_SCHED_MAX_COPIES build option or to --override-tensor?
@jukofyork Thanks for your attention and suggestions. I recompiled and ran the program again. It seems to be unrelated to the options -DGGML_SCHED_MAX_COPIES and --override-tensor; it is only related to GGML_OPENMP.
cmake -B build3 -DGGML_CUDA=ON -DGGML_BUILD_NUMBER=3 -DGGML_OPENMP=OFF
Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
version: b4819
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
version: b4819
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
In the same environment, with the same parameters, the inference speed of llama-server is one-third of that of llama-cli.
The command lines are as follows:
llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 --host 0.0.0.0
This parameter configuration achieves an inference speed of: 25.87 t/s
llama-server -m /data4/qwen2.5-14b-deep-q4_k.gguf -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -ngl 160 -nkvo --host 0.0.0.0
This parameter configuration achieves an inference speed of: 5.83 t/s
llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160 -nkvo
This parameter configuration achieves an inference speed of: 18.25 t/s
llama-cli -m /data4/qwen2.5-14b-deep-q4_k.gguf -cnv -p "You are a helpful assistant." -fa -c 512 --temp 0.6 --top-p 0.95 -s 3047 -if -mli -ngl 160
This parameter configuration achieves an inference speed of: 24.27 t/s
Although the -nkvo option keeps KV-cache calculations on the CPU and therefore slows down inference (for example, llama-cli's speed drops from 24.27 to 18.25 t/s), that is expected behavior. However, enabling -nkvo in llama-server causes a far greater drop than expected; on my other computers it even falls to 0.x t/s. llama-server with -nkvo should reach speeds similar to the 18.xx t/s of llama-cli. I have reproduced this issue in multiple computer environments, and I suspect there is a bug in the thread-pool usage of llama-server. Thank you for this project that lets me run LLMs locally. Looking forward to having this bug fixed.
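A possible cross-check that is independent of either front end would be llama-bench from the same build; a sketch, assuming that (as far as I recall) its -fa and -nkvo options take 0/1 values rather than acting as bare flags:
# pure decode benchmark with KV offload enabled vs. disabled
llama-bench -m /data4/qwen2.5-14b-deep-q4_k.gguf -ngl 160 -fa 1 -nkvo 0 -n 128
llama-bench -m /data4/qwen2.5-14b-deep-q4_k.gguf -ngl 160 -fa 1 -nkvo 1 -n 128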
First Bad Commit
No response
Relevant log output