Description
llama-cpp-python exhibits a severe bottleneck on the main Python thread that is not present in llama.cpp itself.
Running a server with llama.cpp directly, using

```
./server -ngl 999 -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --port 12345 -c 8192
```

the typical response speed is 70 t/s.
Meanwhile, running a server with llama-cpp-python, using

```
python -m llama_cpp.server --model models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --n_ctx 8192 --n_gpu_layers 999 --port 12345
```

results in a mere 35 t/s.
This also applies to larger models: FP16 Llama 3 runs at 35 t/s in llama.cpp but only 24 t/s in llama-cpp-python. The backend thread's block time appears to be consistently very long, resulting in a large performance penalty across the board.
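In case it is useful for reproduction, below is a rough sketch of how client-side throughput can be measured against the OpenAI-compatible `/v1/completions` endpoint that `llama_cpp.server` exposes; the prompt, `max_tokens` value, and reliance on the `usage` field are illustrative assumptions, not the exact methodology behind the numbers above.

```python
# Rough client-side throughput check against the OpenAI-compatible
# /v1/completions endpoint exposed by `python -m llama_cpp.server`.
# The prompt and max_tokens below are illustrative, not the exact benchmark.
import time

import requests

URL = "http://localhost:12345/v1/completions"  # port from the command above

payload = {
    "prompt": "Write a short story about a robot learning to paint.",
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.time() - start
resp.raise_for_status()

tokens = resp.json()["usage"]["completion_tokens"]
print(f"{tokens} completion tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} t/s")
```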
In htop it can be observed that the llama-cpp-python server is completely pegging the main Python process while the GPU remains mostly idle. This is further confirmed by reading the kernel driver's GPU busy percentage directly from `/sys/class/drm/card1/device/gpu_busy_percent`, which reads 99% for llama.cpp and only 55% for llama-cpp-python.
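For reference, here is a minimal sketch of how that GPU busy figure can be sampled while a generation request is in flight; the one-second polling interval and stop-with-Ctrl-C loop are arbitrary choices.

```python
# Sample the driver's reported GPU utilisation once per second while a
# generation request is running; stop with Ctrl-C to print the average.
# The card index (card1) matches the sysfs path mentioned above.
import time

GPU_BUSY_PATH = "/sys/class/drm/card1/device/gpu_busy_percent"

samples = []
try:
    while True:
        with open(GPU_BUSY_PATH) as f:
            samples.append(int(f.read().strip()))
        time.sleep(1.0)
except KeyboardInterrupt:
    if samples:
        print(f"mean GPU busy: {sum(samples) / len(samples):.1f}% over {len(samples)} samples")
```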
Setup is a 7900 XTX GPU with a 7900X CPU @ 6 GHz with all the C libs compiled locally.
Possibly related to #1376; based on the numbers reported there, at least part of the severe slowdown may derive from the grammar handling.
Potential duplicate of #1447, but the numbers presented there are very different from my own, and without more information I believe a different issue may be at play there.