
Severe Main Thread Bottleneck #1452

Closed

@Beinsezii

Description

llama-cpp-python exhibits a severe bottleneck on the main Python thread that is not otherwise present in llama.cpp.

Running a server with llama.cpp directly using

./server -ngl 999 -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --port 12345 -c 8192

The typical response speed is 70 t/s

Meanwhile, running a server with llama-cpp-python using

python -m llama_cpp.server --model models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --n_ctx 8192 --n_gpu_layers 999 --port 12345

results in a mere 35 t/s.

This also applies to larger models: FP16 Llama 3 runs at 35 t/s in llama.cpp but only 24 t/s in llama-cpp-python. The backend thread's block time appears to be consistently very long, resulting in a massive performance penalty across the board.
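
For anyone trying to reproduce the comparison, here is a minimal sketch of a tokens-per-second measurement against a streaming completions endpoint. The endpoint path, prompt, and the assumption that each streamed chunk carries roughly one token are mine, not part of the original report; llama-cpp-python's server exposes the OpenAI-style /v1/completions route, while llama.cpp's ./server may need its native /completion endpoint instead.

import json
import time

import requests

# Rough tokens-per-second measurement over a streaming completion.
# Port, prompt, and max_tokens are placeholders.
URL = "http://localhost:12345/v1/completions"
payload = {"prompt": "Write a short story about a robot.", "max_tokens": 256, "stream": True}

start = time.time()
chunks = 0
with requests.post(URL, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        json.loads(data)  # one streamed completion fragment, roughly one token
        chunks += 1

elapsed = time.time() - start
print(f"{chunks} chunks in {elapsed:.2f}s -> {chunks / elapsed:.1f} t/s")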

In htop it can be observed that the llama-cpp-python server completely pegs the main Python process while the GPU remains mostly idle. This is further confirmed by directly reading the kernel driver's GPU busy percentage from

/sys/class/drm/card1/device/gpu_busy_percent

which reads 99% for llama.cpp but only 55% for llama-cpp-python.
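
For completeness, a minimal sketch of how that busy figure can be sampled during generation; the polling interval and duration are arbitrary choices, not from the original report.

import statistics
import time

# Sample the amdgpu busy percentage reported by the kernel driver while a
# generation request is in flight. card1 matches the sysfs path above.
BUSY_PATH = "/sys/class/drm/card1/device/gpu_busy_percent"

samples = []
end = time.time() + 30  # sample for ~30 seconds
while time.time() < end:
    with open(BUSY_PATH) as f:
        samples.append(int(f.read().strip()))
    time.sleep(0.1)

print(f"mean GPU busy: {statistics.mean(samples):.1f}%  min: {min(samples)}%  max: {max(samples)}%")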

The setup is a 7900 XTX GPU and a 7900X CPU @ 6 GHz, with all of the C libraries compiled locally.

Possibly related to #1376; based on the numbers there, at least part of the severe slowdown may derive from grammar handling.

Potentially a duplicate of #1447, but the numbers presented there are very different from my own, and without more information I believe a different issue may be at play there.
