
Severe Main Thread Bottleneck #1452


Closed
Beinsezii opened this issue May 13, 2024 · 3 comments
Labels: bug, performance

Comments

@Beinsezii

llama-cpp-python suffers from a severe bottleneck on the main Python thread that is not present in llama.cpp.

Running a server with llama.cpp directly using

./server -ngl 999 -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --port 12345 -c 8192

The typical response speed is 70 t/s

Meanwhile, running a server with llama-cpp-python using

python -m llama_cpp.server --model models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --n_ctx 8192 --n_gpu_layers 999 --port 12345

Results in a mere 35 t/s

This also applies to larger models: FP16 Llama 3 runs at 35 t/s in llama.cpp but only 24 t/s in llama-cpp-python. The time the backend thread spends blocked appears to be consistently very long, resulting in a large performance penalty across the board.
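
For anyone wanting to reproduce the comparison end to end, a rough timing script along these lines should work against either server (just a sketch, assuming both expose the OpenAI-compatible /v1/chat/completions endpoint on port 12345; the prompt and token budget are arbitrary):

import json
import time
import urllib.request

URL = "http://localhost:12345/v1/chat/completions"  # assumed endpoint and port

payload = {
    "messages": [{"role": "user", "content": "Write a short story about bees."}],
    "max_tokens": 256,
    "temperature": 1.0,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Time a single non-streaming request and divide the reported completion
# tokens by wall time to get an approximate t/s figure.
start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.perf_counter() - start

tokens = body["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} t/s")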

In htop, the llama-cpp-python server can be seen completely pegging the main Python process while the GPU remains mostly idle. This is further confirmed by reading the kernel driver's GPU busy percentage directly from

/sys/class/drm/card1/device/gpu_busy_percent

Which reads 99% for llama.cpp and only 55% for llama-cpp-python
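
For reference, the GPU busy percentage can be sampled during generation with a small loop like this (a sketch; the card index under /sys/class/drm varies per system):

import time

# Path to the amdgpu busy counter; card1 here matches the path above,
# but the card index differs between systems.
BUSY_PATH = "/sys/class/drm/card1/device/gpu_busy_percent"

# Sample once per second for roughly 30 seconds while a generation runs.
for _ in range(30):
    with open(BUSY_PATH) as f:
        print(f"GPU busy: {f.read().strip()}%")
    time.sleep(1)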

Setup is a 7900 XTX GPU with a 7900X CPU @ 6 GHz with all the C libs compiled locally.

Possibly related to #1376; based on the numbers there, at least part of the severe slowdown may come from the grammar handling.

Potentially a duplicate of #1447, but the numbers presented there are very different from my own, and without more information I believe different issues may be at play.

@abetlen added the bug and performance labels on May 13, 2024
@abetlen
Owner

abetlen commented May 13, 2024

@Beinsezii thanks for reporting this, I'll take a look. To confirm: this happens even without any grammar constraint, correct?

I'll start looking into this. In the past the best way to debug has been py-spy with the --native flag to get a broad idea of where the thread spends most of its time outside of llama.cpp, and then line_profiler to narrow down the exact lines causing the issue. Usually it turns out to be an unnecessary dynamic memory allocation that we can pre-allocate at the Llama instance level, or something like that.
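
Roughly, that workflow is something like this (a sketch; the PID is a placeholder, and Llama.generate below is only an example target rather than a confirmed hotspot):

py-spy top --native --pid <server PID>
py-spy record --native -o pyspy.svg --pid <server PID>

And for line-level detail with line_profiler:

from line_profiler import LineProfiler
from llama_cpp import Llama

llm = Llama("models/Meta-Llama-3-8B-Instruct.Q8_0.gguf", n_ctx=8192, n_gpu_layers=999)

lp = LineProfiler()
lp.add_function(Llama.generate)  # example target; swap in whichever method py-spy points at
profiled = lp(lambda: llm("a story about bees", max_tokens=64))
profiled()
lp.print_stats()  # per-line timings for the profiled function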

@Beinsezii
Author

Beinsezii commented May 13, 2024

I modified the minimal repro from the grammar discussion and found that even without any grammar loaded, I peak at around 70% GPU busy.

from llama_cpp.llama import Llama


def formatMessages(messages):
    prompt = ""
    lastRole = "system"

    for message in messages:
        prompt += message["role"] + ":\n"
        if message["role"] != lastRole:
            prompt += "\n"
        prompt += message["content"] + "\n"
        lastRole = message["role"]

    prompt += "assistant:\n"

    return prompt


llama_model = "/home/beinsezii/Python/llmodels/Meta-Llama-3-8B-Instruct.Q8_0.gguf"

llm = Llama(llama_model, n_ctx=8162, n_gpu_layers=999)

system_prompt = """You are a skilled writing assistant.
Write a story based on the user's prompt
Always output your answer as JSON
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "a story about bees"},
]


# Only 70% GPU
response = llm(
    formatMessages(messages),
    max_tokens=1024,
    repeat_penalty=1.2,
    temperature=1.0,
    top_k=1000,
)

# # Also 70% GPU
# response = llm.create_chat_completion(
#     messages=messages,
#     temperature=0.7,
# )

# # Down to 30% GPU
# response = llm.create_chat_completion(
#     messages=messages,
#     response_format={
#         "type": "json_object",
#     },
#     temperature=0.7,
# )
#

py-spy doesn't actually work in my env... Let me see what else I can find.

Update: I tried the built-in cProfile, but it doesn't seem as helpful.
cprofile.txt
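
(For anyone wanting to reproduce it, a typical way to collect and re-sort a profile like this would be the following, where repro.py is just a placeholder name for the script above:

python -m cProfile -o repro.prof repro.py
python -c "import pstats; pstats.Stats('repro.prof').sort_stats('cumulative').print_stats(25)")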

@Beinsezii
Author

Beinsezii commented May 13, 2024

Interestingly, when I use the built-in minimal web interface of llama.cpp's server binary, I'm also down to about 70% GPU busy. Connecting to that exact same server instance with something like SillyTavern, I can use the full ≈99%, same as with the llama-bench binary. Might be a coincidence, but it's interesting that it caps out at about the same GPU busy percentage.
