
Huge difference in performance between llama.cpp and llama-cpp-python #1447


Closed

kseyhan opened this issue May 10, 2024 · 8 comments
Labels
bug Something isn't working performance

Comments

@kseyhan

kseyhan commented May 10, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

I'm running a bot on Libera IRC, and the difference in response time between llama.cpp and llama-cpp-python is huge when maxing out the context length.

This is how I run llama.cpp, which with the latest update results in a response time of 3 seconds for my bot:
./server -t 8 -a llama-3-8b-instruct -m ./Meta-Llama-3-8B-Instruct-Q6_K.gguf -c 8192 -ngl 100 --timeout 10

This is how I run llama-cpp-python, which results in a response time of 18 seconds for my bot:
python3 -m llama_cpp.server --model ./Meta-Llama-3-8B-Instruct-Q6_K.gguf --n_threads 8 --n_gpu_layers -1 --n_ctx 8192
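For reference, a rough in-process equivalent of the llama-cpp-python invocation above (a sketch only; the chat call and prompt are placeholders, and the constructor parameters simply mirror the server flags):

```python
# Sketch: in-process equivalent of the llama-cpp-python server flags above.
# The prompt is a placeholder; only the constructor parameters matter here.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct-Q6_K.gguf",
    n_ctx=8192,       # matches --n_ctx 8192
    n_threads=8,      # matches --n_threads 8
    n_gpu_layers=-1,  # matches --n_gpu_layers -1 (offload all layers)
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```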

Am I doing something wrong, or is this normal?

Environment and Context

I experienced this behaviour on both Linux and Windows, whether self-compiled or using the pre-compiled wheels.

  • Physical (or virtual) hardware you are using, e.g. for Linux:
    CPU: Model name: 13th Gen Intel(R) Core(TM) i5-13600K
    GPU: VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090]

  • Operating System, e.g. for Linux i'm at right now:

Linux b6.8.8-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Apr 27 17:53:31 UTC 2024 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version = Python 3.11.9
$ make --version = GNU Make 4.4.1
$ g++ --version = g++ (GCC) 14.0.1 20240411 (Red Hat 14.0.1-0) 
nvcc makes use of gcc 13 = g++-13 (Homebrew GCC 13.2.0) 13.2.0
export NVCC_PREPEND_FLAGS='-ccbin /home/linuxbrew/.linuxbrew/bin/g++-13'
@abetlen added the bug and performance labels May 13, 2024
@nanafy

nanafy commented May 15, 2024

I can also now confirm this. I have been using this repo extensively since its inception; really awesome, and I appreciate @abetlen and all the others for making this software. I was looking at all the tickets mentioning this speed inconsistency between native llama.cpp and llama-cpp-python. I tried loading the Meta Llama 3 8B variant on both programs with the same init settings. Unfortunately, the speed advantage of native llama.cpp is incredibly noticeable.

I can help debug in any way possible; just let me know what information would be useful to relay to the repo contributors. I am using a 3060 GPU on Windows 10, and both variants (llama.cpp, llama-cpp-python) were GPU-enabled with maximum GPU offloading.

@kseyhan
Author

kseyhan commented May 19, 2024

I can also supply a database with test data for better reproduction if there is any need for it. The slowdown grows with the context length; that's my observation so far.

@qnixsynapse

I can confirm this. Even without maxing out the context length, the performance difference is noticeable.

@mdte123

mdte123 commented Jul 6, 2024

Hi, I've probably been struggling with this for the last day too.

I did find that setting the logits_all parameter to false (it's true by default) appeared to increase the tokens per second from about 8 to about 23 on a machine I have that is stuffed with old NVIDIA gaming cards. 23 tokens per second is what I was getting running llama.cpp inference directly.

I have no idea what logits are as I am a bit new to this. But, at least it's something to try out.

The logits_all parameter is a model setting in my OpenAI-Like server configuration file. No doubt, there is also a command line option for it too.

If these mysterious logits do turn out to be necessary for something, then I guess I will add another almost identical model in my configuration file with them turned on.
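For anyone who wants to try the same fix, a config-file entry along these lines is what I mean (a sketch only; the field names come from the server's ModelSettings documentation, the path and alias are placeholders, and the relevant line is "logits_all": false):

```json
{
  "models": [
    {
      "model": "./Meta-Llama-3-8B-Instruct-Q6_K.gguf",
      "model_alias": "llama-3-8b-instruct",
      "n_ctx": 8192,
      "n_gpu_layers": -1,
      "logits_all": false
    }
  ]
}
```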

@qnixsynapse

> I have no idea what logits are as I am a bit new to this. But, at least it's something to try out.

Is this option turned on by default? It shouldn't be, because for inference we only need the logits of the last token.
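As a rough illustration of why that matters, here is a back-of-the-envelope comparison of the logits buffer sizes (assuming Llama 3's 128,256-token vocabulary, an 8192-token context, and fp32 logits; the numbers are illustrative, not measurements from this thread):

```python
# Back-of-the-envelope size of the logits buffer per full-context request,
# assuming Llama 3's vocabulary size and fp32 logits. Illustrative only.
n_ctx, n_vocab, bytes_per_float = 8192, 128256, 4

all_positions = n_ctx * n_vocab * bytes_per_float  # logits_all = true
last_only = n_vocab * bytes_per_float              # logits_all = false

print(f"every position:  ~{all_positions / 1e9:.1f} GB")  # ~4.2 GB
print(f"last token only: ~{last_only / 1e6:.2f} MB")      # ~0.51 MB
```

Computing and copying logits for every position also adds work on every request, which would line up with the slowdown growing with the context length.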

@mdte123

mdte123 commented Jul 6, 2024

It is on by default according to this page:

https://llama-cpp-python.readthedocs.io/en/latest/server/#llama_cpp.server.settings.ModelSettings

And that's what my experiment confirmed.

Thank you.

@qnixsynapse

Thank you! Now it all makes sense.

@kseyhan
Author

kseyhan commented Jul 9, 2024

Well, I just want to report that I returned after some time away to play around with my bot again, and it now responds in 4-5 seconds with a completely filled context. I actually can't tell how or why it got fixed, but it seems fixed for me using the same config as before. I'm closing this as fixed now.

@kseyhan kseyhan closed this as completed Jul 9, 2024