
Huge difference in performance between llama.cpp and llama-cpp-python #1447


Closed

kseyhan opened this issue May 10, 2024 · 8 comments
Labels
bug Something isn't working performance

Comments

@kseyhan

kseyhan commented May 10, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

I'm running a bot on Libera IRC, and the difference in response time between llama.cpp and llama-cpp-python is huge when maxing out the context length.

This is how I run llama.cpp, which with the latest update results in a response time of 3 seconds for my bot:
./server -t 8 -a llama-3-8b-instruct -m ./Meta-Llama-3-8B-Instruct-Q6_K.gguf -c 8192 -ngl 100 --timeout 10

This is how I run llama-cpp-python, which results in a response time of 18 seconds for my bot:
python3 -m llama_cpp.server --model ./Meta-Llama-3-8B-Instruct-Q6_K.gguf --n_threads 8 --n_gpu_layers -1 --n_ctx 8192
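For reference, a rough in-process equivalent of the llama-cpp-python invocation above (a sketch only; the chat call and prompt are placeholders, and the constructor parameters simply mirror the server flags):

```python
# Sketch: in-process equivalent of the llama-cpp-python server flags above.
# The prompt is a placeholder; only the constructor parameters matter here.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct-Q6_K.gguf",
    n_ctx=8192,       # matches --n_ctx 8192
    n_threads=8,      # matches --n_threads 8
    n_gpu_layers=-1,  # matches --n_gpu_layers -1 (offload all layers)
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```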

Am I doing something wrong, or is this normal?

Environment and Context

I experienced this behaviour on both Linux and Windows, whether self-compiled or using the pre-compiled wheels.

  • Physical (or virtual) hardware you are using, e.g. for Linux:
    CPU: Model name: 13th Gen Intel(R) Core(TM) i5-13600K
    GPU: VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090]

  • Operating System, e.g. for Linux i'm at right now:

Linux b6.8.8-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Apr 27 17:53:31 UTC 2024 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version = Python 3.11.9
$ make --version = GNU Make 4.4.1
$ g++ --version = g++ (GCC) 14.0.1 20240411 (Red Hat 14.0.1-0) 
nvcc makes use of gcc 13 = g++-13 (Homebrew GCC 13.2.0) 13.2.0
export NVCC_PREPEND_FLAGS='-ccbin /home/linuxbrew/.linuxbrew/bin/g++-13'
@abetlen added the bug and performance labels May 13, 2024
@nanafy

nanafy commented May 15, 2024

I can also now confirm this. I have been using this repo extensively since its inception; really awesome, and I appreciate @abetlen and all the others for making this software. I was looking at all the tickets mentioning this speed inconsistency between native llama.cpp and llama-cpp-python. I tried loading the Meta Llama 3 8B variant on both programs with the same init settings. Unfortunately, the speed advantage of native llama.cpp is incredibly noticeable.

I can help debug in any way possible; just let me know what information would be useful to relay to the repo contributors. I am using a 3060 GPU on Windows 10, and both variants (llama.cpp, llama-cpp-python) were GPU-enabled with maximum GPU offloading.

@kseyhan
Author

kseyhan commented May 19, 2024

I can also supply a database with test data for better reproduction if there is any need for it. The slowdown grows with the context length; that's my observation so far.

@qnixsynapse

I can confirm this. Even without maxing out the context length, the performance difference is noticeable.

@mdte123

mdte123 commented Jul 6, 2024

Hi, I've probably been struggling with this for the last day too.

I did find that setting the logits_all parameter to false (it's true by default) appeared to increase the tokens per second from about 8 to about 23 on a machine I have that is stuffed with old NVIDIA gaming cards. 23 tokens per second is what I was getting running llama.cpp inference directly.

I have no idea what logits are as I am a bit new to this. But, at least it's something to try out.

The logits_all parameter is a model setting in my OpenAI-Like server configuration file. No doubt, there is also a command line option for it too.

If these mysterious logits do turn out to be necessary for something, then I guess I will add another almost identical model in my configuration file with them turned on.
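For anyone who wants to try the same fix, a config-file entry along these lines is what I mean (a sketch only; the field names come from the server's ModelSettings documentation, the path and alias are placeholders, and the relevant line is "logits_all": false):

```json
{
  "models": [
    {
      "model": "./Meta-Llama-3-8B-Instruct-Q6_K.gguf",
      "model_alias": "llama-3-8b-instruct",
      "n_ctx": 8192,
      "n_gpu_layers": -1,
      "logits_all": false
    }
  ]
}
```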

@qnixsynapse

> I have no idea what logits are as I am a bit new to this. But, at least it's something to try out.

Is this option turned on by default? It shouldn't be, because for inference we only need the logits of the last token.
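As a rough illustration of why that matters, here is a back-of-the-envelope comparison of the logits buffer sizes (assuming Llama 3's 128,256-token vocabulary, an 8192-token context, and fp32 logits; the numbers are illustrative, not measurements from this thread):

```python
# Back-of-the-envelope size of the logits buffer per full-context request,
# assuming Llama 3's vocabulary size and fp32 logits. Illustrative only.
n_ctx, n_vocab, bytes_per_float = 8192, 128256, 4

all_positions = n_ctx * n_vocab * bytes_per_float  # logits_all = true
last_only = n_vocab * bytes_per_float              # logits_all = false

print(f"every position:  ~{all_positions / 1e9:.1f} GB")  # ~4.2 GB
print(f"last token only: ~{last_only / 1e6:.2f} MB")      # ~0.51 MB
```

Computing and copying logits for every position also adds work on every request, which would line up with the slowdown growing with the context length.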

@mdte123

mdte123 commented Jul 6, 2024

It is on by default according to this page:

https://llama-cpp-python.readthedocs.io/en/latest/server/#llama_cpp.server.settings.ModelSettings

And that's what my experiment confirmed.

Thank you.

@qnixsynapse

Thank you! Now it all makes sense.

@kseyhan
Author

kseyhan commented Jul 9, 2024

Well, I just want to report that I returned after some time away to play around with my bot again, and it now responds in 4-5 seconds with a completely filled context. I actually can't tell how or why it got fixed, but it seems fixed for me using the same config as before. I'm closing this as fixed now.

@kseyhan kseyhan closed this as completed Jul 9, 2024