llama cpp python server for llava slow token per second #1354

Open · Kev1ntan opened this issue Apr 18, 2024 · 3 comments

@Kev1ntan

Darwin Feedloops-Mac-Studio-2.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020 arm64

command: python -m llama_cpp.server --model ./llava-v1.6-mistral-7b.Q8_0.gguf --port 9007 --host localhost --n_gpu_layers 33 --chat_format chatml --clip_model_path ./mmproj-mistral7b-f16.gguf

curl --location 'http://localhost:9007/v1/chat/completions' \
--header 'Authorization: Bearer 1n66q24dexb1cc8abc62b185dee0dd802pn92' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "hello"
                }
            ]
        }
    ],
    "max_tokens": 1000,
    "temperature": 0
}'

INFO: Started server process [71075]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:9007 (Press CTRL+C to quit)

llama_print_timings: load time = 1491.98 ms
llama_print_timings: sample time = 2.17 ms / 26 runs ( 0.08 ms per token, 12009.24 tokens per second)
llama_print_timings: prompt eval time = 1491.90 ms / 37 tokens ( 40.32 ms per token, 24.80 tokens per second)
llama_print_timings: eval time = 66226.55 ms / 25 runs ( 2649.06 ms per token, 0.38 tokens per second)
llama_print_timings: total time = 67791.77 ms / 62 tokens
INFO: ::1:55485 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Can someone help? Thanks.

@Kev1ntan Kev1ntan changed the title llama cpp server for llava slow token per second llama cpp python server for llava slow token per second Apr 19, 2024
@kinchahoy

I ran into this as well, and I think I know what the problem is.

The CMake config in llama.cpp is currently not optimizing for native architectures, as a workaround for an MoE issue (ggml-org/llama.cpp#6716), so CMake builds of llama.cpp are much slower right now. It took me a while to realize that llama-cpp-python uses llama.cpp's CMake build path, not Make.

An easy fix is to install llama-cpp-python from the repo, then edit extern/llama.cpp (where llama-cpp-python checks out llama.cpp as a submodule) and modify its CMakeLists.txt to say set(LLAMA_LLAMAFILE_DEFAULT ON).
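A rough sketch of that workflow, assuming a source checkout; the submodule path is an assumption and may be vendor/llama.cpp or extern/llama.cpp depending on the llama-cpp-python version:

# clone with the llama.cpp submodule included
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
# edit the submodule's CMakeLists.txt so the llamafile kernels default back on:
#     set(LLAMA_LLAMAFILE_DEFAULT ON)
# then build and install from the local tree so the edited config is actually used
pip install -e . --force-reinstall --no-cache-dir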

I gained a ton of performance that way.
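As an alternative that avoids editing the submodule, llama-cpp-python forwards CMake options set in the CMAKE_ARGS environment variable at build time; assuming the pinned llama.cpp revision exposes a LLAMA_LLAMAFILE option (option name not verified here), something like this should have the same effect:

# LLAMA_LLAMAFILE is an assumed option name; check the llama.cpp CMakeLists.txt first
CMAKE_ARGS="-DLLAMA_LLAMAFILE=ON" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir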

@shelbywhite

An easy fix is to install llama-cpp-python from the repo, then edit extern/llama.cpp (where llama-cpp-python checks out llama.cpp as a submodule) and modify its CMakeLists.txt to say set(LLAMA_LLAMAFILE_DEFAULT ON).

@kinchahoy do you happen to know the tokens/sec performance before and after applying your fix?

@kinchahoy

I'm doing fairly complex things, but it took an encode that was taking 24 s down to 11-12 s.
