llama cpp python server for llava slow token per second #1354

Open · Kev1ntan opened this issue Apr 18, 2024 · 3 comments

@Kev1ntan

Darwin Feedloops-Mac-Studio-2.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020 arm64

command: python -m llama_cpp.server --model ./llava-v1.6-mistral-7b.Q8_0.gguf --port 9007 --host localhost --n_gpu_layers 33 --chat_format chatml --clip_model_path ./mmproj-mistral7b-f16.gguf

curl --location 'http://localhost:9007/v1/chat/completions' \
--header 'Authorization: Bearer 1n66q24dexb1cc8abc62b185dee0dd802pn92' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "hello"
                }
            ]
        }
    ],
    "max_tokens": 1000,
    "temperature": 0
}'

INFO: Started server process [71075]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:9007 (Press CTRL+C to quit)

llama_print_timings: load time = 1491.98 ms
llama_print_timings: sample time = 2.17 ms / 26 runs ( 0.08 ms per token, 12009.24 tokens per second)
llama_print_timings: prompt eval time = 1491.90 ms / 37 tokens ( 40.32 ms per token, 24.80 tokens per second)
llama_print_timings: eval time = 66226.55 ms / 25 runs ( 2649.06 ms per token, 0.38 tokens per second)
llama_print_timings: total time = 67791.77 ms / 62 tokens
INFO: ::1:55485 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Can someone help? Thanks.

@Kev1ntan Kev1ntan changed the title llama cpp server for llava slow token per second llama cpp python server for llava slow token per second Apr 19, 2024
@kinchahoy

I ran into this as well, and I think I know what the problem is.

The CMake config in llama.cpp is currently not optimizing for native architectures, as a workaround for an MoE issue (ggml-org/llama.cpp#6716), so CMake builds of llama.cpp are much slower right now. It took me a while to realize that llama-cpp-python uses llama.cpp's CMake build path, not Make.

An easy fix is to install llama-cpp-python from the repo, then edit extern/llama.cpp (where llama-cpp-python checks out llama.cpp as a submodule) and modify its CMakeLists.txt to say set(LLAMA_LLAMAFILE_DEFAULT ON).
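A rough sketch of that workflow, assuming a source checkout; the submodule path is an assumption and may be vendor/llama.cpp or extern/llama.cpp depending on the llama-cpp-python version:

# clone with the llama.cpp submodule included
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
# edit the submodule's CMakeLists.txt so the llamafile kernels default back on:
#     set(LLAMA_LLAMAFILE_DEFAULT ON)
# then build and install from the local tree so the edited config is actually used
pip install -e . --force-reinstall --no-cache-dir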

I gained a ton of performance that way.
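As an alternative that avoids editing the submodule, llama-cpp-python forwards CMake options set in the CMAKE_ARGS environment variable at build time; assuming the pinned llama.cpp revision exposes a LLAMA_LLAMAFILE option (option name not verified here), something like this should have the same effect:

# LLAMA_LLAMAFILE is an assumed option name; check the llama.cpp CMakeLists.txt first
CMAKE_ARGS="-DLLAMA_LLAMAFILE=ON" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir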

@shelbywhite

An easy fix is to install llama-cpp-python from the repo, then edit extern/llama.cpp (where llama-cpp-python checks out llama.cpp as a submodule) and modify its CMakeLists.txt to say set(LLAMA_LLAMAFILE_DEFAULT ON).

@kinchahoy do you happen to know the tokens/sec performance before and after applying your fix?

@kinchahoy

I'm doing fairly complex things, but it took an encode that was taking 24 s down to 11-12 s.
