INFO: Started server process [71075]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:9007 (Press CTRL+C to quit)
llama_print_timings: load time = 1491.98 ms
llama_print_timings: sample time = 2.17 ms / 26 runs ( 0.08 ms per token, 12009.24 tokens per second)
llama_print_timings: prompt eval time = 1491.90 ms / 37 tokens ( 40.32 ms per token, 24.80 tokens per second)
llama_print_timings: eval time = 66226.55 ms / 25 runs ( 2649.06 ms per token, 0.38 tokens per second)
llama_print_timings: total time = 67791.77 ms / 62 tokens
INFO: ::1:55485 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Can someone help? Thanks.
Kev1ntan changed the title from "llama cpp server for llava slow token per second" to "llama cpp python server for llava slow token per second" on Apr 19, 2024.
I ran into this as well, and I think I know what the problem is.
The CMake config in llama.cpp is currently not optimizing for native architectures, as a workaround for an MoE issue (ggml-org/llama.cpp#6716), so CMake builds of llama.cpp are much slower right now. It took me a while to realize that llama-cpp-python uses llama.cpp's CMake build pathway, not the Makefile.
An easy fix is to install llama-cpp-python from the repo, then edit extern/llama.cpp (where llama-cpp-python checks out llama.cpp as a sub-repo) and change CMakeLists.txt to say set(LLAMA_LLAMAFILE_DEFAULT ON).
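For what it's worth, an alternative to editing CMakeLists.txt by hand may be to pass the option through CMAKE_ARGS when reinstalling. This is only a sketch, not something from the comment above, and it assumes LLAMA_LLAMAFILE is the llama.cpp CMake option whose default LLAMA_LLAMAFILE_DEFAULT controls; keep any other build flags you normally use (e.g. Metal) in the same CMAKE_ARGS string:

# rebuild llama-cpp-python with the llamafile kernels enabled (hypothetical alternative to the manual CMakeLists.txt edit)
CMAKE_ARGS="-DLLAMA_LLAMAFILE=ON" pip install --force-reinstall --no-cache-dir llama-cpp-python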
@kinchahoy do you by chance know the tokens/sec performance before applying your fix and then after?
system: Darwin Feedloops-Mac-Studio-2.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020 arm64
command: python -m llama_cpp.server --model ./llava-v1.6-mistral-7b.Q8_0.gguf --port 9007 --host localhost --n_gpu_layers 33 --chat_format chatml --clip_model_path ./mmproj-mistral7b-f16.gguf
curl --location 'http://localhost:9007/v1/chat/completions' \
--header 'Authorization: Bearer 1n66q24dexb1cc8abc62b185dee0dd802pn92' \
--header 'Content-Type: application/json' \
--data '{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "hello"
        }
      ]
    }
  ],
  "max_tokens": 1000,
  "temperature": 0
}'
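On the before/after tokens/sec question above: the figure to compare is the tokens-per-second value on the llama_print_timings eval time line (0.38 tokens per second in the log at the top). A minimal sketch for isolating just those lines, assuming the timings are written to the server process's output as shown in this log:

# start the server as above and keep only the timing lines
python -m llama_cpp.server --model ./llava-v1.6-mistral-7b.Q8_0.gguf --port 9007 --host localhost \
  --n_gpu_layers 33 --chat_format chatml --clip_model_path ./mmproj-mistral7b-f16.gguf \
  2>&1 | grep "eval time"
# then send the same curl request before and after rebuilding and compare the
# "tokens per second" number on the plain "eval time" line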