Severe Main Thread Bottleneck #1452
@Beinsezii thanks for reporting this, I'll take a look. To confirm: this is when calling without any grammar constraint, correct? I'll start looking into this, but in the past the best way to debug has been via py-spy.
I modified the minimal repro from the grammar discussion and found that even without any grammar files loaded, GPU busy peaks at around 70%.

```python
from llama_cpp.llama import Llama


def formatMessages(messages):
    # Flatten the chat messages into a plain-text prompt.
    prompt = ""
    lastRole = "system"
    for message in messages:
        prompt += message["role"] + ":\n"
        if message["role"] != lastRole:
            prompt += "\n"
        prompt += message["content"] + "\n"
        lastRole = message["role"]
    prompt += "assistant:\n"
    return prompt


llama_model = "/home/beinsezii/Python/llmodels/Meta-Llama-3-8B-Instruct.Q8_0.gguf"
llm = Llama(llama_model, n_ctx=8162, n_gpu_layers=999)

system_prompt = """You are a skilled writing assistant.
Write a story based on the user's prompt
Always output your answer as JSON
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "a story about bees"},
]

# Only 70% GPU
response = llm(
    formatMessages(messages),
    max_tokens=1024,
    repeat_penalty=1.2,
    temperature=1.0,
    top_k=1000,
)

# # Also 70% GPU
# response = llm.create_chat_completion(
#     messages=messages,
#     temperature=0.7,
# )

# # Down to 30% GPU
# response = llm.create_chat_completion(
#     messages=messages,
#     response_format={
#         "type": "json_object",
#     },
#     temperature=0.7,
# )
```
py-spy doesn't actually work in my env... let me see what else I can find. Update: I tried the built-in cProfile, but it doesn't seem as helpful.
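For reference, here is a minimal sketch of how a single generation call could be profiled with the standard-library cProfile. It reuses the `llm`, `formatMessages`, and `messages` objects from the repro above and is an illustration, not code from this thread:

```python
import cProfile
import pstats

# Profile one generation call to see where the main Python thread
# spends its time while the GPU sits partly idle.
profiler = cProfile.Profile()
profiler.enable()
llm(formatMessages(messages), max_tokens=256)
profiler.disable()

# Show the 25 entries with the largest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```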
Interestingly, when I use the native minimal web interface for […]
llama-cpp-python exhibits a severe bottleneck on the main Python thread that is not otherwise present in llama.cpp.
Running a server with llama.cpp directly, the typical response speed is 70 t/s. Meanwhile, running a server with llama-cpp-python yields a mere 35 t/s.
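As an illustration of how such throughput numbers can be compared, here is a minimal sketch that measures end-to-end tokens per second against an OpenAI-compatible /v1/completions endpoint. The host, port, use of the requests library, and the usage field are assumptions, not details from this issue:

```python
import time

import requests

# Assumed endpoint; llama-cpp-python's server defaults to port 8000.
# Point the URL at whichever server is being benchmarked.
URL = "http://localhost:8000/v1/completions"

payload = {"prompt": "Write a story about bees.\n", "max_tokens": 512}

start = time.time()
result = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

# The OpenAI-compatible response includes a token count in "usage".
completion_tokens = result["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"({completion_tokens / elapsed:.1f} t/s)")
```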
The same gap applies to fatter models: FP16 Llama 3 runs at 35 t/s in llama.cpp while hitting only 24 t/s in llama-cpp-python. The backend thread block time appears to be consistently very long, resulting in a massive performance penalty across the board.

In htop it can be observed that the llama-cpp-python server completely pegs the main Python process while the GPU remains mostly idle. This is further confirmed by reading the GPU busy percentage directly from the kernel driver, which reports 99% for llama.cpp and only 55% for llama-cpp-python.
Setup is a 7900 XTX GPU and a 7900X CPU @ 6 GHz, with all the C libs compiled locally.
Possibly related to #1376; based on their numbers, at least part of the severe slowdown may derive from the grammar.
Potential duplicate of #1447, but the numbers presented there are extremely different from my own, and without more information I believe there may be different issues at play there.