
Severe Main Thread Bottleneck #1452


Closed
Beinsezii opened this issue May 13, 2024 · 3 comments
Labels: bug, performance

Comments

@Beinsezii

llama-cpp-python suffers from a severe bottleneck on the main Python thread that is not present in llama.cpp.

Running a server with llama.cpp directly using

./server -ngl 999 -m models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --port 12345 -c 8192

The typical response speed is 70 t/s

Meanwhile, running a server with llama-cpp-python using

python -m llama_cpp.server --model models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --n_ctx 8192 --n_gpu_layers 999 --port 12345

Results in a mere 35 t/s

This also applies to larger models: FP16 Llama 3 runs at 35 t/s in llama.cpp but only 24 t/s in llama-cpp-python. The time the backend thread spends blocked appears to be consistently very long, resulting in a large performance penalty across the board.
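
For anyone wanting to reproduce the comparison end to end, a rough timing script along these lines should work against either server (just a sketch, assuming both expose the OpenAI-compatible /v1/chat/completions endpoint on port 12345; the prompt and token budget are arbitrary):

import json
import time
import urllib.request

URL = "http://localhost:12345/v1/chat/completions"  # assumed endpoint and port

payload = {
    "messages": [{"role": "user", "content": "Write a short story about bees."}],
    "max_tokens": 256,
    "temperature": 1.0,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Time a single non-streaming request and divide the reported completion
# tokens by wall time to get an approximate t/s figure.
start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.perf_counter() - start

tokens = body["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} t/s")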

In htop, the llama-cpp-python server can be seen completely pegging the main Python process while the GPU remains mostly idle. This is further confirmed by reading the kernel driver's GPU busy percentage directly from

/sys/class/drm/card1/device/gpu_busy_percent

Which reads 99% for llama.cpp and only 55% for llama-cpp-python
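
For reference, the GPU busy percentage can be sampled during generation with a small loop like this (a sketch; the card index under /sys/class/drm varies per system):

import time

# Path to the amdgpu busy counter; card1 here matches the path above,
# but the card index differs between systems.
BUSY_PATH = "/sys/class/drm/card1/device/gpu_busy_percent"

# Sample once per second for roughly 30 seconds while a generation runs.
for _ in range(30):
    with open(BUSY_PATH) as f:
        print(f"GPU busy: {f.read().strip()}%")
    time.sleep(1)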

Setup is a 7900 XTX GPU with a 7900X CPU @ 6 GHz with all the C libs compiled locally.

Possibly related to #1376; based on the numbers there, at least part of the severe slowdown may come from the grammar handling.

Potentially a duplicate of #1447, but the numbers presented there are very different from my own, and without more information I believe different issues may be at play.

@abetlen added the bug and performance labels on May 13, 2024
@abetlen
Owner

abetlen commented May 13, 2024

@Beinsezii thanks for reporting this, I'll take a look. To confirm: this happens even without any grammar constraint, correct?

I'll start looking into this. In the past the best way to debug has been py-spy with the --native flag to get a broad idea of where the thread spends most of its time outside of llama.cpp, and then line_profiler to narrow down the exact lines causing the issue. Usually it turns out to be an unnecessary dynamic memory allocation that we can pre-allocate at the Llama instance level, or something like that.
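
Roughly, that workflow is something like this (a sketch; the PID is a placeholder, and Llama.generate below is only an example target rather than a confirmed hotspot):

py-spy top --native --pid <server PID>
py-spy record --native -o pyspy.svg --pid <server PID>

And for line-level detail with line_profiler:

from line_profiler import LineProfiler
from llama_cpp import Llama

llm = Llama("models/Meta-Llama-3-8B-Instruct.Q8_0.gguf", n_ctx=8192, n_gpu_layers=999)

lp = LineProfiler()
lp.add_function(Llama.generate)  # example target; swap in whichever method py-spy points at
profiled = lp(lambda: llm("a story about bees", max_tokens=64))
profiled()
lp.print_stats()  # per-line timings for the profiled function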

@Beinsezii
Author

Beinsezii commented May 13, 2024

I modified the minimal repro from the grammar discussion and found that even without any grammar loaded, I peak at around 70% GPU busy.

from llama_cpp.llama import Llama


def formatMessages(messages):
    prompt = ""
    lastRole = "system"

    for message in messages:
        prompt += message["role"] + ":\n"
        if message["role"] != lastRole:
            prompt += "\n"
        prompt += message["content"] + "\n"
        lastRole = message["role"]

    prompt += "assistant:\n"

    return prompt


llama_model = "/home/beinsezii/Python/llmodels/Meta-Llama-3-8B-Instruct.Q8_0.gguf"

llm = Llama(llama_model, n_ctx=8162, n_gpu_layers=999)

system_prompt = """You are a skilled writing assistant.
Write a story based on the user's prompt
Always output your answer as JSON
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "a story about bees"},
]


# Only 70% GPU
response = llm(
    formatMessages(messages),
    max_tokens=1024,
    repeat_penalty=1.2,
    temperature=1.0,
    top_k=1000,
)

# # Also 70% GPU
# response = llm.create_chat_completion(
#     messages=messages,
#     temperature=0.7,
# )

# # Down to 30% GPU
# response = llm.create_chat_completion(
#     messages=messages,
#     response_format={
#         "type": "json_object",
#     },
#     temperature=0.7,
# )
#

py-spy doesn't actually work in my env... Let me see what else I can find.

Update: I tried the built-in cProfile, but it doesn't seem as helpful.
cprofile.txt
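
(For anyone wanting to reproduce it, a typical way to collect and re-sort a profile like this would be the following, where repro.py is just a placeholder name for the script above:

python -m cProfile -o repro.prof repro.py
python -c "import pstats; pstats.Stats('repro.prof').sort_stats('cumulative').print_stats(25)")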

@Beinsezii
Author

Beinsezii commented May 13, 2024

Interestingly, when I use the built-in minimal web interface of llama.cpp's server binary, I'm also down to about 70% GPU busy. Connecting to that exact same server instance with something like SillyTavern, I can use the full ≈99%, same as with the llama-bench binary. Might be a coincidence, but it's interesting that it caps out at about the same GPU busy percentage.
