
Huge performance discrepancy between llama-cpp-python and llama.cpp #398


shouyiwang opened this issue Jun 19, 2023 · 11 comments

shouyiwang commented Jun 19, 2023

Summary:

When testing the latest version of llama-cpp-python (0.1.64) alongside the corresponding commit of llama.cpp, I observed that llama.cpp completes significantly faster than llama-cpp-python in total execution time. GPU utilization is also consistently higher for llama.cpp than for llama-cpp-python.

Environment:

  • Processor: AMD Ryzen 5 5600
  • GPU: NVIDIA 4090 (Single)
  • OS: Ubuntu 22.04
  • Python: 3.10.9
  • GNU Make: 4.3
  • Compiler: g++ 11.3.0

Background

First, I manually updated the text-generation-webui requirements to the latest version of llama-cpp-python (0.1.64). After installing the update, I ran tests and saw that the speed had improved, but it was still much slower than llama.cpp.

To focus on llama-cpp-python's role, I wrote code to test llama-cpp-python separately.

Steps to Reproduce:

llama-cpp-python

  1. pip uninstall -y llama-cpp-python
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.64 --no-cache-dir
    
  2. conda list llama-cpp-python
    Make sure the version is 0.1.64.
  3. Write a test.py file with the following code, change the model path to any GGML model on your local machine, then run python test.py.
import sys
from llama_cpp import Llama

params = {
    'model_path': "/home/wsy/Projects/text-generation-webui/models/guanaco-33B.ggmlv3.q4_K_M.bin",
    'n_ctx': 1024,
    'seed': 4,
    'n_threads': 1,
    'n_batch': 256,
    'n_gpu_layers': 128  # offload all layers to the GPU
}

llm = Llama(**params)

# Stream up to 1000 tokens for the same prompt used with llama.cpp below.
stream = llm(
    "Write an essay about american history",
    max_tokens=1000,
    stream=True,
)

for output in stream:
    print(output['choices'][0]['text'], end='')
    sys.stdout.flush()

llama.cpp

  1. Go to the llama.cpp folder and run:
    git pull
    git checkout 8596af427722775f0df4a7c90b9af067ba90d4ef
    make clean
    make LLAMA_CUBLAS=1
    
  2. Run llama.cpp with the exact same parameters using the following command:
    ./main -m ../models/guanaco-33B.ggmlv3.q4_K_M.bin -p "Write an essay about american history" -ngl 128 -s 4 -n 1000 -t 1 --ctx-size 1024 --batch-size 256
    

Expected Outcome:

Similar performance and GPU utilization between llama-cpp-python and llama.cpp.

Actual Outcome:

Output of llama-cpp-python:

llama_print_timings:        load time =   450.16 ms
llama_print_timings:      sample time =   412.64 ms /  1000 runs   (    0.41 ms per token)
llama_print_timings: prompt eval time =   450.12 ms /     9 tokens (   50.01 ms per token)
llama_print_timings:        eval time = 30622.88 ms /   999 runs   (   30.65 ms per token)
llama_print_timings:       total time = 39541.67 ms

Output of llama.cpp:

llama_print_timings:        load time =  2480.53 ms
llama_print_timings:      sample time =   426.18 ms /  1000 runs   (    0.43 ms per token)
llama_print_timings: prompt eval time =   447.96 ms /     9 tokens (   49.77 ms per token)
llama_print_timings:        eval time = 29871.26 ms /   999 runs   (   29.90 ms per token)
llama_print_timings:       total time = 30938.72 ms
  • llama.cpp's total execution time was almost 9 seconds shorter than llama-cpp-python's (about 28% faster).
  • GPU utilization was constant at around 93% for llama.cpp, while it started at around 80% and gradually dropped to below 60% for llama-cpp-python, which might be indicative of the performance discrepancy.
  • In llama-cpp-python, the total time is significantly larger than the sum of sample time + prompt eval time + eval time. In contrast, these times are consistent for llama.cpp.

Updated Findings

I conducted more tests and discovered additional facts that could be useful in solving the problem:

  • 0.1.64 is significantly faster than 0.1.63, which indicates that the llama.cpp code has indeed been updated.
  • Earlier versions, such as 0.1.61, also suffer from the total time != sample time + prompt eval time + eval time issue.

It seems that the problem has existed for quite some time. When llama.cpp was slow, it wasn't very noticeable, but now that llama.cpp is fast, it is much more evident.

I would appreciate it if this performance discrepancy could be investigated and addressed. Thank you!

@johncronan

Your timings show that the issue is with the load time. It's confounded by the timings not being reported correctly, but there's no other possibility.

llama-cpp-python and llama.cpp's main program both just call llama_init_from_file to do the work of loading. I recommend collecting timings around llama-cpp-python's __init__ method, here. It's possible that something else, besides that call, is causing a slowdown there.

My guess, however, would be that something a bit more complicated is going on that slows down the file I/O within the library.
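A quick way to get a rough load-time number without modifying the library is to time the constructor call from a small script. This is just a sketch; the model path is a placeholder and the parameters simply mirror the test.py from the report:

import time
from llama_cpp import Llama

start = time.perf_counter()
llm = Llama(
    model_path="/path/to/guanaco-33B.ggmlv3.q4_K_M.bin",  # placeholder path
    n_ctx=1024,
    n_threads=1,
    n_batch=256,
    n_gpu_layers=128,
)
print(f"Llama() construction took {time.perf_counter() - start:.2f} s")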

shouyiwang (Author) commented Jun 21, 2023

@johncronan
Thank you for the insight! I will check the code to find a possible cause of this issue.
I'm not very familiar with Python, but maybe this could be an opportunity for me to learn more about it :)

uogbuji commented Jun 21, 2023

Just linking back to the last big performance effort, in #232. There was a lot of detailed info there which might be useful in figuring out where this lingering problem might lie.

shouyiwang (Author) commented Jun 21, 2023

@johncronan
As the GPU barely does any work while loading the model, I wrote a script that records GPU power usage by running "nvidia-smi" every 0.2 seconds during the tests (a sketch is at the end of this comment). It shows that the loading time is exactly the same in both cases.

The only difference I observed aligns with the GPU utilization that I reported in the bug report. When performing inference, the GPU consistently runs at full power when using llama.cpp, but there is a gradual drop in GPU utilization when using llama-cpp-python.

Based on these observations, I believe that the token generation timing reported by llama-cpp-python is incorrect.

Llama.cpp:
21W,21W,21W,21W,21W,21W,23W,34W,39W,58W,60W,61W,66W,70W,73W,75W,75W,74W,98W,146W,216W,286W,313W,320W,318W,317W,316W,318W,318W,316W,318W,318W,316W,317W,318W,317W,317W,318W,318W,316W,317W,318W,316W,317W,317W,317W,317W,318W,318W,316W,317W,318W,317W,317W,318W,317W,317W,318W,317W,317W,318W,318W,317W,318W,318W,318W,317W,317W,317W,317W,317W,317W,317W,318W,318W,317W,317W,317W,317W,317W,317W,317W,318W,317W,318W,317W,318W,317W,317W,317W,318W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,317W,318W,317W,317W,317W,317W,317W,317W,317W,257W,194W,130W,81W,78W,76W,76W,76W,76W

Llama-cpp-python:
21W,21W,21W,21W,21W,21W,21W,21W,26W,42W,44W,58W,62W,66W,71W,74W,75W,75W,74W,80W,127W,186W,256W,303W,319W,318W,318W,316W,317W,317W,317W,316W,317W,317W,315W,317W,316W,315W,317W,315W,316W,315W,314W,317W,316W,317W,316W,315W,317W,315W,314W,316W,313W,315W,315W,313W,312W,311W,310W,311W,313W,313W,312W,309W,305W,306W,303W,305W,306W,306W,307W,305W,305W,305W,305W,306W,304W,300W,301W,302W,301W,301W,301W,299W,297W,304W,297W,295W,288W,285W,289W,302W,310W,315W,312W,315W,312W,307W,296W,294W,290W,298W,294W,290W,294W,293W,295W,288W,288W,288W,286W,287W,290W,290W,290W,286W,285W,282W,284W,285W,288W,288W,284W,287W,282W,283W,280W,280W,283W,284W,281W,281W,285W,283W,280W,284W,281W,279W,280W,284W,283W,286W,283W,285W,284W,283W,279W,278W,233W,181W,128W,81W,79W,76W,75W

I power-limited my 4090 to 320W.
The loading phase for both of them spans exactly 16 samples, i.e. 16 * 0.2 = 3.2 seconds.
It is that fast because the 19 GB model is already cached in RAM.
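A minimal sketch of such a sampling loop (a reconstruction, not the exact script used; it assumes a single GPU and nvidia-smi's --query-gpu interface):

import subprocess
import time

readings = []
try:
    while True:
        # Query the instantaneous power draw in watts (one line per GPU).
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        readings.append(f"{float(out):.0f}W")
        time.sleep(0.2)
except KeyboardInterrupt:
    print(",".join(readings))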

@shouyiwang (Author)

And GPU usage every 0.2 seconds:
Llama.cpp:
0%, 0%, 0%, 0%, 0%, 0%, 4%, 4%, 27%, 57%, 57%, 57%, 39%, 2%, 21%, 97%, 91%, 92%, 92%, 92%, 92%, 92%, 92%, 92%, 92%, 93%, 92%, 92%, 92%, 92%, 92%, 92%, 92%, 92%, 93%, 92%, 93%, 92%, 93%, 92%, 93%, 92%, 93%, 92%, 92%, 93%, 92%, 92%, 93%, 92%, 92%, 92%, 93%, 93%, 92%, 92%, 92%, 92%, 92%, 93%, 93%, 93%, 93%, 93%, 93%, 93%, 93%, 93%, 92%, 92%, 92%, 93%, 93%, 93%, 92%, 92%, 93%, 93%, 92%, 93%, 93%, 92%, 93%, 92%, 93%, 93%, 93%, 93%, 92%, 93%, 93%, 58%, 95%, 14%, 0%, ^C0%, 0%, 0%, 0%,

Llama-cpp-python:
0%, 0%, 0%, 0%, 1%, 5%, 0%, 56%, 56%, 56%, 56%, 36%, 2%, 20%, 97%, 87%, 89%, 86%, 86%, 87%, 87%, 86%, 85%, 82%, 83%, 84%, 81%, 83%, 79%, 82%, 81%, 80%, 80%, 79%, 79%, 79%, 78%, 78%, 79%, 77%, 80%, 75%, 78%, 76%, 74%, 75%, 78%, 77%, 77%, 77%, 76%, 76%, 73%, 70%, 74%, 74%, 69%, 74%, 69%, 73%, 70%, 72%, 72%, 71%, 69%, 70%, 70%, 70%, 70%, 69%, 69%, 70%, 69%, 69%, 67%, 67%, 70%, 66%, 69%, 66%, 66%, 66%, 67%, 64%, 71%, 64%, 63%, 69%, 68%, 64%, 62%, 62%, 65%, 64%, 64%, 64%, 65%, 62%, 61%, 59%, 61%, 65%, 66%, 63%, 58%, 65%, 66%, 58%, 66%, 30%, 71%, 14%, 0%, 0%, 0%, ^C0%, 0%, 0%, 0%,

@johncronan

That's pretty definitive, then. Sorry to lead you astray!

remixer-dec commented Jun 21, 2023

I also noticed the difference in GPU usage on macOS (an entirely different GPU backend) with this wrapper, but only for a 30B model: around 50% GPU usage during inference. I also tried the example server from llama.cpp (./server); it sits at ~95% GPU usage most of the time.
There is no such problem with 7B/13B models.

elmisback commented Jun 24, 2023

Suggested by Reddit user VoidStarFoo in https://www.reddit.com/r/LocalLLaMA/comments/14evg0g/comment/joxwqyh/?utm_source=reddit&utm_medium=web2x&context=3. Forgive me if I'm misunderstanding or this has already been solved.

self._scores: npt.NDArray[np.single] = np.concatenate(

There's an np.concatenate inside a loop. If I understand the timings in that thread, 17% of the time was spent there in one test. np.concatenate returns a whole new array rather than updating the old one in place, so if I understand correctly, there's unnecessary copying going on.

I don't know how the array works and don't have time to test this at the moment.

If this is actually the issue, the fix could use standard fixed-size array operations:

If you know the size the array is going to grow to, you can pre-allocate all the space with np.zeros where it's initialized and update in place with something like array[i, :] = new_row, I think. I don't really know NumPy, and the indexing might be different if I'm misunderstanding how the dimensions work, but there is definitely a way to update an array in place.

If you don't know what size it's going to grow to, you can at least pre-allocate some large size and then, whenever it's about to overflow, create a new array that is twice as long, copy everything so far into the first half, and keep updating in place (see the sketch below).

Python lists already do something like this internally, but my understanding is that there's a reason for using an np array in this case.
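A minimal sketch of the grow-by-doubling idea (illustrative only; buf, used, and append_row are hypothetical names, not llama-cpp-python internals):

import numpy as np

def append_row(buf, used, new_row):
    """Write new_row into a pre-allocated buffer, doubling capacity when full."""
    if used == buf.shape[0]:
        # Buffer full: allocate twice the capacity once and copy existing rows.
        bigger = np.zeros((buf.shape[0] * 2, buf.shape[1]), dtype=buf.dtype)
        bigger[:used] = buf
        buf = bigger
    buf[used, :] = new_row  # in-place row update, no per-call concatenate
    return buf, used + 1

# Usage: buf = np.zeros((1024, n_vocab), dtype=np.single); used = 0
# then per token: buf, used = append_row(buf, used, scores_for_token)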

uogbuji commented Jun 25, 2023

Wanted to chip in on #420. It would be good to see if it helps folks who've experienced this slowdown. Here's a quick recipe for trying it:

pip uninstall llama-cpp-python
git clone https://github.com/samfundev/llama-cpp-python.git llama-cpp-python-samfundev
cd llama-cpp-python-samfundev
# Grab vendor/llama.cpp
git submodule init
git submodule update
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U --no-cache-dir .

You might want to run python -c "import llama_cpp; print(llama_cpp.__file__)" so you can find the files you just built and be absolutely sure their timestamps are right (just check the containing directory of that __init__.py).

I noticed a small speedup (though I haven't formally benchmarked), but then again llama-cpp-python was never that much slower for me in the first place.

abetlen (Owner) commented Jun 29, 2023

@elmisback thanks for the suggestion here. I think I was able to improve this further by keeping scores in a single numpy array that's allocated only once when you instantiate the model; this avoids any resizing/re-allocation during sampling as well.

Let me know if sampling runs faster.
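For anyone curious, the shape of that change is roughly the following (a sketch assuming a fixed (n_ctx, n_vocab) buffer, not the actual implementation):

import numpy as np

class ScoresBuffer:
    def __init__(self, n_ctx, n_vocab):
        # One allocation when the model is instantiated; never resized afterwards.
        self._scores = np.zeros((n_ctx, n_vocab), dtype=np.single)
        self._n_tokens = 0

    def store(self, logits):
        # Each evaluated token writes its logits into the next row in place.
        self._scores[self._n_tokens, :] = logits
        self._n_tokens += 1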

@davideuler

I worked around the performance issue with llama-cpp-python by downgrading it to 0.2.27, as discussed here:

#756
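For reference, pinning that version follows the same install pattern as in the original report (the CMake flags may need adjusting for your setup):

pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.27 --no-cache-dir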
