
Only concatenate after all batches are done #420

Merged
merged 1 commit into abetlen:main on Jun 26, 2023

Conversation

@samfundev (Contributor) commented Jun 24, 2023

Based on the discussion in #398, I was able to make Llama.eval faster by calling np.concatenate only once, after all the batches are done.
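For context, here is a minimal sketch of the pattern change described above (illustrative only, not the actual Llama.eval code; run_batch, eval_scores_slow, and eval_scores_fast are made-up names): concatenating inside the loop copies the whole scores array on every batch, whereas appending per-batch logits to a list and concatenating once at the end avoids the repeated copies.

```python
import numpy as np

def run_batch(batch, vocab_size):
    # Stand-in for the real per-batch evaluation; returns dummy logits
    # of shape (len(batch), vocab_size) just so the sketch runs.
    return np.zeros((len(batch), vocab_size), dtype=np.single)

def eval_scores_slow(batches, vocab_size):
    # Pattern this PR moves away from: np.concatenate inside the loop
    # re-allocates and copies the growing scores array on every batch.
    scores = np.empty((0, vocab_size), dtype=np.single)
    for batch in batches:
        scores = np.concatenate((scores, run_batch(batch, vocab_size)))
    return scores

def eval_scores_fast(batches, vocab_size):
    # Pattern this PR adopts: collect per-batch logits in a Python list
    # and call np.concatenate a single time after all batches are done.
    all_logits = [run_batch(batch, vocab_size) for batch in batches]
    if not all_logits:
        return np.empty((0, vocab_size), dtype=np.single)
    return np.concatenate(all_logits)
```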

Performance measurements are based on the tester included in the original issue.

Master: (~3137.75ms overhead)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      501    2.977    0.006   87.771    0.175 llama.py:400(eval)

This PR: (~350.9ms overhead)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      501    0.041    0.000   84.657    0.169 llama.py:400(eval)
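(The column layout above is Python's cProfile/pstats output. A rough sketch of how a similar profile could be collected is below; the model path, prompt, and token count are placeholders, not the tester from #398.)

```python
import cProfile
import pstats

from llama_cpp import Llama

# Placeholder model path; any local quantized model works for this sketch.
llm = Llama(model_path="./models/7B/ggml-model-q6_k.bin")

with cProfile.Profile() as profiler:
    # Generate enough tokens that Llama.eval is called many times.
    llm("Q: Name the planets in the solar system. A:", max_tokens=500)

# Show cumulative time spent in functions defined in llama.py.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("llama.py")
```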

@shouyiwang commented Jun 26, 2023

It works!

I found that when I used my original test scripts with the 7B model instead of the 33B model and increased the generation length to a larger number such as 1500, the discrepancy became much more pronounced. Note that the discrepancy only becomes apparent if your GPU is powerful enough: if I limit the power of my 4090 to 150 W (roughly 3060-level performance), the speed difference between llama.cpp and llama-cpp-python is very small.

I tested llama.cpp, the main branch, and this PR on a 4090 with a 7B Q6_K model; here are the results:

llama.cpp:

llama_print_timings:      sample time =   622.44 ms /  1500 runs   (    0.41 ms per token,  2409.87 tokens per second)
llama_print_timings: prompt eval time =   686.74 ms /   760 tokens (    0.90 ms per token,  1106.68 tokens per second)
llama_print_timings:        eval time = 13915.99 ms /  1498 runs   (    9.29 ms per token,   107.65 tokens per second)
llama_print_timings:       total time = 15485.60 ms

Main branch:

llama_print_timings:      sample time =   624.38 ms /  1499 runs   (    0.42 ms per token,  2400.77 tokens per second)
llama_print_timings: prompt eval time =    81.84 ms /     9 tokens (    9.09 ms per token,   109.96 tokens per second)
llama_print_timings:        eval time = 14081.58 ms /  1498 runs   (    9.40 ms per token,   106.38 tokens per second)
llama_print_timings:       total time = 31905.15 ms

This PR:

llama_print_timings:      sample time =   622.46 ms /  1499 runs   (    0.42 ms per token,  2408.17 tokens per second)
llama_print_timings: prompt eval time =    82.13 ms /     9 tokens (    9.13 ms per token,   109.59 tokens per second)
llama_print_timings:        eval time = 13908.38 ms /  1498 runs   (    9.28 ms per token,   107.70 tokens per second)
llama_print_timings:       total time = 16918.75 ms

This PR improved the total time from 31.9s to 16.9s. Well done!
Thank you so much for greatly improving the speed!

@shouyiwang

GPU (4090) utilization, sampled every 0.5 seconds during the above test:

llama.cpp:
0%, 0%, 2%, 20%, 51%, 87%, 86%, 87%, 87%, 87%, 87%, 87%, 88%, 88%, 87%, 88%, 88%, 88%, 88%, 88%, 88%, 88%, 89%, 89%, 89%, 89%, 89%, 89%, 89%, 89%, 98%, 48%, 0%, 0%, 0%, 0%,

Main branch:
0%, 0%, 0%, 5%, 56%, 12%, 74%, 69%, 64%, 63%, 60%, 56%, 56%, 53%, 50%, 49%, 47%, 49%, 46%, 45%, 46%, 43%, 42%, 40%, 40%, 42%, 40%, 40%, 39%, 38%, 36%, 37%, 36%, 38%, 37%, 37%, 34%, 33%, 37%, 33%, 33%, 33%, 35%, 33%, 33%, 33%, 32%, 34%, 31%, 29%, 30%, 34%, 30%, 34%, 32%, 31%, 34%, 30%, 31%, 30%, 33%, 0%, 0%,

This PR:
0%, 0%, 5%, 49%, 44%, 78%, 77%, 78%, 78%, 78%, 77%, 77%, 77%, 78%, 77%, 77%, 77%, 77%, 78%, 78%, 76%, 76%, 77%, 77%, 76%, 76%, 76%, 77%, 77%, 77%, 77%, 76%, 76%, 45%, 0%, 0%,

This PR doesn't completely solve the problem, but it's a significant improvement.
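(Utilization traces like the ones above can be collected by polling nvidia-smi on a fixed interval; the sketch below is one way to do it and is not necessarily how these numbers were gathered.)

```python
import subprocess
import time

def sample_gpu_utilization(interval_s=0.5, samples=60):
    """Poll nvidia-smi every interval_s seconds and return GPU utilization (%) readings."""
    readings = []
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        readings.append(int(out.stdout.strip().splitlines()[0]))
        time.sleep(interval_s)
    return readings

if __name__ == "__main__":
    print(", ".join(f"{u}%" for u in sample_gpu_utilization()))
```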

abetlen merged commit 04d9218 into abetlen:main on Jun 26, 2023
@abetlen (Owner) commented Jun 26, 2023

@samfundev great catch, I'll go through profiling the eval again this week!
