
Only concatenate after all batches are done #420

Merged
merged 1 commit into abetlen:main on Jun 26, 2023

Conversation

@samfundev (Contributor) commented Jun 24, 2023

Based on the discussion in #398, I was able to make Llama.eval faster by calling np.concatenate only once, after all the batches are done.
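For context, here is a minimal sketch of the pattern change described above (illustrative only, not the actual Llama.eval code; run_batch, eval_scores_slow, and eval_scores_fast are made-up names): concatenating inside the loop copies the whole scores array on every batch, whereas appending per-batch logits to a list and concatenating once at the end avoids the repeated copies.

```python
import numpy as np

def run_batch(batch, vocab_size):
    # Stand-in for the real per-batch evaluation; returns dummy logits
    # of shape (len(batch), vocab_size) just so the sketch runs.
    return np.zeros((len(batch), vocab_size), dtype=np.single)

def eval_scores_slow(batches, vocab_size):
    # Pattern this PR moves away from: np.concatenate inside the loop
    # re-allocates and copies the growing scores array on every batch.
    scores = np.empty((0, vocab_size), dtype=np.single)
    for batch in batches:
        scores = np.concatenate((scores, run_batch(batch, vocab_size)))
    return scores

def eval_scores_fast(batches, vocab_size):
    # Pattern this PR adopts: collect per-batch logits in a Python list
    # and call np.concatenate a single time after all batches are done.
    all_logits = [run_batch(batch, vocab_size) for batch in batches]
    if not all_logits:
        return np.empty((0, vocab_size), dtype=np.single)
    return np.concatenate(all_logits)
```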

Performance measurements are based on the tester included in the original issue.

Master: (~3137.75ms overhead)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      501    2.977    0.006   87.771    0.175 llama.py:400(eval)

This PR: (~350.9ms overhead)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      501    0.041    0.000   84.657    0.169 llama.py:400(eval)
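(The column layout above is Python's cProfile/pstats output. A rough sketch of how a similar profile could be collected is below; the model path, prompt, and token count are placeholders, not the tester from #398.)

```python
import cProfile
import pstats

from llama_cpp import Llama

# Placeholder model path; any local quantized model works for this sketch.
llm = Llama(model_path="./models/7B/ggml-model-q6_k.bin")

with cProfile.Profile() as profiler:
    # Generate enough tokens that Llama.eval is called many times.
    llm("Q: Name the planets in the solar system. A:", max_tokens=500)

# Show cumulative time spent in functions defined in llama.py.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("llama.py")
```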

@shouyiwang commented Jun 26, 2023

It works!

I found that when I used my original test scripts with the 7B model instead of the 33B model and increased the generation length to a larger number such as 1500, the discrepancy became much more pronounced. Note that the discrepancy only becomes apparent if your GPU is powerful enough: if I limit the power of my 4090 to 150 W (roughly 3060-level performance), the speed difference between llama.cpp and llama-cpp-python is very small.

I tested llama.cpp, the main branch, and this PR on a 4090 with a 7B Q6_K model; here are the results:

llama.cpp:

llama_print_timings:      sample time =   622.44 ms /  1500 runs   (    0.41 ms per token,  2409.87 tokens per second)
llama_print_timings: prompt eval time =   686.74 ms /   760 tokens (    0.90 ms per token,  1106.68 tokens per second)
llama_print_timings:        eval time = 13915.99 ms /  1498 runs   (    9.29 ms per token,   107.65 tokens per second)
llama_print_timings:       total time = 15485.60 ms

Main branch:

llama_print_timings:      sample time =   624.38 ms /  1499 runs   (    0.42 ms per token,  2400.77 tokens per second)
llama_print_timings: prompt eval time =    81.84 ms /     9 tokens (    9.09 ms per token,   109.96 tokens per second)
llama_print_timings:        eval time = 14081.58 ms /  1498 runs   (    9.40 ms per token,   106.38 tokens per second)
llama_print_timings:       total time = 31905.15 ms

This PR:

llama_print_timings:      sample time =   622.46 ms /  1499 runs   (    0.42 ms per token,  2408.17 tokens per second)
llama_print_timings: prompt eval time =    82.13 ms /     9 tokens (    9.13 ms per token,   109.59 tokens per second)
llama_print_timings:        eval time = 13908.38 ms /  1498 runs   (    9.28 ms per token,   107.70 tokens per second)
llama_print_timings:       total time = 16918.75 ms

This PR improved the total time from 31.9s to 16.9s. Well done!
Thank you so much for greatly improving the speed!

@shouyiwang

GPU (4090) utilization, sampled every 0.5 seconds during the above test:

llama.cpp:
0%, 0%, 2%, 20%, 51%, 87%, 86%, 87%, 87%, 87%, 87%, 87%, 88%, 88%, 87%, 88%, 88%, 88%, 88%, 88%, 88%, 88%, 89%, 89%, 89%, 89%, 89%, 89%, 89%, 89%, 98%, 48%, 0%, 0%, 0%, 0%,

Main branch:
0%, 0%, 0%, 5%, 56%, 12%, 74%, 69%, 64%, 63%, 60%, 56%, 56%, 53%, 50%, 49%, 47%, 49%, 46%, 45%, 46%, 43%, 42%, 40%, 40%, 42%, 40%, 40%, 39%, 38%, 36%, 37%, 36%, 38%, 37%, 37%, 34%, 33%, 37%, 33%, 33%, 33%, 35%, 33%, 33%, 33%, 32%, 34%, 31%, 29%, 30%, 34%, 30%, 34%, 32%, 31%, 34%, 30%, 31%, 30%, 33%, 0%, 0%,

This PR:
0%, 0%, 5%, 49%, 44%, 78%, 77%, 78%, 78%, 78%, 77%, 77%, 77%, 78%, 77%, 77%, 77%, 77%, 78%, 78%, 76%, 76%, 77%, 77%, 76%, 76%, 76%, 77%, 77%, 77%, 77%, 76%, 76%, 45%, 0%, 0%,

This PR doesn't completely solve the problem, but it's a significant improvement.
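(Utilization traces like the ones above can be collected by polling nvidia-smi on a fixed interval; the sketch below is one way to do it and is not necessarily how these numbers were gathered.)

```python
import subprocess
import time

def sample_gpu_utilization(interval_s=0.5, samples=60):
    """Poll nvidia-smi every interval_s seconds and return GPU utilization (%) readings."""
    readings = []
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        readings.append(int(out.stdout.strip().splitlines()[0]))
        time.sleep(interval_s)
    return readings

if __name__ == "__main__":
    print(", ".join(f"{u}%" for u in sample_gpu_utilization()))
```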

abetlen merged commit 04d9218 into abetlen:main on Jun 26, 2023
@abetlen (Owner) commented Jun 26, 2023

@samfundev great catch, I'll go through profiling the eval again this week!
