llama : make quantize example up to 2.7x faster #3115
Conversation
Oops, I forgot that Windows doesn't have pread.
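(For context: a portability shim along the lines below could emulate pread on Windows. This is only an illustrative sketch under assumed names - `pread_compat` is not from this PR - and note that, unlike a true pread, ReadFile on a synchronous handle also advances the shared file position.)

```cpp
// Illustrative sketch only, not the PR's code: emulate POSIX pread() on Windows
// by passing the file offset in an OVERLAPPED structure to ReadFile().
#ifdef _WIN32
#include <windows.h>
#include <io.h>

static long long pread_compat(int fd, void * buf, size_t count, long long offset) {
    HANDLE h = (HANDLE) _get_osfhandle(fd);
    OVERLAPPED ov = {};
    ov.Offset     = (DWORD) (offset & 0xFFFFFFFFll);
    ov.OffsetHigh = (DWORD) (offset >> 32);
    DWORD n_read  = 0;
    // Note: on a synchronous handle this also moves the file pointer,
    // which real pread() does not.
    if (!ReadFile(h, buf, (DWORD) count, &n_read, &ov)) {
        return -1;
    }
    return (long long) n_read;
}
#else
#include <unistd.h>

static long long pread_compat(int fd, void * buf, size_t count, long long offset) {
    return (long long) pread(fd, buf, count, (off_t) offset);
}
#endif
```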
What is the reason for this to be faster? I doubt it's the thread pool.
Edit: nvm, just saw this information is in the individual commits
What is this black magic !?!?
fast-quantize-0.mp4
Edit: using 10 out of 16 threads, the time goes down to 1.5s.
Can we avoid the threadpool.h stuff?
I don't think it brings much to the table in terms of performance, while it introduces too many C++-isms that I don't like.
Also - where are the histograms?
I think the primary improvement is the relocation of …
This PR makes expensive quantizations (k_quants) slower. E.g., for …
Could you try commit 97563d3? I think the default threading strategy I came up with may not be ideal in many configurations. Edit: 5 seconds? Quantizing 33B to Q4_0 on master takes two minutes on my hardware. Am I doing something wrong?
I wanted a thread pool because trying to attach a debugger to quantize produces quite a lot of spam from all the thousands of threads being created - and trying to trace the execution of that mess is probably not fun. I used C++11 because it seemed like the most elegant solution, but I could remove the dependency on std::future and std::packaged_task if you'd like.
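(For illustration: a minimal C++11 pool built on std::packaged_task and std::future might look roughly like the sketch below. This is an assumed outline, not the actual threadpool.h from this PR.)

```cpp
// Minimal illustrative thread pool (assumed design, not the PR's threadpool.h).
// Worker threads are created once and reused; submit() returns a std::future.
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class thread_pool {
public:
    explicit thread_pool(size_t n_threads) {
        for (size_t i = 0; i < n_threads; i++) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mutex);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) {
                            return;
                        }
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task(); // run the job outside the lock
                }
            });
        }
    }

    // Queue a job and get a future for its result.
    template <typename F>
    auto submit(F f) -> std::future<decltype(f())> {
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        auto fut  = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex);
            tasks.emplace([task] { (*task)(); });
        }
        cv.notify_one();
        return fut;
    }

    ~thread_pool() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            stop = true;
        }
        cv.notify_all();
        for (auto & w : workers) {
            w.join();
        }
    }

private:
    std::vector<std::thread>          workers;
    std::queue<std::function<void()>> tasks;
    std::mutex                        mutex;
    std::condition_variable           cv;
    bool                              stop = false;
};
```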
Actually, I believe that's an increase of the thread count, because of the way the code was computing nthreads - going from 4x8 (32 total) to 4x10 (40 total). I think I'll change it to divide the command-line parameter too to avoid surprises. Having all of these extra threads only makes sense because of I/O overhead - one thread can still be doing work while another is blocked on a page fault. Asynchronous I/O would be nice, but that's easier said than done in the Unix world...
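(A rough illustration of the arithmetic being described - the variable names and the 4-way tensor parallelism are assumptions for the example, not the PR's exact code.)

```cpp
// Hypothetical sketch of splitting a requested thread count between
// tensor-level parallelism and per-tensor quantization threads.
#include <algorithm>

int n_requested        = 8;  // e.g. the thread count from the command line
int n_parallel_tensors = 4;  // tensors quantized concurrently (assumed)

// Multiplying instead of dividing is what produced 4 x 8 = 32 (or 4 x 10 = 40)
// threads in total; dividing the command-line value keeps the total near the
// requested count.
int n_per_tensor = std::max(1, n_requested / n_parallel_tensors);
int n_total      = n_parallel_tensors * n_per_tensor;
```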
Oops, sorry. This was 13B. For 30B and …
I suppose I'll have to do some tuning on different platforms. It seems like what applies to Linux on amd64 may not apply to macOS on aarch64, or Windows for that matter. One thing you could try is clearing your disk cache before each run - that's where the extra parallelization should help. Unfortunately, I don't have enough RAM to experiment with anything much larger than a 7B fp16 on ramfs. The choppiness you're seeing is I/O overhead - you don't notice it so much when it's processing one tensor at a time. Buffering would probably improve this.
Force-pushed from 96c8042 to 46d1b6e
@ikawrakow I removed the changes that I think may have a tradeoff with some platforms or system configurations - mmap is now disabled and the threading is back to normal. Is it still slower than master for you?
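(For readers following the mmap discussion: on POSIX systems, mapping the source model read-only looks roughly like the sketch below. This is only an illustrative example, not llama.cpp's loader code, and `map_file_readonly` is a made-up name.)

```cpp
// Illustrative sketch of a read-only file mapping on POSIX (not the PR's code).
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

static void * map_file_readonly(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return nullptr;
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return nullptr;
    }
    void * addr = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) {
        return nullptr;
    }
    *size_out = (size_t) st.st_size;
    return addr; // release later with munmap(addr, *size_out)
}
```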
With the latest version (46d1b6e) it behaves better. I get 28 seconds for 30B and …. For 30B and …
Oh, and if I disable quant histogram collection, then for …
I pushed a very simple modification to the quantization to https://github.com/ggerganov/llama.cpp/tree/ik/quantize_faster. It just changes two things:
With quant histogram collection enabled, your PR is about the same as https://github.com/ggerganov/llama.cpp/tree/ik/quantize_faster on my M2 Max, but slower on a Ryzen 7950X (e.g., 43.7 seconds vs 31.9 seconds for 30B and …).
… where the last argument …
7B source on ramfs, dest is /dev/null
33B source on NVMe, dest on NVMe, deleted after each run
"Old PR" is commit 96c8042. All options are faster than master for me, so for the time being I'd be happy with any of them. I wouldn't want to hurt performance for anyone. |
Quantization time in seconds on M2 Max and Ryzen 7950X for …
So, mmap does help this PR on Linux, making it ~14% faster than the one-line change that is labeled "ikawrakow". But using mmap on my 64 GB RAM M2 Max is a disaster (I guess @ggerganov does not have the issue with 192 GB of RAM on his M2 Ultra). With mmap disabled, the one-line change clearly wins. If the consensus is still somehow that the +241-139 lines change in this PR is better compared to the +1-1 line change, then let's at least add the ability to disable/enable quant histogram collection as in the https://github.com/ggerganov/llama.cpp/tree/ik/quantize_faster branch. I never look at these histograms (but whoever wants to look at them will still be able to) and, as can be seen in the last column of the above table, disabling quant histogram collection brings a significant speedup (well, at least when it comes to quantizing …).
I wish I could understand what makes this PR slower on your 7950X. Would you mind stepping through the commits one by one, disabling use_mmap on the commits where it is true, and telling me which one brings the regression relative to master? I'm all for avoiding unnecessary code changes, but I don't see why we should ignore any obvious opportunities for optimization, either. Also, this PR is only +129-39 unless you count commit fb2bf51, which is mainly for readability. And 82 of those 129 lines are a highly reusable threadpool.h that I wish were a built-in C++ feature in the first place - like it is in Python. If those changes don't belong here, I can certainly make separate PRs for them.
And I wish I could understand what makes master so amazingly slow on your 6-core, 12-thread Ryzen 5 3600 system. On the Ryzen 7950X the time needed to do the … In any case, given that it is all I/O bound (for …), … In any case, if one wants to optimize beyond the one-line change that moves the …
Btw, there was a bug in my implementation of disabling quant histogram collection that led to no quantized data being written. This was the actual mechanism for the speedup I observed and commented about above. When this bug is corrected, quant histogram collection costs basically no time.
Don't mean to interrupt the discussion, just want to make a proposal to merge the following 3 commits in order to get the larger part of the improvement:
The rest of the changes, including the …
I would be glad to merge those 3 commits if ikawrakow thinks they are OK.
Force-pushed from 2d0e4b9 to 76a0b6e
Merging these 3 commits is OK, although I see zero benefit from e0680ac. Tested on M2 Max and Ryzen 7950X. Here is what I get with a warm start: …
On my machine, I see a consistent drop from 7.35 s to 7.00 s for 7B on ramfs, written as Q4_0 to /dev/null, averaged over 10 runs each. That's about a 5% difference - it was a bigger change before because the original measurement was with mmap enabled.
Force-pushed from 76a0b6e to f727ad5
Tested on both ramfs and NVMe-backed btrfs. My CPU is a 6-core, 12-thread Ryzen 5 3600.
(These numbers are out of date due to some changes being removed; see the discussion. The final speedup for this PR is about 2.7x at best.)