
llama : make quantize example up to 2.7x faster #3115


Merged: 3 commits into ggml-org:master from faster-quantize on Sep 15, 2023

Conversation

@cebtenzzre (Collaborator) commented Sep 10, 2023:

Tested on both ramfs and NVMe-backed btrfs. My CPU is a 6-core, 12-thread Ryzen 5 3600.

(These numbers are out of date due to some changes being removed, see the discussion. The final speedup for this PR is about 2.7x at best.)

| model | format | cache | master | PR | speedup |
|---|---|---|---|---|---|
| 7B | Q4_0 | ramfs | 19054.03 ms | 5176.92 ms | 3.68 |
| 33B | Q4_0 | cold | 127157.88 ms | 44832.70 ms | 2.84 |
| 33B | Q4_0 | warm | 126347.77 ms | 44401.31 ms | 2.85 |

@cebtenzzre (Collaborator, Author):

Oops, I forgot that Windows doesn't have pread.

@cebtenzzre cebtenzzre marked this pull request as draft September 10, 2023 23:49
@cebtenzzre cebtenzzre marked this pull request as ready for review September 11, 2023 00:24
@ggerganov (Member) left a comment:


What is the reason for this to be faster? I doubt it's the thread pool.

Edit: nvm, just saw this information is in the individual commits

@ggerganov (Member) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this black magic !?!?

fast-quantize-0.mp4

Edit: using 10 out of 16 threads, the time goes down to 1.5 s

Can we avoid the threadpool.h stuff?
I don't think it brings much to the table in terms of performance, and it introduces too many C++-isms that I don't like.

Also - where are the histograms?

@ggerganov ggerganov added the high priority Very important issue label Sep 11, 2023
@bobqianic (Contributor):

What is this black magic !?!?

https://github.com/ggerganov/llama.cpp/blob/f31b6f4e2d6def3c0bd7c75f75c0c1e8698e0589/llama.cpp#L4950

I think the primary improvement is the relocation of f32_conv_buf. Previously, f32_conv_buf was inside the loop, but now it's outside the loop. This way, there's no need to continuously allocate and deallocate memory.
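A minimal, self-contained sketch of that change (illustrative names, not the actual llama.cpp code): the conversion buffer is declared once before the loop and reused, instead of being allocated and freed for every tensor.

```cpp
#include <cstddef>
#include <vector>

struct tensor_info { size_t n_elements; /* type, data pointers, ... */ };

void quantize_all(const std::vector<tensor_info> & tensors) {
    std::vector<float> f32_conv_buf;               // hoisted out of the per-tensor loop
    for (const tensor_info & t : tensors) {
        if (f32_conv_buf.size() < t.n_elements) {
            f32_conv_buf.resize(t.n_elements);     // grows a few times at most, then reused
        }
        // ... convert f16 -> f32 into f32_conv_buf, then quantize from it ...
    }
}   // the buffer is freed once, after all tensors are processed
```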

@ikawrakow (Contributor):

This PR makes expensive quantizations (k_quants) slower. E.g., for Q4_K_S I get 47.4 seconds with this PR vs 43.4 seconds on master. When it comes to the already fast quantizations (Q4_0, Q4_1), I'm not really bothered by having to wait 5 seconds (instead of 2.5 seconds with this PR) for the quantization to finish.

@cebtenzzre (Collaborator, Author) commented Sep 11, 2023:

This PR makes expensive quantizations (k_quants) slower.

Could you try commit 97563d3? I think the default threading strategy I came up with may not be ideal in many configurations.

edit: 5 seconds? Quantizing 33B to Q4_0 on master takes two minutes on my hardware. Am I doing something wrong?

@cebtenzzre (Collaborator, Author):

Can we avoid the threadpool.h stuff?

I wanted a thread pool because trying to attach a debugger to quantize produces quite a lot of spam from all the thousands of threads being created - and trying to trace the execution of that mess is probably not fun. I used C++11 because it seemed like the most elegant solution, but I could remove the dependency on std::future and std::packaged_task if you'd like.
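For illustration, a minimal C++11 thread pool along these lines, built on std::packaged_task and std::future. This is only a sketch of the approach, not the PR's threadpool.h:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class thread_pool {
public:
    explicit thread_pool(size_t n) {
        for (size_t i = 0; i < n; i++) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();   // run the job outside the lock
                }
            });
        }
    }

    // submit a job and get a future for its result
    template <typename F>
    auto submit(F f) -> std::future<decltype(f())> {
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        std::future<decltype(f())> fut = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push([task] { (*task)(); });
        }
        cv.notify_one();
        return fut;
    }

    ~thread_pool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (std::thread & w : workers) w.join();
    }

private:
    std::vector<std::thread>          workers;
    std::queue<std::function<void()>> tasks;
    std::mutex                        mtx;
    std::condition_variable           cv;
    bool                              stop = false;
};
```

Usage would look like `thread_pool pool(8); auto fut = pool.submit([]{ return 42; }); int x = fut.get();`.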

@cebtenzzre (Collaborator, Author):

Edit: using 10 out of 16 threads time goes down to 1.5s

Actually, I believe that's an increase in the thread count, because of the way the code was computing nthreads: going from 4x8 (32 total) to 4x10 (40 total). I think I'll change it to divide the command-line parameter as well, to avoid surprises.
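For what it's worth, a hypothetical illustration of dividing the command-line thread count between the two levels (the constant and names are made up, not the PR's actual code):

```cpp
#include <algorithm>

// split the user-supplied thread count between concurrently processed tensors
// and per-tensor workers, so the total never exceeds what was asked for
int total_threads(int nthread_cli) {
    const int n_outer = 4;                                   // tensors quantized concurrently (assumed)
    const int n_inner = std::max(1, nthread_cli / n_outer);  // workers per tensor, derived from the CLI value
    return n_outer * n_inner;                                // e.g. 8 -> 4x2 = 8, instead of 4x8 = 32
}
```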

Having all of these extra threads only makes sense because of I/O overhead - one thread can still be doing work while another is blocked on a page fault. Asynchronous I/O would be nice, but that's easier said than done in the Unix world...

@ikawrakow (Contributor) commented Sep 11, 2023:

edit: 5 seconds? Quantizing 33B to Q4_0 on master takes two minutes on my hardware. Am I doing something wrong?

The numbers I gave in the comment were for 7B. For 33B I have 10 seconds (master) vs 4.2 seconds (PR) for Q4_0, and 83 seconds (master) vs 97 seconds (PR) for Q4_K_S. This is on M2 Max.

Oops, sorry, those numbers were actually for 13B. For 30B and Q4_0 I get 35.7 seconds on master and ~60 seconds with this PR. It varies from run to run and behaves strangely: it runs fast and smoothly for some layers, then becomes choppy, then becomes faster again.

@cebtenzzre (Collaborator, Author) commented Sep 11, 2023:

For 30B and Q4_0 I have 35.7 seconds on master and ~60 seconds with this PR.

I suppose I'll have to do some tuning on different platforms. It seems like what applies to Linux on amd64 may not apply to macOS on aarch64, or Windows for that matter. One thing you could try is clearing your disk cache before each run - that's where the extra parallelization should help. Unfortunately, I don't have enough RAM to experiment with anything much larger than a 7B fp16 on ramfs.

The choppiness you're seeing is I/O overhead - you don't notice it so much when it's processing one tensor at a time. Buffering would probably improve this.

@cebtenzzre (Collaborator, Author):

@ikawrakow I removed the changes that I think may have a tradeoff with some platforms or system configurations - mmap is now disabled and the threading is back to normal. Is it still slower than master for you?

@ikawrakow (Contributor):

With the latest version (46d1b6e) it behaves better. I get 28 seconds for 30B and Q4_0 without it being choppy as before. But I get the exact same 28 seconds with the master version by moving f32_conv_buf out of the loop over tensors as @bobqianic suggested above.

For 30B and Q4_K_S I get 219.4 seconds with this PR vs 215.7 seconds on master with f32_conv_buf moved out of the loop.

@ikawrakow (Contributor) commented Sep 12, 2023:

Oh, and if I disable quant histogram collection, then for Q4_0 I'm down to 2.7 seconds for 7B and 16 seconds for 30B. This is on master with f32_conv_buf moved out of the loop.

@ikawrakow (Contributor):

@cebtenzzre

I pushed a very simple modification to the quantization to https://github.com/ggerganov/llama.cpp/tree/ik/quantize_faster. It just changes two things:

  • Moved f32_conv_buf out of the loop over tensors
  • Added the ability to enable/disable quant histogram collection.

With quant histogram collection enabled, your PR is about the same as ik/quantize_faster on my M2 Max, but slower on a Ryzen 7950X (e.g., 43.7 seconds vs 31.9 seconds for 30B and Q4_0). With quant histogram collection disabled, ik/quantize_faster is faster than your PR by a large margin. I'm curious how ik/quantize_faster performs on your system. The command line is

./quantize $model $output_file $quant_type $num_threads $enable_histo

where the last argument $enable_histo is 0 or 1, with 0 disabling and 1 enabling histogram collection.
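For context, a rough sketch of what gating the histogram accumulation could look like (purely illustrative; the actual branch may implement this differently):

```cpp
#include <cstdint>
#include <vector>

// accumulate the per-tensor histogram into the global one only when enabled
void accumulate_histogram(bool collect_histo,
                          const std::vector<int64_t> & hist_cur,
                          std::vector<int64_t>       & hist_all) {
    if (!collect_histo) {
        return;                       // skip the per-tensor accumulation entirely
    }
    for (size_t i = 0; i < hist_cur.size() && i < hist_all.size(); i++) {
        hist_all[i] += hist_cur[i];   // usual per-bucket accumulation
    }
}
```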

@cebtenzzre (Collaborator, Author) commented Sep 12, 2023:

7B source on ramfs, dest is /dev/null

| model | quant | master | old PR | PR | PR+mmap | ikawrakow |
|---|---|---|---|---|---|---|
| 7B | Q4_0 | 20.45 s | 5.33 s | 6.63 s | 4.74 s | 7.86 s |
| 7B | Q4_K_S | 83.28 s | 67.00 s | 69.02 s | 67.82 s | 69.72 s |

33B source on NVMe, dest on NVMe, deleted after each run
cache is cleared before each run with sync; echo 3 > /proc/sys/vm/drop_caches

| model | quant | master | old PR | PR | PR+mmap | ikawrakow |
|---|---|---|---|---|---|---|
| 33B | Q4_0 | 153.29 s | 63.11 s | 93.98 s | 82.22 s | 94.78 s |
| 33B | Q4_K_S | 445.30 s | 367.33 s | 385.64 s | 368.34 s | 384.47 s |

"Old PR" is commit 96c8042.
"PR" is commit 46d1b6e.
"PR+mmap" is 46d1b6e with use_mmap set to true.
"ikawrakow" is commit da030ed with histograms enabled for comparison's sake.
"Old PR" and "PR+mmap" are tied for k-quants because they are essentially doing the same thing.

All options are faster than master for me, so for the time being I'd be happy with any of them. I wouldn't want to hurt performance for anyone.

@ikawrakow (Contributor):

Quantization time in seconds on M2 Max and Ryzen 7950X for Q4_0:

| model, platform | This PR | This PR + mmap | ikawrakow | ikawrakow, no histo |
|---|---|---|---|---|
| 33B, M2 Max | 28.0 | 66.3 | 28.0 | 16.0 |
| 33B, 7950X | 43.7 | 27.9 | 31.9 | 26.9 |

So, mmap does help this PR on Linux, making it ~14% faster than the one-line change that is labeled "ikawrakow".

But using mmap on my 64 GB RAM M2 Max is a disaster (I guess, @ggerganov does not have the issue with 192 GB of RAM on his M2 Ultra). With mmap disabled, the one-line change clearly wins.

If the consensus is still somehow that the +241/−139 line change in this PR is better than the +1/−1 line change, then let's at least add the ability to enable/disable quant histogram collection, as in the https://github.com/ggerganov/llama.cpp/tree/ik/quantize_faster branch. I never look at these histograms (but whoever wants to look at them will still be able to), and, as can be seen in the last column of the table above, disabling quant histogram collection brings a significant speedup (well, at least when it comes to quantizing Q4_0, Q4_1, Q5_0, Q5_1, Q8_0).

@cebtenzzre (Collaborator, Author) commented Sep 13, 2023:

Quantization time in seconds on M2 Max and Ryzen 7950X for Q4_0:

I wish I could understand what makes this PR slower on your 7950X. Would you mind stepping through the commits one-by-one, disabling use_mmap on the commits where it is true, and telling me which one brings the regression relative to master?

I'm all for avoiding unnecessary code changes, but I don't see why we should ignore any obvious opportunities for optimization, either.

Also, this PR is only +129/−39 unless you count commit fb2bf51, which is mainly for readability. And 82 of those 129 lines are a highly reusable threadpool.h that I wish were a built-in C++ feature in the first place, like it is in Python. If those changes don't belong here, I can certainly make separate PRs for them.

@ikawrakow (Contributor) commented Sep 14, 2023:

I wish I could understand what makes this PR slower on your 7950X.

And I wish I could understand what makes master so amazingly slow on your 6-core, 12-thread Ryzen 5 3600 system. On the Ryzen 7950X the time needed to do the f16 -> f32 conversion plus quantization is only about 5 seconds for 33B and Q4_0 (one can optimize that down to 2.6 seconds; see what I have on my branch, where I have combined the f16 -> f32 conversion and quantization). Everything else is I/O. Hence it is really hard to understand how you can get 125 seconds, as that would mean that, on the master branch, a Ryzen 5 3600 is somehow 20x slower than a Ryzen 7950X.

In any case, given that it is all I/O bound (for Q4_0; Q4_K_S is a different story), building thread pools and such cannot help much, if it helps at all. Starting a thread on Linux costs about the same as waking up from a wait condition (~1 µs). The cost of starting 32 threads for each of the 420 tensors (needed for 33B quantization) is ~15 ms, so really nothing compared to the overall run time. On the other hand, sometimes you get unlucky with these wait conditions and only get notified in the next kernel slice, which is a 5 ms delay.

You can do a simple experiment: add two time stamps to your tasks, one taken just before the task is pushed into the queue, the other immediately when the task is extracted from the queue for processing. If you do some statistics on the timestamp difference, you will see that most of the time it is in the range of ~1 µs, but occasionally it becomes ~5 ms. If you let it run for a while, you will find the occasional 10 ms and even 15 ms delay. Now put your system under heavy I/O load and observe how the frequency of the 5 and 10 ms delays increases. Your thread pool needs two mutex locks and two wake-ups from wait conditions to get a task going, so I'm not surprised that the thread pool can end up slower than just starting new threads for each tensor.

Investigating this as you suggest (going commit by commit and looking at how the timing changes) is not an easy task, as run times fluctuate quite a bit (even when one drops disk caches before each run, which is not a realistic use case anyway: as a user, I don't want to be dropping caches each time I want to run a quantization).
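A rough sketch of that timing experiment (illustrative code, not part of the PR): record when a task is enqueued and when a worker picks it up, then look at the distribution of the differences.

```cpp
#include <chrono>
#include <cstdio>

using steady = std::chrono::steady_clock;

struct timed_task {
    steady::time_point enqueued_at;   // set just before the task is pushed into the queue
    // ... the actual work payload would go here ...
};

// call this in the worker, right after popping the task from the queue
void record_queue_latency(const timed_task & t) {
    const auto picked_up_at = steady::now();
    const auto delay_us = std::chrono::duration_cast<std::chrono::microseconds>(
                              picked_up_at - t.enqueued_at).count();
    // most samples should be ~1 µs; under heavy I/O load you should see
    // occasional 5/10/15 ms outliers (whole scheduler slices)
    std::printf("queue delay: %lld us\n", (long long) delay_us);
}
```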

In any case, if one wants to optimize beyond the one-line change that moves the f32_conv_buf vector out of the loop over tensors, it is the I/O part one needs to focus on. Using mmap is a winner on Linux independently of the model size. On macOS, if the model is small enough to fit completely in RAM, mmap is a winner too; but once the model size becomes comparable to, or exceeds, the available RAM (a 33B model on my 64 GB M2 Max), using mmap is a total disaster. Apart from mmap, what else one can do is run reads and writes on separate threads, so computation and I/O proceed in parallel instead of waiting for each other as they do now (a rough sketch of the idea follows the table below). To do that properly, the llama_model_quantize_internal() function needs to be completely rewritten, but one can achieve quite a bit with a relatively minor change; see what I have pushed to my branch. With these changes, I get the following quantization times in seconds for Q4_0:

| Model | M2 Max | Ryzen 7950X | M2 Max, mmap | Ryzen 7950X, mmap |
|---|---|---|---|---|
| 7B | 3.0 | 3.9 | 2.8 | 3.7 |
| 13B | 5.4 | 7.2 | 5.2 | 6.8 |
| 33B | 17.2 | 31.0 | 59.1 | 27.0 |
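A very rough sketch of the read/compute overlap idea mentioned above, keeping one read in flight while the current tensor is being quantized (illustrative names and stub functions, not the code on either branch):

```cpp
#include <future>
#include <vector>

struct tensor_data { std::vector<char> bytes; };

// stand-ins for the real per-tensor read and quantize steps
static tensor_data read_tensor(int /*idx*/)             { return tensor_data{std::vector<char>(1 << 20)}; }
static void        quantize_tensor(const tensor_data &) { /* convert f16 -> f32, quantize, write */ }

void quantize_model(int n_tensors) {
    // prefetch the first tensor, then always keep one read in flight
    std::future<tensor_data> next = std::async(std::launch::async, read_tensor, 0);
    for (int i = 0; i < n_tensors; i++) {
        tensor_data cur = next.get();                                    // wait for the prefetched read
        if (i + 1 < n_tensors) {
            next = std::async(std::launch::async, read_tensor, i + 1);   // kick off the next read
        }
        quantize_tensor(cur);                                            // compute overlaps the next read
    }
}
```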

Btw, there was a bug in my implementation of disabling quant histogram collection that led to no quantized data being written. That was the actual mechanism for the speedup I observed and commented on above. With the bug fixed, quant histogram collection costs basically no time.

@ggerganov (Member):

I don't mean to interrupt the discussion, just want to propose merging the following 3 commits in order to get the larger part of the improvement:

The rest of the changes, including the ik/quantize_faster branch, can go into separate PRs where we can discuss whether we really need them. I'm mostly against introducing the thread pool and don't have an opinion about mmap yet.

@cebtenzzre (Collaborator, Author):

just want to make a proposal to merge the following 3 commits

I would be glad to merge those 3 commits if ikawrakow thinks they are OK.

@ikawrakow (Contributor):

Merging these 3 commits is OK, although I see zero benefit from e0680ac. Tested on M2 Max and Ryzen 7950X. Here is what I get with a warm start:

| Case | PR | PR without e0680ac |
|---|---|---|
| M2 Max, 7B | 4.95 s | 5.05 s |
| 7950X, 7B | 5.83 s | 5.93 s |
| M2 Max, 13B | 9.25 s | 9.45 s |
| 7950X, 13B | 11.27 s | 10.55 s |
| M2 Max, 33B | 28.03 s | 28.08 s |
| 7950X, 33B | 41.73 s | 41.24 s |

@cebtenzzre (Collaborator, Author):

I see zero benefit from e0680ac

On my machine, I see a consistent drop from 7.35 s to 7.00 s for 7B on ramfs, written as Q4_0 to /dev/null, averaged over 10 runs each. That's about a 5% difference; it was a bigger change before because the original measurement was with mmap enabled.

@cebtenzzre cebtenzzre changed the title from "llama : make quantize example up to 3.7x faster" to "llama : make quantize example up to 2.7x faster" Sep 15, 2023
@cebtenzzre cebtenzzre merged commit 98311c4 into ggml-org:master Sep 15, 2023
@cebtenzzre cebtenzzre deleted the faster-quantize branch September 15, 2023 01:09
@cebtenzzre cebtenzzre restored the faster-quantize branch September 15, 2023 01:10
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023