
llama : make quantize example up to 2.7x faster #3115


Merged: 3 commits into ggml-org:master from faster-quantize on Sep 15, 2023

Conversation

@cebtenzzre (Collaborator) commented Sep 10, 2023:

Tested on both ramfs and NVMe-backed btrfs. My CPU is a 6-core, 12-thread Ryzen 5 3600.

(These numbers are out of date due to some changes being removed, see the discussion. The final speedup for this PR is about 2.7x at best.)

| model | format | cache | master | PR | speedup |
|---|---|---|---|---|---|
| 7B | Q4_0 | ramfs | 19054.03 ms | 5176.92 ms | 3.68 |
| 33B | Q4_0 | cold | 127157.88 ms | 44832.70 ms | 2.84 |
| 33B | Q4_0 | warm | 126347.77 ms | 44401.31 ms | 2.85 |

@cebtenzzre (Collaborator, Author):

Oops, I forgot that Windows doesn't have pread.

@cebtenzzre cebtenzzre marked this pull request as draft September 10, 2023 23:49
@cebtenzzre cebtenzzre marked this pull request as ready for review September 11, 2023 00:24
@ggerganov (Member) left a comment:


What is the reason for this to be faster? I doubt it's the thread pool.

Edit: nvm, just saw this information is in the individual commits

@ggerganov (Member) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this black magic !?!?

fast-quantize-0.mp4

Edit: using 10 out of 16 threads, the time goes down to 1.5 s

Can we avoid the threadpool.h stuff?
I don't think it brings much to the table in terms of performance, and it introduces too many C++-isms that I don't like.

Also - where are the histograms?

@ggerganov ggerganov added the high priority Very important issue label Sep 11, 2023
@bobqianic (Contributor):

What is this black magic !?!?

https://github.com/ggerganov/llama.cpp/blob/f31b6f4e2d6def3c0bd7c75f75c0c1e8698e0589/llama.cpp#L4950

I think the primary improvement is the relocation of f32_conv_buf. Previously, f32_conv_buf was inside the loop, but now it's outside the loop. This way, there's no need to continuously allocate and deallocate memory.
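A minimal, self-contained sketch of that change (illustrative names, not the actual llama.cpp code): the conversion buffer is declared once before the loop and reused, instead of being allocated and freed for every tensor.

```cpp
#include <cstddef>
#include <vector>

struct tensor_info { size_t n_elements; /* type, data pointers, ... */ };

void quantize_all(const std::vector<tensor_info> & tensors) {
    std::vector<float> f32_conv_buf;               // hoisted out of the per-tensor loop
    for (const tensor_info & t : tensors) {
        if (f32_conv_buf.size() < t.n_elements) {
            f32_conv_buf.resize(t.n_elements);     // grows a few times at most, then reused
        }
        // ... convert f16 -> f32 into f32_conv_buf, then quantize from it ...
    }
}   // the buffer is freed once, after all tensors are processed
```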

@ikawrakow (Contributor):

This PR makes expensive quantizations (k_quants) slower. E.g., for Q4_K_S I get 47.4 seconds with this PR vs 43.4 seconds on master. When it comes to the already fast quantizations (Q4_0, Q4_1), I'm not really bothered by having to wait 5 seconds (instead of 2.5 seconds with this PR) for the quantization to finish.

@cebtenzzre (Collaborator, Author) commented Sep 11, 2023:

This PR makes expensive quantizations (k_quants) slower.

Could you try commit 97563d3? I think the default threading strategy I came up with may not be ideal in many configurations.

edit: 5 seconds? Quantizing 33B to Q4_0 on master takes two minutes on my hardware. Am I doing something wrong?

@cebtenzzre (Collaborator, Author):

Can we avoid the threadpool.h stuff?

I wanted a thread pool because trying to attach a debugger to quantize produces quite a lot of spam from all the thousands of threads being created - and trying to trace the execution of that mess is probably not fun. I used C++11 because it seemed like the most elegant solution, but I could remove the dependency on std::future and std::packaged_task if you'd like.
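For illustration, a minimal C++11 thread pool along these lines, built on std::packaged_task and std::future. This is only a sketch of the approach, not the PR's threadpool.h:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class thread_pool {
public:
    explicit thread_pool(size_t n) {
        for (size_t i = 0; i < n; i++) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();   // run the job outside the lock
                }
            });
        }
    }

    // submit a job and get a future for its result
    template <typename F>
    auto submit(F f) -> std::future<decltype(f())> {
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        std::future<decltype(f())> fut = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push([task] { (*task)(); });
        }
        cv.notify_one();
        return fut;
    }

    ~thread_pool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (std::thread & w : workers) w.join();
    }

private:
    std::vector<std::thread>          workers;
    std::queue<std::function<void()>> tasks;
    std::mutex                        mtx;
    std::condition_variable           cv;
    bool                              stop = false;
};
```

Usage would look like `thread_pool pool(8); auto fut = pool.submit([]{ return 42; }); int x = fut.get();`.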

@cebtenzzre (Collaborator, Author):

Edit: using 10 out of 16 threads time goes down to 1.5s

Actually, I believe that's an increase in the thread count, because of the way the code was computing nthreads: going from 4x8 (32 total) to 4x10 (40 total). I think I'll change it to divide the command-line parameter as well, to avoid surprises.
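For what it's worth, a hypothetical illustration of dividing the command-line thread count between the two levels (the constant and names are made up, not the PR's actual code):

```cpp
#include <algorithm>

// split the user-supplied thread count between concurrently processed tensors
// and per-tensor workers, so the total never exceeds what was asked for
int total_threads(int nthread_cli) {
    const int n_outer = 4;                                   // tensors quantized concurrently (assumed)
    const int n_inner = std::max(1, nthread_cli / n_outer);  // workers per tensor, derived from the CLI value
    return n_outer * n_inner;                                // e.g. 8 -> 4x2 = 8, instead of 4x8 = 32
}
```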

Having all of these extra threads only makes sense because of I/O overhead - one thread can still be doing work while another is blocked on a page fault. Asynchronous I/O would be nice, but that's easier said than done in the Unix world...

@ikawrakow (Contributor) commented Sep 11, 2023:

edit: 5 seconds? Quantizing 33B to Q4_0 on master takes two minutes on my hardware. Am I doing something wrong?

The numbers I gave in the comment were for 7B. For 33B I have 10 seconds (master) vs 4.2 seconds (PR) for Q4_0, and 83 seconds (master) vs 97 seconds (PR) for Q4_K_S. This is on M2 Max.

Oops, sorry, those numbers were actually for 13B. For 30B and Q4_0 I get 35.7 seconds on master and ~60 seconds with this PR. It varies from run to run and behaves strangely: it runs fast and smoothly for some layers, then becomes choppy, then becomes faster again.

@cebtenzzre (Collaborator, Author) commented Sep 11, 2023:

For 30B and Q4_0 I have 35.7 seconds on master and ~60 seconds with this PR.

I suppose I'll have to do some tuning on different platforms. It seems like what applies to Linux on amd64 may not apply to macOS on aarch64, or Windows for that matter. One thing you could try is clearing your disk cache before each run - that's where the extra parallelization should help. Unfortunately, I don't have enough RAM to experiment with anything much larger than a 7B fp16 on ramfs.

The choppiness you're seeing is I/O overhead - you don't notice it so much when it's processing one tensor at a time. Buffering would probably improve this.

@cebtenzzre (Collaborator, Author):

@ikawrakow I removed the changes that I think may have a tradeoff with some platforms or system configurations - mmap is now disabled and the threading is back to normal. Is it still slower than master for you?

@ikawrakow (Contributor):

With the latest version (46d1b6e) it behaves better. I get 28 seconds for 30B and Q4_0 without it being choppy as before. But I get the exact same 28 seconds with the master version by moving f32_conv_buf out of the loop over tensors as @bobqianic suggested above.

For 30B and Q4_K_S I get 219.4 seconds with this PR vs 215.7 seconds on master with f32_conv_buf moved out of the loop.

@ikawrakow (Contributor) commented Sep 12, 2023:

Oh, and if I disable quant histogram collection, then for Q4_0 I'm down to 2.7 seconds for 7B and 16 seconds for 30B. This is on master with f32_conv_buf moved out of the loop.

@ikawrakow (Contributor):

@cebtenzzre

I pushed a very simple modification to the quantization to https://github.com/ggerganov/llama.cpp/tree/ik/quantize_faster. It just changes two things:

  • Moved f32_conv_buf out of the loop over tensors
  • Added the ability to enable/disable quant histogram collection.

With quant histogram collection enabled, your PR is about the same as ik/quantize_faster on my M2 Max, but slower on a Ryzen 7950X (e.g., 43.7 seconds vs 31.9 seconds for 30B and Q4_0). With quant histogram collection disabled, ik/quantize_faster is faster than your PR by a large margin. I'm curious how ik/quantize_faster performs on your system. The command line is

./quantize $model $output_file $quant_type $num_threads $enable_histo

where the last argument $enable_histo is 0 or 1, with 0 disabling and 1 enabling histogram collection.
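For context, a rough sketch of what gating the histogram accumulation could look like (purely illustrative; the actual branch may implement this differently):

```cpp
#include <cstdint>
#include <vector>

// accumulate the per-tensor histogram into the global one only when enabled
void accumulate_histogram(bool collect_histo,
                          const std::vector<int64_t> & hist_cur,
                          std::vector<int64_t>       & hist_all) {
    if (!collect_histo) {
        return;                       // skip the per-tensor accumulation entirely
    }
    for (size_t i = 0; i < hist_cur.size() && i < hist_all.size(); i++) {
        hist_all[i] += hist_cur[i];   // usual per-bucket accumulation
    }
}
```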

@cebtenzzre (Collaborator, Author) commented Sep 12, 2023:

7B source on ramfs, dest is /dev/null

| model | quant | master | old PR | PR | PR+mmap | ikawrakow |
|---|---|---|---|---|---|---|
| 7B | Q4_0 | 20.45 s | 5.33 s | 6.63 s | 4.74 s | 7.86 s |
| 7B | Q4_K_S | 83.28 s | 67.00 s | 69.02 s | 67.82 s | 69.72 s |

33B source on NVMe, dest on NVMe, deleted after each run
cache is cleared before each run with sync; echo 3 > /proc/sys/vm/drop_caches

| model | quant | master | old PR | PR | PR+mmap | ikawrakow |
|---|---|---|---|---|---|---|
| 33B | Q4_0 | 153.29 s | 63.11 s | 93.98 s | 82.22 s | 94.78 s |
| 33B | Q4_K_S | 445.30 s | 367.33 s | 385.64 s | 368.34 s | 384.47 s |

"Old PR" is commit 96c8042.
"PR" is commit 46d1b6e.
"PR+mmap" is 46d1b6e with use_mmap set to true.
"ikawrakow" is commit da030ed with histograms enabled for comparison's sake.
"Old PR" and "PR+mmap" are tied for k-quants because they are essentially doing the same thing.

All options are faster than master for me, so for the time being I'd be happy with any of them. I wouldn't want to hurt performance for anyone.

@ikawrakow (Contributor):

Quantization time in seconds on M2 Max and Ryzen 7950X for Q4_0:

| model, platform | This PR | This PR + mmap | ikawrakow | ikawrakow, no histo |
|---|---|---|---|---|
| 33B, M2 Max | 28.0 | 66.3 | 28.0 | 16.0 |
| 33B, 7950X | 43.7 | 27.9 | 31.9 | 26.9 |

So, mmap does help this PR on Linux, making it ~14% faster than the one-line change that is labeled "ikawrakow".

But using mmap on my 64 GB RAM M2 Max is a disaster (I guess, @ggerganov does not have the issue with 192 GB of RAM on his M2 Ultra). With mmap disabled, the one-line change clearly wins.

If the consensus is still somehow that the +241/−139 line change in this PR is better than the +1/−1 line change, then let's at least add the ability to enable/disable quant histogram collection, as in the https://github.com/ggerganov/llama.cpp/tree/ik/quantize_faster branch. I never look at these histograms (but whoever wants to look at them will still be able to), and, as can be seen in the last column of the table above, disabling quant histogram collection brings a significant speedup (well, at least when it comes to quantizing Q4_0, Q4_1, Q5_0, Q5_1, Q8_0).

@cebtenzzre (Collaborator, Author) commented Sep 13, 2023:

Quantization time in seconds on M2 Max and Ryzen 7950X for Q4_0:

I wish I could understand what makes this PR slower on your 7950X. Would you mind stepping through the commits one-by-one, disabling use_mmap on the commits where it is true, and telling me which one brings the regression relative to master?

I'm all for avoiding unnecessary code changes, but I don't see why we should ignore any obvious opportunities for optimization, either.

Also, this PR is only +129/−39 unless you count commit fb2bf51, which is mainly for readability. And 82 of those 129 lines are a highly reusable threadpool.h that I wish were a built-in C++ feature in the first place, like it is in Python. If those changes don't belong here, I can certainly make separate PRs for them.

@ikawrakow (Contributor) commented Sep 14, 2023:

I wish I could understand what makes this PR slower on your 7950X.

And I wish I could understand what makes master so amazingly slow on your 6-core, 12-thread Ryzen 5 3600 system. On the Ryzen 7950X the time needed to do the f16 -> f32 conversion plus quantization is only about 5 seconds for 33B and Q4_0 (one can optimize that down to 2.6 seconds; see what I have on my branch, where I have combined the f16 -> f32 conversion and quantization). Everything else is I/O. Hence it is really hard to understand how you can get 125 seconds, as that would mean that, on the master branch, a Ryzen 5 3600 is somehow 20x slower than a Ryzen 7950X.

In any case, given that it is all I/O bound (for Q4_0; Q4_K_S is a different story), building thread pools and such cannot help much, if it helps at all. Starting a thread on Linux costs about the same as waking up from a wait condition (~1 µs). The cost of starting 32 threads for each of the 420 tensors (needed for 33B quantization) is ~15 ms, so really nothing compared to the overall run time. On the other hand, sometimes you get unlucky with these wait conditions and only get notified in the next kernel slice, which is a 5 ms delay.

You can do a simple experiment: add two time stamps to your tasks, one taken just before the task is pushed into the queue, the other immediately when the task is extracted from the queue for processing. If you do some statistics on the timestamp difference, you will see that most of the time it is in the range of ~1 µs, but occasionally it becomes ~5 ms. If you let it run for a while, you will find the occasional 10 ms and even 15 ms delay. Now put your system under heavy I/O load and observe how the frequency of the 5 and 10 ms delays increases. Your thread pool needs two mutex locks and two wake-ups from wait conditions to get a task going, so I'm not surprised that the thread pool can end up slower than just starting new threads for each tensor.

Investigating this as you suggest (going commit by commit and looking at how the timing changes) is not an easy task, as run times fluctuate quite a bit (even when one drops disk caches before each run, which is not a realistic use case anyway: as a user, I don't want to be dropping caches each time I want to run a quantization).
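A rough sketch of that timing experiment (illustrative code, not part of the PR): record when a task is enqueued and when a worker picks it up, then look at the distribution of the differences.

```cpp
#include <chrono>
#include <cstdio>

using steady = std::chrono::steady_clock;

struct timed_task {
    steady::time_point enqueued_at;   // set just before the task is pushed into the queue
    // ... the actual work payload would go here ...
};

// call this in the worker, right after popping the task from the queue
void record_queue_latency(const timed_task & t) {
    const auto picked_up_at = steady::now();
    const auto delay_us = std::chrono::duration_cast<std::chrono::microseconds>(
                              picked_up_at - t.enqueued_at).count();
    // most samples should be ~1 µs; under heavy I/O load you should see
    // occasional 5/10/15 ms outliers (whole scheduler slices)
    std::printf("queue delay: %lld us\n", (long long) delay_us);
}
```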

In any case, if one wants to optimize beyond the one-line change that moves the f32_conv_buf vector out of the loop over tensors, it is the I/O part one needs to focus on. Using mmap is a winner on Linux independently of the model size. On macOS, if the model is small enough to fit completely in RAM, mmap is a winner too; but once the model size becomes comparable to, or exceeds, the available RAM (a 33B model on my 64 GB M2 Max), using mmap is a total disaster. Apart from mmap, what else one can do is run reads and writes on separate threads, so computation and I/O proceed in parallel instead of waiting for each other as they do now (a rough sketch of the idea follows the table below). To do that properly, the llama_model_quantize_internal() function needs to be completely rewritten, but one can achieve quite a bit with a relatively minor change; see what I have pushed to my branch. With these changes, I get the following quantization times in seconds for Q4_0:

| Model | M2 Max | Ryzen 7950X | M2 Max, mmap | Ryzen 7950X, mmap |
|---|---|---|---|---|
| 7B | 3.0 | 3.9 | 2.8 | 3.7 |
| 13B | 5.4 | 7.2 | 5.2 | 6.8 |
| 33B | 17.2 | 31.0 | 59.1 | 27.0 |
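A very rough sketch of the read/compute overlap idea mentioned above, keeping one read in flight while the current tensor is being quantized (illustrative names and stub functions, not the code on either branch):

```cpp
#include <future>
#include <vector>

struct tensor_data { std::vector<char> bytes; };

// stand-ins for the real per-tensor read and quantize steps
static tensor_data read_tensor(int /*idx*/)             { return tensor_data{std::vector<char>(1 << 20)}; }
static void        quantize_tensor(const tensor_data &) { /* convert f16 -> f32, quantize, write */ }

void quantize_model(int n_tensors) {
    // prefetch the first tensor, then always keep one read in flight
    std::future<tensor_data> next = std::async(std::launch::async, read_tensor, 0);
    for (int i = 0; i < n_tensors; i++) {
        tensor_data cur = next.get();                                    // wait for the prefetched read
        if (i + 1 < n_tensors) {
            next = std::async(std::launch::async, read_tensor, i + 1);   // kick off the next read
        }
        quantize_tensor(cur);                                            // compute overlaps the next read
    }
}
```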

Btw, there was a bug in my implementation of disabling quant histogram collection that led to no quantized data being written. That was the actual mechanism for the speedup I observed and commented on above. With the bug fixed, quant histogram collection costs basically no time.

@ggerganov (Member):

I don't mean to interrupt the discussion, just want to propose merging the following 3 commits in order to get the larger part of the improvement:

The rest of the changes, including the ik/quantize_faster branch, can go into separate PRs where we can discuss whether we really need them. I'm mostly against introducing the thread pool and don't have an opinion about mmap yet.

@cebtenzzre (Collaborator, Author):

just want to make a proposal to merge the following 3 commits

I would be glad to merge those 3 commits if ikawrakow thinks they are OK.

@ikawrakow (Contributor):

Merging these 3 commits is OK, although I see zero benefit from e0680ac. Tested on M2 Max and Ryzen 7950X. Here is what I get with a warm start:

| Case | PR | PR without e0680ac |
|---|---|---|
| M2 Max, 7B | 4.95 s | 5.05 s |
| 7950X, 7B | 5.83 s | 5.93 s |
| M2 Max, 13B | 9.25 s | 9.45 s |
| 7950X, 13B | 11.27 s | 10.55 s |
| M2 Max, 33B | 28.03 s | 28.08 s |
| 7950X, 33B | 41.73 s | 41.24 s |

@cebtenzzre (Collaborator, Author):

I see zero benefit from e0680ac

On my machine, I see a consistent drop from 7.35 s to 7.00 s for 7B on ramfs, written as Q4_0 to /dev/null, averaged over 10 runs each. That's about a 5% difference; it was a bigger change before because the original measurement was with mmap enabled.

@cebtenzzre cebtenzzre changed the title from "llama : make quantize example up to 3.7x faster" to "llama : make quantize example up to 2.7x faster" Sep 15, 2023
@cebtenzzre cebtenzzre merged commit 98311c4 into ggml-org:master Sep 15, 2023
@cebtenzzre cebtenzzre deleted the faster-quantize branch September 15, 2023 01:09
@cebtenzzre cebtenzzre restored the faster-quantize branch September 15, 2023 01:10
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023