Multi-threaded quantization #1075

ikawrakow · 2023-04-20T05:45:29Z

This PR adds multi-threading for quantization.

The gain is very minor for small models (e.g., LLaMA 7B) and simple quantization (Q4_0 and Q4_1), but very significant for large models and the now more elaborate Q4_2 quantization.
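To illustrate the general idea, here is a minimal sketch of chunked, multi-threaded quantization. It is not the code from this PR: `toy_quantize_chunk`, `quantize_parallel`, `block_size`, and the chunk size are all hypothetical names and values chosen for the example, and the toy quantizer simply rounds each 32-value block to int8 without storing the scale. Worker threads take the next unprocessed chunk under a mutex, so the result does not depend on the number of threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical block size for the toy quantizer (not taken from the PR).
constexpr std::size_t block_size = 32;

// Toy stand-in for a real quantizer: scale each block by its largest
// absolute value and round to int8. The scale itself is not stored.
void toy_quantize_chunk(const float * src, std::int8_t * dst, std::size_t n) {
    assert(n % block_size == 0);
    for (std::size_t b = 0; b < n; b += block_size) {
        float amax = 0.0f;
        for (std::size_t i = 0; i < block_size; ++i) {
            amax = std::max(amax, std::fabs(src[b + i]));
        }
        const float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
        for (std::size_t i = 0; i < block_size; ++i) {
            dst[b + i] = static_cast<std::int8_t>(std::lround(src[b + i] * inv));
        }
    }
}

// Split the input into fixed-size chunks and let nthread workers
// (including the calling thread) pull chunks until none are left.
void quantize_parallel(const std::vector<float> & src,
                       std::vector<std::int8_t> & dst,
                       int nthread) {
    constexpr std::size_t chunk_size = 32 * block_size;  // assumed value
    assert(src.size() % block_size == 0);
    dst.resize(src.size());
    const std::size_t n = src.size();

    std::mutex mutex;
    std::size_t next = 0;  // first element of the next unclaimed chunk

    auto worker = [&]() {
        for (;;) {
            std::size_t first;
            {
                // Claim the next chunk; only this bookkeeping is serialized.
                std::lock_guard<std::mutex> lock(mutex);
                first = next;
                next += chunk_size;
            }
            if (first >= n) {
                break;
            }
            const std::size_t count = std::min(chunk_size, n - first);
            toy_quantize_chunk(src.data() + first, dst.data() + first, count);
        }
    };

    std::vector<std::thread> pool;
    for (int t = 1; t < nthread; ++t) {
        pool.emplace_back(worker);
    }
    worker();  // the calling thread does its share of the work
    for (auto & th : pool) {
        th.join();
    }
}
```

Because each chunk is written to a disjoint range of `dst`, only the chunk counter needs locking. With a cheap per-chunk computation the locking and thread start-up overhead limits the gain, which is consistent with the small improvement reported for the simple quantization types.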

quantize-stats now finishes in just 14.5 seconds (7B) or 44 seconds (13B) on my computer for all 3 quantization types. The single-threaded version took 144 seconds (7B) or 242 seconds (13B).

Not much gain for simple quantizations, but it will be important for quantizations that require more CPU cycles.

It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2.

DannyDaemonic · 2023-04-20T09:41:22Z

This could make more accurate but slow quantization methods more practical. (See #835.)


After changing chunk_size to const int as suggested by @ggerganov, clang and GCC started to warn me that I don't need to capture it in the lambda. So I removed it from the capture list, but that makes the MSVC build fail. So I am making it a constexpr to keep every compiler happy.
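A minimal sketch of the capture situation described above, assuming a local constant named `chunk_size` with an illustrative value of 512 and a hypothetical helper `count_chunks` (neither is taken from the PR's code). A local `constexpr` integer that is only read for its value is not odr-used inside the lambda, so standard C++ allows it to appear in the lambda body without being listed in the capture list.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of chunks needed to cover n elements.
std::size_t count_chunks(std::size_t n) {
    constexpr std::size_t chunk_size = 512;  // illustrative value

    // Empty capture list: chunk_size is a compile-time constant whose value
    // is read directly, so it does not need to be captured.
    auto chunks_for = [](std::size_t count) {
        return (count + chunk_size - 1) / chunk_size;
    };

    return chunks_for(n);
}
```

Passing the constant to a function that takes its argument by reference (for example `std::min`) would odr-use it and require a capture again, so this pattern only works while the constant is used by value.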

prusnak · 2023-04-20T17:40:30Z

Please resolve conflicts with the master branch

Kawrakow added 2 commits April 19, 2023 20:22

Multi-threading quantization.

d2f9266

Not much gain for simple quantizations, but it will be important for quantizations that require more CPU cycles.

Multi-threading for quantize-stats

ce05fc0

It now does the job in ~14 seconds on my Mac for Q4_0, Q4_1 and Q4_2. Single-threaded it was taking more than 2 minutes after adding the more elaborate version of Q4_2.

ikawrakow requested review from sw and unbounded April 20, 2023 05:45

ggerganov approved these changes Apr 20, 2023


ggerganov added the performance Speed related topics label Apr 20, 2023

Kawrakow added 3 commits April 20, 2023 18:17

Reviewer comments

b65e559

Still fighting with lambda captures in MSVC

0ae02eb

Merge branch 'master' into multi-thread-quantize

b3545d9

ggerganov merged commit 38de86a into master Apr 20, 2023

ggerganov deleted the multi-thread-quantize branch April 20, 2023 17:42

ikawrakow mentioned this pull request Apr 21, 2023

RMSE-optimized quants for all quantization types #1106

Closed

ggerganov assigned ikawrakow Apr 22, 2023

sw mentioned this pull request Apr 22, 2023

perf: parallelize quantization #906

Closed

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
