Replies: 1 comment
@pwilkin This is something I really appreciate. I've dipped my toe into the code, but it is easy to get overwhelmed. I use the llama.cpp project in my classes for graduates in AI to study inference engine implementations. This project is a perfect tool for me to use when working with non-programmers from the consumer-of-tools PoV, as well as with experienced machine learning programmers who want to look under the hood. Thanks!!
I know this has been asked around and I feel like there isn't too much documentation about the quant types themselves, so I asked a friendly neighborhood LLM to analyze
`ggml-quants.c` and cook a document that describes the various quantization types, with a bit of my guidance.

Do you think this is something that might be worth adding to the documentation? (Note: this is just a draft I'm throwing out to see if it makes any sense; if there's support, I'll review it more carefully.)
GGML Quantization
The GGML library employs a block-based quantization strategy to compress tensors, reducing their memory footprint and computational cost. This document details the general architecture and the specifics of each quantization type.
General Architecture
In GGML, a row of a matrix (or a 1D tensor) is partitioned into fixed-size blocks. Each block is quantized independently, which allows for a good balance between compression ratio and accuracy.
A quantized block typically consists of:
- One or more scale factors (and, for some types, a minimum), typically stored in half precision (`ggml_half`).
- The quantized integer values, packed into a compact byte array.

The process involves:
1. Quantization: for each block, compute the scale(s) from the `float` data. Then, convert each `float` value to its corresponding low-precision integer representation.
2. Dequantization: to recover approximate `float` values, the low-precision integers are converted back to floating-point numbers using the stored scale(s) and/or minimums.

This block-based approach is central to all GGML quantization types.
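To make the flow concrete, here is a minimal, self-contained sketch of the general pattern (an 8-bit symmetric block, similar in spirit to `Q8_0`). The struct layout and function names are illustrative, not the actual ggml definitions.

```c
#include <math.h>
#include <stdint.h>

#define BLOCK_SIZE 32

// Illustrative block layout: one float scale plus 32 packed 8-bit integers.
// (ggml stores the scale as ggml_half; a plain float is used here for simplicity.)
typedef struct {
    float  d;                 // scale
    int8_t qs[BLOCK_SIZE];    // quantized values
} example_block;

// Quantization: derive the scale from the block, then round each value.
static void quantize_block(const float *x, example_block *b) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    b->d = amax / 127.0f;
    const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        b->qs[i] = (int8_t) roundf(x[i] * id);
    }
}

// Dequantization: multiply each integer back by the stored scale.
static void dequantize_block(const example_block *b, float *y) {
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        y[i] = b->qs[i] * b->d;
    }
}
```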
Block Size Constants
The following block size constants are used throughout the implementation:
- `QK4_0`, `QK4_1`, `QK5_0`, `QK5_1`, `QK8_0`, `QK8_1`: 32 elements (standard quantization)
- `QK_K`: 256 elements (super-block quantization for `_K` types)
- `QK_MXFP4`: 32 elements (MXFP4 quantization)
Importance Matrix Support

Many quantization functions support an optional importance matrix (in the spirit of activation-aware quantization). When provided, this matrix contains weights that prioritize certain elements during quantization, leading to better preservation of important values. The importance matrix is used in the `_impl` versions of the quantization functions and affects the scale calculation and quantization error minimization.
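As a rough illustration of how an importance matrix can influence the choice of scale, the sketch below picks, among a few candidate scales, the one that minimizes the importance-weighted squared reconstruction error. The candidate search and the weighting scheme here are simplified assumptions, not the exact logic of the `_impl` functions.

```c
#include <math.h>

// Pick the scale that minimizes sum_i w[i] * (x[i] - d*q_i)^2 over a small
// set of candidate scales, where q_i is x[i]/d rounded and clamped to [-8, 7].
static float pick_weighted_scale(const float *x, const float *w, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    if (amax == 0.0f) return 0.0f;

    float best_d = amax / 8.0f, best_err = INFINITY;
    for (int step = -4; step <= 4; ++step) {      // candidate scales around amax/8
        const float d = amax / (8.0f + 0.1f * step);
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            int q = (int) roundf(x[i] / d);
            if (q < -8) q = -8;
            if (q >  7) q =  7;
            const float diff = x[i] - d * q;
            err += w[i] * diff * diff;            // importance weight w[i]
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```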
Standard Quantization Types

These are the fundamental quantization schemes that operate on blocks of 32 elements.
Q4_0
- Block size: 32 elements (`QK4_0`).
- Scale: find the value with the maximum absolute magnitude `amax` in the block.
- Quantization: compute `d = max / -8`, where `max` is the value with maximum absolute magnitude. This maps the float range `[-amax, amax]` to the integer range `[-8, 7]`. Each value `x` is quantized to a 4-bit integer `qi` from 0 to 15: `qi = MIN(15, (int8_t)(x / d + 8.5f))`.
- Dequantization: `(qi - 8) * d`.
- Notes: uses `nearest_int()` for rounding and handles edge cases where `d` is zero (a packing sketch follows below).
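A compact sketch of the Q4_0 recipe described above (scale, offset by 8, pack two nibbles per byte). Variable names and the exact layout are illustrative rather than the literal ggml-quants.c code.

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// Quantize 32 floats to 4 bits each, two values packed per byte.
static void q4_0_quantize(const float x[QK], float *d_out, uint8_t qs[QK/2]) {
    // find the value with the largest magnitude, keeping its sign
    float max = 0.0f, amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    const float d  = max / -8.0f;                  // maps [-amax, amax] -> [-8, 7]
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < QK/2; ++i) {
        int q0 = (int)(x[i]        * id + 8.5f); if (q0 > 15) q0 = 15; if (q0 < 0) q0 = 0;
        int q1 = (int)(x[i + QK/2] * id + 8.5f); if (q1 > 15) q1 = 15; if (q1 < 0) q1 = 0;
        qs[i] = (uint8_t)(q0 | (q1 << 4));         // low nibble, high nibble
    }
    *d_out = d;
}

// Dequantization: (qi - 8) * d
static void q4_0_dequantize(float d, const uint8_t qs[QK/2], float y[QK]) {
    for (int i = 0; i < QK/2; ++i) {
        y[i]        = ((qs[i] & 0x0F) - 8) * d;
        y[i + QK/2] = ((qs[i] >>   4) - 8) * d;
    }
}
```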
Q4_1

- Block size: 32 elements (`QK4_1`).
- Scale: find the minimum `min` and maximum `max` values in the block.
- Quantization: compute `d = (max - min) / 15` and store the minimum `m = min`. Each value `x` is quantized to a 4-bit integer `qi` from 0 to 15: `qi = MIN(15, (int8_t)((x - min) / d + 0.5f))`.
- Dequantization: `qi * d + m`.
- Notes: the importance-matrix implementation computes the minimum with the opposite sign and stores `m = -min`, so the same reconstruction can also be written as `qi * d - min`.
Q5_0

- Overview: the 5-bit analogue of `Q4_0`. It offers more precision by using 5 bits per weight.
- Block size: 32 elements (`QK5_0`).
- Structure: the same scheme as `Q4_0`, but it quantizes values to a 5-bit range [-16, 15]. The lower 4 bits are stored in `qs`, and the 5th (highest) bit for all 32 values is packed into the `qh` array as a 32-bit integer (see the packing sketch below).
- Quantization: `d = max / -16`, values quantized to `MIN(31, (int8_t)(x / d + 16.5f))`.
- Dequantization: `(qi - 16) * d`, where the 5th bit is reconstructed from `qh`.
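The interesting part of Q5_0 relative to Q4_0 is where the 5th bit goes. A sketch of that packing (illustrative, not the exact ggml code):

```c
#include <stdint.h>

// Given 32 five-bit values qi in [0, 31], store the low 4 bits two-per-byte
// in qs and collect the 32 high bits into a single 32-bit word qh.
static void q5_0_pack(const uint8_t qi[32], uint8_t qs[16], uint32_t *qh) {
    *qh = 0;
    for (int j = 0; j < 16; ++j) {
        const uint8_t x0 = qi[j];          // first half of the block
        const uint8_t x1 = qi[j + 16];     // second half of the block
        qs[j] = (uint8_t)((x0 & 0x0F) | ((x1 & 0x0F) << 4));
        *qh |= (uint32_t)((x0 & 0x10) >> 4) << j;          // 5th bit of x0
        *qh |= (uint32_t)((x1 & 0x10) >> 4) << (j + 16);   // 5th bit of x1
    }
}

// Recover the j-th five-bit value from qs/qh.
static int q5_0_get(const uint8_t qs[16], uint32_t qh, int j) {
    const int lo = j < 16 ? (qs[j] & 0x0F) : (qs[j - 16] >> 4);
    const int hi = (qh >> j) & 1;
    return (hi << 4) | lo;   // the dequantized value would be (((hi << 4) | lo) - 16) * d
}
```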
Q5_1

- Overview: the 5-bit analogue of `Q4_1`. Asymmetric quantization with higher precision.
- Block size: 32 elements (`QK5_1`).
- Structure: the same scheme as `Q4_1` with 5-bit precision, storing the 5th bit in `qh`.
Q8_0

- Block size: 32 elements (`QK8_0`).
- Scale: based on `amax` (maximum absolute value).
- Quantization: `d = amax / 127`. Each value `x` is quantized to `roundf(x / d)`, an `int8_t`.
- Dequantization: `qi * d`.
Q8_1

- Block size: 32 elements (`QK8_1`).
- Structure: like `Q8_0`, but in addition to the scale `d`, it pre-calculates and stores the scaled sum of the quantized values `s = d * sum(qs[i])`. This can be used to accelerate certain operations like dot products (see the worked example below).
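To see why storing `s = d * sum(qs[i])` helps, consider a dot product between a weight block quantized with a minimum (like `Q4_1`, where `x ≈ d4*q4 + m4`) and an activation block in `Q8_1` (`y ≈ d8*q8`). The constant offset `m4` ends up multiplying the plain sum of the activations, which is exactly what `s` precomputes. The sketch below illustrates that algebra; it is not the optimized ggml kernel.

```c
#include <stdint.h>

// Dot product of one Q4_1-style block (d4, m4, q4[i] in [0,15]) with one
// Q8_1-style block (d8, q8[i] int8, s8 = d8 * sum(q8[i])), 32 elements each.
//
//   sum_i (d4*q4[i] + m4) * (d8*q8[i])
// = d4*d8 * sum_i q4[i]*q8[i]  +  m4 * (d8 * sum_i q8[i])
// = d4*d8 * idot               +  m4 * s8
static float dot_q4_1_q8_1(float d4, float m4, const uint8_t q4[32],
                           float d8, float s8, const int8_t q8[32]) {
    int32_t idot = 0;
    for (int i = 0; i < 32; ++i) {
        idot += (int32_t)q4[i] * q8[i];      // pure integer dot product
    }
    return d4 * d8 * (float)idot + m4 * s8;  // s8 saves a second pass over the block
}
```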
MXFP4

- Block size: 32 elements (`QK_MXFP4`).
- Scale: find the maximum absolute value `amax` in the block. The shared exponent is `e = floor(log2(amax)) - 2 + 127` (stored in E8M0 format), giving the scale `d = GGML_E8M0_TO_FP32_HALF(e)`.
- Quantization: each value is mapped to the closest entry of the `kvalues_mxfp4` grid scaled by `d`.
- Notes: uses the lookup table `kvalues_mxfp4[16]` for non-linear quantization and the E8M0 floating-point format for the exponent (a simplified sketch follows below).
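A simplified view of the MXFP4 idea: pick a shared power-of-two scale from the block maximum, then snap each value to the nearest FP4 (E2M1) magnitude. The value table and the use of `ldexpf` here are a plain-C illustration of the effect; the real code stores the exponent in E8M0 form and goes through the `kvalues_mxfp4` lookup table.

```c
#include <math.h>

// FP4 (E2M1) representable magnitudes.
static const float fp4_values[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Shared scale: 2^(floor(log2(amax)) - 2), so the block maximum lands near
// the top of the FP4 range (whose largest magnitude is 6).
static float mxfp4_scale(float amax) {
    if (amax == 0.0f) return 1.0f;
    const int e = (int) floorf(log2f(amax)) - 2;
    return ldexpf(1.0f, e);
}

// Quantize one value x against a shared scale d: returns the magnitude index 0..7
// and reports the sign separately.
static int fp4_nearest(float x, float d, int *negative) {
    float v = x / d;
    *negative = v < 0.0f;
    v = fabsf(v);
    int   best     = 0;
    float best_err = fabsf(v - fp4_values[0]);
    for (int i = 1; i < 8; ++i) {
        const float err = fabsf(v - fp4_values[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;   // dequantized magnitude: d * fp4_values[best]
}
```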
Ternary Quantization Types (`TQ` Prefix)

These quantization schemes implement ternary quantization (values in {-1, 0, 1}), designed for models like BitNet b1.58 and TriLMs. They provide extremely efficient storage for neural networks that can operate with only three weight states.
TQ1_0
- Block size: 256 elements (`QK_K`).
- Scale: find the maximum absolute value `amax` in the block and set the scale `d = amax`.
- Quantization: each value is mapped to `xi = lroundf(x / d) + 1` (maps {-1, 0, 1} to {0, 1, 2}). Five ternary digits are packed per byte in base 3 with `q = q * 3 + xi`, then rescaled to the full byte range with `q = ((uint16_t)q * 256 + 242) / 243`.
- Storage: most values are stored in `qs` (5 per byte) and the remaining values in `qh` (4 per byte).
- Dequantization: uses powers of 3 `{1, 3, 9, 27, 81, 243}` to extract ternary values: `xi = ((uint16_t)q * 3) >> 8`, then `value = (xi - 1) * d` (demonstrated below).
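The base-3 packing is the clever part of TQ1_0: five ternary digits fit in one byte because 3^5 = 243 ≤ 256, and the `* 256 + 242) / 243` rescaling lets any digit be recovered with one multiply and one shift. A small self-contained demonstration (illustrative, not the ggml code):

```c
#include <stdint.h>

// Pack five ternary digits t[0..4] (each 0, 1 or 2) into one byte.
static uint8_t tq1_pack5(const uint8_t t[5]) {
    uint16_t q = 0;
    for (int n = 0; n < 5; ++n) {
        q = (uint16_t)(q * 3 + t[n]);             // base-3 accumulation, t[0] most significant
    }
    return (uint8_t)((q * 256 + 242) / 243);      // rescale 0..242 -> 0..255
}

// Recover digit n (0 = most significant) from a packed byte.
static uint8_t tq1_digit(uint8_t packed, int n) {
    static const uint8_t pow3[5] = {1, 3, 9, 27, 81};
    const uint8_t shifted = (uint8_t)(packed * pow3[n]);   // wraps mod 256 on purpose
    return (uint8_t)(((uint16_t)shifted * 3) >> 8);        // top base-3 digit is the answer
}
```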
TQ2_0

- Block size: 256 elements (`QK_K`).
- Scale: find the maximum absolute value `amax` in the block and set the scale `d = amax`.
- Quantization: each value is mapped to `xi = lroundf(x / d) + 1` (maps {-1, 0, 1} to {0, 1, 2}). Four values are packed per byte: `q += (xi & 3) << (2*n)`.
- Dequantization: `q = (qs[j] >> (l*2)) & 3`, then `value = (q - 1) * d`.
Super-Block Quantization Types (`_K` Suffix)
The `_K` quantization types use "super-blocks" of 256 elements (`QK_K`). These super-blocks are composed of smaller sub-blocks. The key innovation is that the scales and minimums of the sub-blocks are themselves quantized, leading to a higher compression ratio and better accuracy, as the scaling is more localized. The "K" in the name refers to the k-means clustering used to create the quantization tables.
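The core two-level idea can be sketched as follows: each sub-block gets its own float scale, those scales are themselves quantized against a single super-block scale, and dequantization multiplies the two back together. The 6-bit width and the layout below are illustrative; the exact widths differ per `_K` type.

```c
#include <math.h>
#include <stdint.h>

#define N_SUB 8   // sub-blocks per super-block (e.g. 8 x 32 = 256 elements)

// Quantize the per-sub-block scales themselves to 6 bits.
static void quantize_subscales(const float sub_scale[N_SUB],
                               float *d_super, uint8_t qscale[N_SUB]) {
    float max_scale = 0.0f;
    for (int i = 0; i < N_SUB; ++i) {
        if (sub_scale[i] > max_scale) max_scale = sub_scale[i];
    }
    *d_super = max_scale / 63.0f;                    // super-block scale
    for (int i = 0; i < N_SUB; ++i) {
        int q = *d_super != 0.0f ? (int) roundf(sub_scale[i] / *d_super) : 0;
        if (q > 63) q = 63;
        qscale[i] = (uint8_t)q;                      // 6-bit quantized scale
    }
}

// Effective scale used when dequantizing sub-block i.
static float effective_scale(float d_super, const uint8_t qscale[N_SUB], int i) {
    return d_super * qscale[i];
}
```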
Q2_K

- Structure: 2-bit quantization. The super-block is divided into 16 sub-blocks of 16 elements; each sub-block has a 4-bit scale and a 4-bit minimum packed into the `scales` array (scale in lower 4 bits, min in upper 4 bits).
- Super-block scales: a super-block scale (`d = max_scale/15`) and super-block minimum (`dmin = max_min/15`) are used to dequantize the sub-block scales and minimums.
- Notes: uses `make_qkx3_quants()` and `make_qp_quants()` for optimal scale calculation with importance weighting.
Q3_K

- Structure: 3-bit quantization without a minimum (`x = scale * q`). The 16 sub-block scales are quantized to 6 bits and packed in a complex 12-byte format. The 3-bit weights are split, with the lowest 2 bits in `qs` and the highest bit in `hmask`.
- High bits: `hmask` stores the high bit for each group of 8 quantized values.
- Notes: uses `make_q3_quants()` for RMSE-optimized quantization.
Q4_K

- Structure: 4-bit quantization with scales and minimums (`x = scale * q + min`). It divides the super-block into 8 sub-blocks of 32 elements, quantizing their scales and mins into a 12-byte `scales` array (an unpacking sketch follows below).
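One way to fit 8 six-bit scales and 8 six-bit mins into 12 bytes is to let the first four scales and mins occupy the low 6 bits of bytes 0-7, and split the remaining four between bytes 8-11 and the spare high 2 bits of bytes 0-7. This mirrors the K-quants layout, but treat the exact bit positions below as an assumption rather than a reference.

```c
#include <stdint.h>

// Extract the 6-bit scale and min for sub-block j (0..7) from a 12-byte array.
static void get_scale_min(int j, const uint8_t s[12], uint8_t *scale, uint8_t *min) {
    if (j < 4) {
        *scale = s[j]     & 63;                                 // bytes 0-3, low 6 bits
        *min   = s[j + 4] & 63;                                 // bytes 4-7, low 6 bits
    } else {
        *scale = (uint8_t)((s[j + 4] & 0x0F) | ((s[j - 4] >> 6) << 4)); // low nibble of 8-11 + top bits of 0-3
        *min   = (uint8_t)((s[j + 4] >>   4) | ((s[j    ] >> 6) << 4)); // high nibble of 8-11 + top bits of 4-7
    }
}
```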
Q5_K

- Overview: the 5-bit analogue of `Q4_K`, offering higher precision. This is a good choice for users who want a bit more quality than Q4_K without a large increase in file size.
- Structure: follows the `Q4_K` structure but accommodates 5-bit weights by adding the `qh` array for the 5th bit.

Q6_K

- Structure: 6-bit quantization without a minimum (`x = scale * q`). The super-block is divided into 16 sub-blocks of 16 elements, with 8-bit sub-block scales and a single super-block scale `d`.
Importance Matrix (IQ) and Non-Linear Quantization Types
These are advanced schemes for very low bit-rates. They often use non-linear quantization grids and can leverage an "importance matrix" during quantization to selectively preserve important weights, yielding better performance. These are newer and more experimental than the other quantization types.
IQ4_NL
- Structure: 4-bit non-linear quantization on blocks of 32 elements. Each weight is stored as an index into the non-uniform `kvalues_iq4nl` grid, which is then scaled by `d` (a lookup sketch follows below).
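Non-linear quantization boils down to a nearest-neighbour search in a small, non-uniform value table. Here is a sketch with a made-up 16-entry grid; the real table is `kvalues_iq4nl`, and the values below are placeholders only.

```c
#include <math.h>
#include <stdint.h>

// Illustrative non-uniform 4-bit grid (16 entries, denser near zero).
static const int8_t example_grid[16] = {
    -112, -88, -68, -51, -37, -25, -14, -5, 2, 10, 19, 30, 43, 59, 78, 101
};

// Find the grid index whose scaled value d*grid[k] is closest to x.
static int best_grid_index(float x, float d) {
    int   best     = 0;
    float best_err = fabsf(x - d * example_grid[0]);
    for (int k = 1; k < 16; ++k) {
        const float err = fabsf(x - d * example_grid[k]);
        if (err < best_err) { best_err = err; best = k; }
    }
    return best;   // dequantized value: d * example_grid[best]
}
```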
IQ3_XXS

- Structure: uses a codebook (`iq3xxs_grid`) to quantize values. The 3-bit quantized integers are indices into this grid. The final value is `d * grid[index]`.
IQ2_XXS

- Structure: each `uint16_t` in `qs` encodes information for 8 floats. It uses parts of the bits to index into a signs table (`ksigns_iq2xs`) and a quantization grid (`iq2xxs_grid`).
IQ2_XS

- Structure: similar to `IQ2_XXS` but with an additional layer of scaling for sub-blocks of 16 elements, providing more fine-grained quantization.

IQ2_S

- Structure: a refinement of `IQ2_XS` with a larger codebook (`iq2s_grid`), explicitly stored sign bits, and 4-bit sub-block scales.

IQ3_S

- Structure: the counterpart of `IQ3_XXS` with a larger grid (`iq3s_grid`), explicit sign bytes, and 4-bit sub-block scales.

IQ4_XS

- Structure: like `IQ4_NL` but applied to super-blocks of 256 elements, with 6-bit sub-block scales split into high (`scales_h`) and low (`scales_l`) bits.
IQ1_S

- Structure: extremely low-bit quantization where small groups of weights are encoded as indices into a codebook (`iq1s_grid`).
IQ1_M

- Overview: a variant of `IQ1_S`.
- Structure: unlike `IQ1_S`, it does not have a global scale `d`. Instead, it relies entirely on 3-bit quantized scales for its sub-blocks.

Implementation Details
Quantization Functions
Each quantization type has multiple implementation variants:
- Reference implementations (`_ref` suffix): Deterministic implementations used for creating model files, ensuring reproducibility across platforms.
- Importance-matrix implementations (`_impl` functions): Optimized implementations that use an importance matrix to preserve critical weights during quantization.
- Dispatch functions (`quantize_<type>`): High-level functions that choose between the reference and importance-matrix implementations based on whether an importance matrix is provided (sketched below).
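The dispatch pattern can be pictured roughly like this; the function names, signatures, and stub bodies are illustrative, not the actual ggml declarations.

```c
#include <stddef.h>
#include <stdint.h>

// Stubs standing in for the reference and importance-matrix implementations
// of one quantization type.
static void quantize_row_example_ref(const float *x, void *dst, int64_t n) {
    (void)x; (void)dst; (void)n;                       // real code quantizes deterministically
}
static void quantize_row_example_impl(const float *x, void *dst, int64_t n,
                                      const float *quant_weights) {
    (void)x; (void)dst; (void)n; (void)quant_weights;  // real code uses the importance weights
}

// High-level entry point: take the importance-matrix path only when an
// importance matrix was actually provided.
static void quantize_example(const float *x, void *dst, int64_t n,
                             const float *quant_weights) {
    if (quant_weights == NULL) {
        quantize_row_example_ref(x, dst, n);
    } else {
        quantize_row_example_impl(x, dst, n, quant_weights);
    }
}
```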
Error Thresholds

Different quantization types use specific epsilon thresholds for handling near-zero blocks.
Helper Functions
Key internal functions used across quantization implementations:
- `nearest_int()`: Fast integer rounding with specific bounds checking (see the sketch below)
- `make_qx_quants()`: General quantization with RMSE optimization and weighting
- `make_q3_quants()`: Specialized 3-bit quantization with iterative refinement
- `make_qkx1_quants()`, `make_qkx2_quants()`, `make_qkx3_quants()`: Multi-level quantization with different optimization strategies
- `best_index_int8()`: Binary search for finding the best match in quantization grids
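For reference, the fast rounding trick behind `nearest_int()` works by adding a large constant so that the fractional bits fall off the end of the float mantissa. The sketch below is patterned after the ggml version; the exact constants should be checked against the source.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

// Round-to-nearest for |fval| below 2^22, using float mantissa alignment:
// adding 1.5 * 2^23 forces the value's integer part into the low mantissa bits.
static inline int fast_nearest_int(float fval) {
    assert(fabsf(fval) <= 4194303.f);      // 2^22 - 1: the trick's valid input range
    const float val = fval + 12582912.f;   // 1.5 * 2^23
    int i;
    memcpy(&i, &val, sizeof(int));
    return (i & 0x007fffff) - 0x00400000;  // extract the mantissa and re-center
}
```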
Bit Packing Patterns

The implementation uses sophisticated bit packing to maximize storage efficiency: two 4-bit values per byte, separate `qh` arrays for high bits, base-3 packing of ternary digits, and 6-bit scales split across multiple bytes.