Replies: 1 comment
@pwilkin This is something I really appreciate. I've dipped my toe into the code, but it is easy to get overwhelmed. I use the llama.cpp project in my classes for graduates in AI to study inference engine implementations. This project is a perfect tool for me to use when working with non-programmers from the consumer-of-tools PoV, as well as with experienced machine learning programmers who want to look under the hood. Thanks!!
I know this has been asked around and I feel like there isn't too much documentation about the quant types themselves, so I asked a friendly neighborhood LLM to analyze
`ggml-quants.c` and cook a document that describes the various quantization types, with a bit of my guidance.

Do you think this is something that might be worth adding to the documentation? (Note: this is just a draft I'm throwing out to see if it makes any sense; if there's support, I'll review it more carefully.)
GGML Quantization
The GGML library employs a block-based quantization strategy to compress tensors, reducing their memory footprint and computational cost. This document details the general architecture and the specifics of each quantization type.
General Architecture
In GGML, a row of a matrix (or a 1D tensor) is partitioned into fixed-size blocks. Each block is quantized independently, which allows for a good balance between compression ratio and accuracy.
A quantized block typically consists of:
- One or more scale factors (and, for some types, a minimum), typically stored in half precision (`ggml_half`).
- The quantized integer values, packed into a compact byte array.

The process involves:
1. Quantization: for each block, compute the scale(s) from the `float` data. Then, convert each `float` value to its corresponding low-precision integer representation.
2. Dequantization: to recover approximate `float` values, the low-precision integers are converted back to floating-point numbers using the stored scale(s) and/or minimums.

This block-based approach is central to all GGML quantization types.
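To make the flow concrete, here is a minimal, self-contained sketch of the general pattern (an 8-bit symmetric block, similar in spirit to `Q8_0`). The struct layout and function names are illustrative, not the actual ggml definitions.

```c
#include <math.h>
#include <stdint.h>

#define BLOCK_SIZE 32

// Illustrative block layout: one float scale plus 32 packed 8-bit integers.
// (ggml stores the scale as ggml_half; a plain float is used here for simplicity.)
typedef struct {
    float  d;                 // scale
    int8_t qs[BLOCK_SIZE];    // quantized values
} example_block;

// Quantization: derive the scale from the block, then round each value.
static void quantize_block(const float *x, example_block *b) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    b->d = amax / 127.0f;
    const float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        b->qs[i] = (int8_t) roundf(x[i] * id);
    }
}

// Dequantization: multiply each integer back by the stored scale.
static void dequantize_block(const example_block *b, float *y) {
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        y[i] = b->qs[i] * b->d;
    }
}
```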
Block Size Constants
The following block size constants are used throughout the implementation:
- `QK4_0`, `QK4_1`, `QK5_0`, `QK5_1`, `QK8_0`, `QK8_1`: 32 elements (standard quantization)
- `QK_K`: 256 elements (super-block quantization for `_K` types)
- `QK_MXFP4`: 32 elements (MXFP4 quantization)
Importance Matrix Support

Many quantization functions support an optional importance matrix (in the spirit of activation-aware quantization). When provided, this matrix contains weights that prioritize certain elements during quantization, leading to better preservation of important values. The importance matrix is used in the `_impl` versions of the quantization functions and affects the scale calculation and quantization error minimization.
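As a rough illustration of how an importance matrix can influence the choice of scale, the sketch below picks, among a few candidate scales, the one that minimizes the importance-weighted squared reconstruction error. The candidate search and the weighting scheme here are simplified assumptions, not the exact logic of the `_impl` functions.

```c
#include <math.h>

// Pick the scale that minimizes sum_i w[i] * (x[i] - d*q_i)^2 over a small
// set of candidate scales, where q_i is x[i]/d rounded and clamped to [-8, 7].
static float pick_weighted_scale(const float *x, const float *w, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    if (amax == 0.0f) return 0.0f;

    float best_d = amax / 8.0f, best_err = INFINITY;
    for (int step = -4; step <= 4; ++step) {      // candidate scales around amax/8
        const float d = amax / (8.0f + 0.1f * step);
        float err = 0.0f;
        for (int i = 0; i < n; ++i) {
            int q = (int) roundf(x[i] / d);
            if (q < -8) q = -8;
            if (q >  7) q =  7;
            const float diff = x[i] - d * q;
            err += w[i] * diff * diff;            // importance weight w[i]
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```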
Standard Quantization Types

These are the fundamental quantization schemes that operate on blocks of 32 elements.
Q4_0
- Block size: 32 elements (`QK4_0`).
- Scale: find the value with the maximum absolute magnitude `amax` in the block.
- Quantization: compute `d = max / -8`, where `max` is the value with maximum absolute magnitude. This maps the float range `[-amax, amax]` to the integer range `[-8, 7]`. Each value `x` is quantized to a 4-bit integer `qi` from 0 to 15: `qi = MIN(15, (int8_t)(x / d + 8.5f))`.
- Dequantization: `(qi - 8) * d`.
- Notes: uses `nearest_int()` for rounding and handles edge cases where `d` is zero (a packing sketch follows below).
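A compact sketch of the Q4_0 recipe described above (scale, offset by 8, pack two nibbles per byte). Variable names and the exact layout are illustrative rather than the literal ggml-quants.c code.

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// Quantize 32 floats to 4 bits each, two values packed per byte.
static void q4_0_quantize(const float x[QK], float *d_out, uint8_t qs[QK/2]) {
    // find the value with the largest magnitude, keeping its sign
    float max = 0.0f, amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }
    const float d  = max / -8.0f;                  // maps [-amax, amax] -> [-8, 7]
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < QK/2; ++i) {
        int q0 = (int)(x[i]        * id + 8.5f); if (q0 > 15) q0 = 15; if (q0 < 0) q0 = 0;
        int q1 = (int)(x[i + QK/2] * id + 8.5f); if (q1 > 15) q1 = 15; if (q1 < 0) q1 = 0;
        qs[i] = (uint8_t)(q0 | (q1 << 4));         // low nibble, high nibble
    }
    *d_out = d;
}

// Dequantization: (qi - 8) * d
static void q4_0_dequantize(float d, const uint8_t qs[QK/2], float y[QK]) {
    for (int i = 0; i < QK/2; ++i) {
        y[i]        = ((qs[i] & 0x0F) - 8) * d;
        y[i + QK/2] = ((qs[i] >>   4) - 8) * d;
    }
}
```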
Q4_1

- Block size: 32 elements (`QK4_1`).
- Scale: find the minimum `min` and maximum `max` values in the block.
- Quantization: compute `d = (max - min) / 15` and store the minimum `m = min`. Each value `x` is quantized to a 4-bit integer `qi` from 0 to 15: `qi = MIN(15, (int8_t)((x - min) / d + 0.5f))`.
- Dequantization: `qi * d + m`.
- Notes: the importance-matrix implementation computes the minimum with the opposite sign and stores `m = -min`, so the same reconstruction can also be written as `qi * d - min`.
Q5_0

- Overview: the 5-bit analogue of `Q4_0`. It offers more precision by using 5 bits per weight.
- Block size: 32 elements (`QK5_0`).
- Structure: the same scheme as `Q4_0`, but it quantizes values to a 5-bit range [-16, 15]. The lower 4 bits are stored in `qs`, and the 5th (highest) bit for all 32 values is packed into the `qh` array as a 32-bit integer (see the packing sketch below).
- Quantization: `d = max / -16`, values quantized to `MIN(31, (int8_t)(x / d + 16.5f))`.
- Dequantization: `(qi - 16) * d`, where the 5th bit is reconstructed from `qh`.
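The interesting part of Q5_0 relative to Q4_0 is where the 5th bit goes. A sketch of that packing (illustrative, not the exact ggml code):

```c
#include <stdint.h>

// Given 32 five-bit values qi in [0, 31], store the low 4 bits two-per-byte
// in qs and collect the 32 high bits into a single 32-bit word qh.
static void q5_0_pack(const uint8_t qi[32], uint8_t qs[16], uint32_t *qh) {
    *qh = 0;
    for (int j = 0; j < 16; ++j) {
        const uint8_t x0 = qi[j];          // first half of the block
        const uint8_t x1 = qi[j + 16];     // second half of the block
        qs[j] = (uint8_t)((x0 & 0x0F) | ((x1 & 0x0F) << 4));
        *qh |= (uint32_t)((x0 & 0x10) >> 4) << j;          // 5th bit of x0
        *qh |= (uint32_t)((x1 & 0x10) >> 4) << (j + 16);   // 5th bit of x1
    }
}

// Recover the j-th five-bit value from qs/qh.
static int q5_0_get(const uint8_t qs[16], uint32_t qh, int j) {
    const int lo = j < 16 ? (qs[j] & 0x0F) : (qs[j - 16] >> 4);
    const int hi = (qh >> j) & 1;
    return (hi << 4) | lo;   // the dequantized value would be (((hi << 4) | lo) - 16) * d
}
```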
Q5_1

- Overview: the 5-bit analogue of `Q4_1`. Asymmetric quantization with higher precision.
- Block size: 32 elements (`QK5_1`).
- Structure: the same scheme as `Q4_1` with 5-bit precision, storing the 5th bit in `qh`.
Q8_0

- Block size: 32 elements (`QK8_0`).
- Scale: based on `amax` (maximum absolute value).
- Quantization: `d = amax / 127`. Each value `x` is quantized to `roundf(x / d)`, an `int8_t`.
- Dequantization: `qi * d`.
Q8_1

- Block size: 32 elements (`QK8_1`).
- Structure: like `Q8_0`, but in addition to the scale `d`, it pre-calculates and stores the scaled sum of the quantized values `s = d * sum(qs[i])`. This can be used to accelerate certain operations like dot products (see the worked example below).
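To see why storing `s = d * sum(qs[i])` helps, consider a dot product between a weight block quantized with a minimum (like `Q4_1`, where `x ≈ d4*q4 + m4`) and an activation block in `Q8_1` (`y ≈ d8*q8`). The constant offset `m4` ends up multiplying the plain sum of the activations, which is exactly what `s` precomputes. The sketch below illustrates that algebra; it is not the optimized ggml kernel.

```c
#include <stdint.h>

// Dot product of one Q4_1-style block (d4, m4, q4[i] in [0,15]) with one
// Q8_1-style block (d8, q8[i] int8, s8 = d8 * sum(q8[i])), 32 elements each.
//
//   sum_i (d4*q4[i] + m4) * (d8*q8[i])
// = d4*d8 * sum_i q4[i]*q8[i]  +  m4 * (d8 * sum_i q8[i])
// = d4*d8 * idot               +  m4 * s8
static float dot_q4_1_q8_1(float d4, float m4, const uint8_t q4[32],
                           float d8, float s8, const int8_t q8[32]) {
    int32_t idot = 0;
    for (int i = 0; i < 32; ++i) {
        idot += (int32_t)q4[i] * q8[i];      // pure integer dot product
    }
    return d4 * d8 * (float)idot + m4 * s8;  // s8 saves a second pass over the block
}
```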
MXFP4

- Block size: 32 elements (`QK_MXFP4`).
- Scale: find the maximum absolute value `amax` in the block. The shared exponent is `e = floor(log2(amax)) - 2 + 127` (stored in E8M0 format), giving the scale `d = GGML_E8M0_TO_FP32_HALF(e)`.
- Quantization: each value is mapped to the closest entry of the `kvalues_mxfp4` grid scaled by `d`.
- Notes: uses the lookup table `kvalues_mxfp4[16]` for non-linear quantization and the E8M0 floating-point format for the exponent (a simplified sketch follows below).
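A simplified view of the MXFP4 idea: pick a shared power-of-two scale from the block maximum, then snap each value to the nearest FP4 (E2M1) magnitude. The value table and the use of `ldexpf` here are a plain-C illustration of the effect; the real code stores the exponent in E8M0 form and goes through the `kvalues_mxfp4` lookup table.

```c
#include <math.h>

// FP4 (E2M1) representable magnitudes.
static const float fp4_values[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Shared scale: 2^(floor(log2(amax)) - 2), so the block maximum lands near
// the top of the FP4 range (whose largest magnitude is 6).
static float mxfp4_scale(float amax) {
    if (amax == 0.0f) return 1.0f;
    const int e = (int) floorf(log2f(amax)) - 2;
    return ldexpf(1.0f, e);
}

// Quantize one value x against a shared scale d: returns the magnitude index 0..7
// and reports the sign separately.
static int fp4_nearest(float x, float d, int *negative) {
    float v = x / d;
    *negative = v < 0.0f;
    v = fabsf(v);
    int   best     = 0;
    float best_err = fabsf(v - fp4_values[0]);
    for (int i = 1; i < 8; ++i) {
        const float err = fabsf(v - fp4_values[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;   // dequantized magnitude: d * fp4_values[best]
}
```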
Ternary Quantization Types (`TQ` Prefix)

These quantization schemes implement ternary quantization (values in {-1, 0, 1}), designed for models like BitNet b1.58 and TriLMs. They provide extremely efficient storage for neural networks that can operate with only three weight states.
TQ1_0
- Block size: 256 elements (`QK_K`).
- Scale: find the maximum absolute value `amax` in the block and set the scale `d = amax`.
- Quantization: each value is mapped to `xi = lroundf(x / d) + 1` (maps {-1, 0, 1} to {0, 1, 2}). Five ternary digits are packed per byte in base 3 with `q = q * 3 + xi`, then rescaled to the full byte range with `q = ((uint16_t)q * 256 + 242) / 243`.
- Storage: most values are stored in `qs` (5 per byte) and the remaining values in `qh` (4 per byte).
- Dequantization: uses powers of 3 `{1, 3, 9, 27, 81, 243}` to extract ternary values: `xi = ((uint16_t)q * 3) >> 8`, then `value = (xi - 1) * d` (demonstrated below).
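The base-3 packing is the clever part of TQ1_0: five ternary digits fit in one byte because 3^5 = 243 ≤ 256, and the `* 256 + 242) / 243` rescaling lets any digit be recovered with one multiply and one shift. A small self-contained demonstration (illustrative, not the ggml code):

```c
#include <stdint.h>

// Pack five ternary digits t[0..4] (each 0, 1 or 2) into one byte.
static uint8_t tq1_pack5(const uint8_t t[5]) {
    uint16_t q = 0;
    for (int n = 0; n < 5; ++n) {
        q = (uint16_t)(q * 3 + t[n]);             // base-3 accumulation, t[0] most significant
    }
    return (uint8_t)((q * 256 + 242) / 243);      // rescale 0..242 -> 0..255
}

// Recover digit n (0 = most significant) from a packed byte.
static uint8_t tq1_digit(uint8_t packed, int n) {
    static const uint8_t pow3[5] = {1, 3, 9, 27, 81};
    const uint8_t shifted = (uint8_t)(packed * pow3[n]);   // wraps mod 256 on purpose
    return (uint8_t)(((uint16_t)shifted * 3) >> 8);        // top base-3 digit is the answer
}
```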
TQ2_0

- Block size: 256 elements (`QK_K`).
- Scale: find the maximum absolute value `amax` in the block and set the scale `d = amax`.
- Quantization: each value is mapped to `xi = lroundf(x / d) + 1` (maps {-1, 0, 1} to {0, 1, 2}). Four values are packed per byte: `q += (xi & 3) << (2*n)`.
- Dequantization: `q = (qs[j] >> (l*2)) & 3`, then `value = (q - 1) * d`.
Super-Block Quantization Types (`_K` Suffix)
The `_K` quantization types use "super-blocks" of 256 elements (`QK_K`). These super-blocks are composed of smaller sub-blocks. The key innovation is that the scales and minimums of the sub-blocks are themselves quantized, leading to a higher compression ratio and better accuracy, as the scaling is more localized. The "K" in the name refers to the k-means clustering used to create the quantization tables.
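The core two-level idea can be sketched as follows: each sub-block gets its own float scale, those scales are themselves quantized against a single super-block scale, and dequantization multiplies the two back together. The 6-bit width and the layout below are illustrative; the exact widths differ per `_K` type.

```c
#include <math.h>
#include <stdint.h>

#define N_SUB 8   // sub-blocks per super-block (e.g. 8 x 32 = 256 elements)

// Quantize the per-sub-block scales themselves to 6 bits.
static void quantize_subscales(const float sub_scale[N_SUB],
                               float *d_super, uint8_t qscale[N_SUB]) {
    float max_scale = 0.0f;
    for (int i = 0; i < N_SUB; ++i) {
        if (sub_scale[i] > max_scale) max_scale = sub_scale[i];
    }
    *d_super = max_scale / 63.0f;                    // super-block scale
    for (int i = 0; i < N_SUB; ++i) {
        int q = *d_super != 0.0f ? (int) roundf(sub_scale[i] / *d_super) : 0;
        if (q > 63) q = 63;
        qscale[i] = (uint8_t)q;                      // 6-bit quantized scale
    }
}

// Effective scale used when dequantizing sub-block i.
static float effective_scale(float d_super, const uint8_t qscale[N_SUB], int i) {
    return d_super * qscale[i];
}
```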
Q2_K

- Structure: 2-bit quantization. The super-block is divided into 16 sub-blocks of 16 elements; each sub-block has a 4-bit scale and a 4-bit minimum packed into the `scales` array (scale in lower 4 bits, min in upper 4 bits).
- Super-block scales: a super-block scale (`d = max_scale/15`) and super-block minimum (`dmin = max_min/15`) are used to dequantize the sub-block scales and minimums.
- Notes: uses `make_qkx3_quants()` and `make_qp_quants()` for optimal scale calculation with importance weighting.
Q3_K

- Structure: 3-bit quantization without a minimum (`x = scale * q`). The 16 sub-block scales are quantized to 6 bits and packed in a complex 12-byte format. The 3-bit weights are split, with the lowest 2 bits in `qs` and the highest bit in `hmask`.
- High bits: `hmask` stores the high bit for each group of 8 quantized values.
- Notes: uses `make_q3_quants()` for RMSE-optimized quantization.
Q4_K

- Structure: 4-bit quantization with scales and minimums (`x = scale * q + min`). It divides the super-block into 8 sub-blocks of 32 elements, quantizing their scales and mins into a 12-byte `scales` array (an unpacking sketch follows below).
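One way to fit 8 six-bit scales and 8 six-bit mins into 12 bytes is to let the first four scales and mins occupy the low 6 bits of bytes 0-7, and split the remaining four between bytes 8-11 and the spare high 2 bits of bytes 0-7. This mirrors the K-quants layout, but treat the exact bit positions below as an assumption rather than a reference.

```c
#include <stdint.h>

// Extract the 6-bit scale and min for sub-block j (0..7) from a 12-byte array.
static void get_scale_min(int j, const uint8_t s[12], uint8_t *scale, uint8_t *min) {
    if (j < 4) {
        *scale = s[j]     & 63;                                 // bytes 0-3, low 6 bits
        *min   = s[j + 4] & 63;                                 // bytes 4-7, low 6 bits
    } else {
        *scale = (uint8_t)((s[j + 4] & 0x0F) | ((s[j - 4] >> 6) << 4)); // low nibble of 8-11 + top bits of 0-3
        *min   = (uint8_t)((s[j + 4] >>   4) | ((s[j    ] >> 6) << 4)); // high nibble of 8-11 + top bits of 4-7
    }
}
```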
Q5_K

- Overview: the 5-bit analogue of `Q4_K`, offering higher precision. This is a good choice for users who want a bit more quality than Q4_K without a large increase in file size.
- Structure: follows the `Q4_K` structure but accommodates 5-bit weights by adding the `qh` array for the 5th bit.

Q6_K

- Structure: 6-bit quantization without a minimum (`x = scale * q`). The super-block is divided into 16 sub-blocks of 16 elements, with 8-bit sub-block scales and a single super-block scale `d`.
Importance Matrix (IQ) and Non-Linear Quantization Types
These are advanced schemes for very low bit-rates. They often use non-linear quantization grids and can leverage an "importance matrix" during quantization to selectively preserve important weights, yielding better performance. These are newer and more experimental than the other quantization types.
IQ4_NL
- Structure: 4-bit non-linear quantization on blocks of 32 elements. Each weight is stored as an index into the non-uniform `kvalues_iq4nl` grid, which is then scaled by `d` (a lookup sketch follows below).
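Non-linear quantization boils down to a nearest-neighbour search in a small, non-uniform value table. Here is a sketch with a made-up 16-entry grid; the real table is `kvalues_iq4nl`, and the values below are placeholders only.

```c
#include <math.h>
#include <stdint.h>

// Illustrative non-uniform 4-bit grid (16 entries, denser near zero).
static const int8_t example_grid[16] = {
    -112, -88, -68, -51, -37, -25, -14, -5, 2, 10, 19, 30, 43, 59, 78, 101
};

// Find the grid index whose scaled value d*grid[k] is closest to x.
static int best_grid_index(float x, float d) {
    int   best     = 0;
    float best_err = fabsf(x - d * example_grid[0]);
    for (int k = 1; k < 16; ++k) {
        const float err = fabsf(x - d * example_grid[k]);
        if (err < best_err) { best_err = err; best = k; }
    }
    return best;   // dequantized value: d * example_grid[best]
}
```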
IQ3_XXS

- Structure: uses a codebook (`iq3xxs_grid`) to quantize values. The 3-bit quantized integers are indices into this grid. The final value is `d * grid[index]`.
IQ2_XXS

- Structure: each `uint16_t` in `qs` encodes information for 8 floats. It uses parts of the bits to index into a signs table (`ksigns_iq2xs`) and a quantization grid (`iq2xxs_grid`).
IQ2_XS

- Structure: similar to `IQ2_XXS` but with an additional layer of scaling for sub-blocks of 16 elements, providing more fine-grained quantization.

IQ2_S

- Structure: a refinement of `IQ2_XS` with a larger codebook (`iq2s_grid`), explicitly stored sign bits, and 4-bit sub-block scales.

IQ3_S

- Structure: the counterpart of `IQ3_XXS` with a larger grid (`iq3s_grid`), explicit sign bytes, and 4-bit sub-block scales.

IQ4_XS

- Structure: like `IQ4_NL` but applied to super-blocks of 256 elements, with 6-bit sub-block scales split into high (`scales_h`) and low (`scales_l`) bits.
IQ1_S

- Structure: extremely low-bit quantization where small groups of weights are encoded as indices into a codebook (`iq1s_grid`).
IQ1_M

- Overview: a variant of `IQ1_S`.
- Structure: unlike `IQ1_S`, it does not have a global scale `d`. Instead, it relies entirely on 3-bit quantized scales for its sub-blocks.

Implementation Details
Quantization Functions
Each quantization type has multiple implementation variants:
- Reference implementations (`_ref` suffix): Deterministic implementations used for creating model files, ensuring reproducibility across platforms.
- Importance-matrix implementations (`_impl` functions): Optimized implementations that use an importance matrix to preserve critical weights during quantization.
- Dispatch functions (`quantize_<type>`): High-level functions that choose between the reference and importance-matrix implementations based on whether an importance matrix is provided (sketched below).
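The dispatch pattern can be pictured roughly like this; the function names, signatures, and stub bodies are illustrative, not the actual ggml declarations.

```c
#include <stddef.h>
#include <stdint.h>

// Stubs standing in for the reference and importance-matrix implementations
// of one quantization type.
static void quantize_row_example_ref(const float *x, void *dst, int64_t n) {
    (void)x; (void)dst; (void)n;                       // real code quantizes deterministically
}
static void quantize_row_example_impl(const float *x, void *dst, int64_t n,
                                      const float *quant_weights) {
    (void)x; (void)dst; (void)n; (void)quant_weights;  // real code uses the importance weights
}

// High-level entry point: take the importance-matrix path only when an
// importance matrix was actually provided.
static void quantize_example(const float *x, void *dst, int64_t n,
                             const float *quant_weights) {
    if (quant_weights == NULL) {
        quantize_row_example_ref(x, dst, n);
    } else {
        quantize_row_example_impl(x, dst, n, quant_weights);
    }
}
```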
Error Thresholds

Different quantization types use specific epsilon thresholds for handling near-zero blocks.
Helper Functions
Key internal functions used across quantization implementations:
- `nearest_int()`: Fast integer rounding with specific bounds checking (see the sketch below)
- `make_qx_quants()`: General quantization with RMSE optimization and weighting
- `make_q3_quants()`: Specialized 3-bit quantization with iterative refinement
- `make_qkx1_quants()`, `make_qkx2_quants()`, `make_qkx3_quants()`: Multi-level quantization with different optimization strategies
- `best_index_int8()`: Binary search for finding the best match in quantization grids
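For reference, the fast rounding trick behind `nearest_int()` works by adding a large constant so that the fractional bits fall off the end of the float mantissa. The sketch below is patterned after the ggml version; the exact constants should be checked against the source.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

// Round-to-nearest for |fval| below 2^22, using float mantissa alignment:
// adding 1.5 * 2^23 forces the value's integer part into the low mantissa bits.
static inline int fast_nearest_int(float fval) {
    assert(fabsf(fval) <= 4194303.f);      // 2^22 - 1: the trick's valid input range
    const float val = fval + 12582912.f;   // 1.5 * 2^23
    int i;
    memcpy(&i, &val, sizeof(int));
    return (i & 0x007fffff) - 0x00400000;  // extract the mantissa and re-center
}
```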
Bit Packing Patterns

The implementation uses sophisticated bit packing to maximize storage efficiency: two 4-bit values per byte, separate `qh` arrays for high bits, base-3 packing of ternary digits, and 6-bit scales split across multiple bytes.