Add gguf q4_k quantization #2001
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2001
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d4bb04d with merge base 3bbf42a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
haven't looked at the code, but does it also implement the super-block scale?
yeah, this is exactly what this PR is implementing :) Q4_K quant has two levels of quantization
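For readers new to the format, here is a minimal sketch of what the two levels look like on the dequantization side (illustrative only, not the PR's code; all names are made up): per 32-element block there is a 6-bit scale and a 6-bit min, and per 256-element super-block there is one floating-point scale for the block scales and one for the block mins.

```python
# Conceptual sketch of the two quantization levels in a Q4_K-like layout
# (dequantization view only; not the code in this PR, names are made up).
import torch

def dequant_q4_k_like(
    q4: torch.Tensor,       # (n_super, 8, 32) 4-bit weight codes in 0..15
    q_scale: torch.Tensor,  # (n_super, 8)     6-bit scale codes in 0..63
    q_min: torch.Tensor,    # (n_super, 8)     6-bit min codes in 0..63
    d_super: torch.Tensor,  # (n_super, 1)     fp scale for the block scales
    m_super: torch.Tensor,  # (n_super, 1)     fp scale for the block mins
) -> torch.Tensor:
    d = (q_scale * d_super).unsqueeze(-1)  # second level -> per-block scale
    m = (q_min * m_super).unsqueeze(-1)    # second level -> per-block min offset
    return q4 * d - m                      # first level -> dequantized weights
```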
```python
import torch

from torchao.prototype.quantization.gguf import (
```
validate this btw by actually creating a gguf file for a model and then running the resulting gguf file
haven't explored how to export yet, will do in next PR
```python
@dataclass
class GGUFWeightOnlyConfig(AOBaseConfig):
```
Have you checked that this is generic enough to capture all of their superblock affine schemes?
If it included the number of bits per sub-block scales and mins (both are usually the same), then it would be easier to adapt to more types.
Some types use 4-bit sub-block scales and mins (Q2_K), others 6-bit (Q3_K, Q4_K, Q5_K), and others 8-bit (Q6_K).
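As an illustration of this suggestion, a more generic config could carry those bit widths explicitly (hypothetical field names; this is not the config defined in the PR):

```python
from dataclasses import dataclass

# Hypothetical sketch of a more generic config; field names are made up and
# this is not the GGUFWeightOnlyConfig defined in this PR.
@dataclass
class GGUFWeightOnlyConfigSketch:
    weight_bits: int = 4               # 4 for Q4_K, 2 for Q2_K, 6 for Q6_K, ...
    block_size: int = 32               # elements per sub-block
    n_blocks_per_superblock: int = 8   # 8 x 32 = 256 elements per super-block
    sub_block_scale_bits: int = 6      # 4 for Q2_K; 6 for Q3_K/Q4_K/Q5_K; 8 for Q6_K
    sub_block_min_bits: int = 6        # usually the same as the scale bit width
```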
yeah, it will be easy to extend to other variations; I'm just starting with Q4_K for now
```python
@staticmethod
def __new__(
    cls,
```
Does GGUF have a utility that can construct their packed tensors from these values?
Not yet for k-quants, sorry. It is planned though, but it will take some time.
But it would not be a drop-in replacement here anyway since the gguf Python package uses Numpy for its calculations, not PyTorch (at least for now).
thanks for the context @compilade, I think maybe we can try to see if we can use the export script from autoround: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/convert.py#L1159-L1169
> I think maybe we can try to see if we can use the export script from autoround

This will likely be very, very slow (more than 20x the C version) since they re-implement make_qkx2_quants in Numpy: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L164.

Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.

I think the way you did it with separate tensors for the scales and mins does seem appropriate to avoid some of the complexity of the packing format (since from what I understand, this is intended to be some in-memory quantization used in QAT? Do correct me if I'm wrong). (The scales and mins in Q4_K are notoriously packed together in 12 bytes, but that's not relevant if this is intended as an in-memory quantization with different kernels than the ones in ggml.)

You only need to worry about the packing format if you are exporting to GGUF. (Dequantization is implemented in the upstream gguf Python package for most types (including k-quants and i-quants) already; it's only quantization which is currently limited to Q4_0, Q4_1, Q5_0, Q5_1, and Q8_0, because k-quants in Python were too slow to pack (although this could change after ggml-org/llama.cpp#12557, which might simplify the search for scales (and mins, if generalized)).)

It would be possible, though, to add an API to the upstream gguf Python package which would skip the search for the scales and mins but still allow quantizing to Q4_K and similar, but I'm not sure how it would/should be used.
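For reference, a hedged sketch of how that existing dequantization path could be used from the PyTorch side (this assumes a recent gguf-py exposing gguf.quants.dequantize and GGMLQuantizationType; the exact names and behavior are assumptions, not something verified in this PR):

```python
# Sketch only: assumes a recent gguf-py with gguf.quants.dequantize and
# GGMLQuantizationType; quantize() covers only the simple types mentioned
# above, while dequantize() also covers k-quants such as Q4_K. Numpy-based.
import numpy as np
import torch
from gguf import GGMLQuantizationType
from gguf.quants import dequantize

def load_q4_k_weight(packed: np.ndarray, shape: tuple[int, ...]) -> torch.Tensor:
    # packed: the raw Q4_K block bytes of one tensor read from a GGUF file
    deq = dequantize(packed, GGMLQuantizationType.Q4_K)  # float32 Numpy array
    return torch.from_numpy(deq).reshape(shape)
```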
> This will likely be very, very slow (more than 20x the C version) since they re-implement make_qkx2_quants in Numpy: intel/auto-round@eb79348/auto_round/export/export_to_gguf/quant.py#L164.
> Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.

yes, that's correct I think. We may have to adapt that code a bit for torchao quants to work, but we'll be providing scale/min from torchao (the current plan is to use GPTQ, AutoRound or QAT), so we won't need to run that path; we'll need to make sure this path is taken instead: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L326

> if this is intended as an in-memory quantization with different kernels than the ones in ggml

we do want to target ggml kernels in the end still, but the overall goal here is to leverage existing torchao post-training accuracy-preserving techniques like GPTQ, AutoRound etc. and quantization-aware training techniques, to see if we can help improve the accuracy of various gguf quantization schemes by composing with these existing techniques (and relying on user data).

Regarding the search algorithms in gguf, yeah, I feel they are quite complicated (make_qx_quant, make_qp_quant etc.; I also haven't looked at the imatrix stuff), and it might be error-prone for me to port them here. I also didn't see them anywhere else. @compilade, can you share how these are derived at a high level and how we can understand them better, i.e. the high-level motivation/direction/constraints that you are working with? Are all of them trying to minimize the rounding error? What about clamping error? Why don't you use data to improve quantization accuracy?
Late reply, but maybe still useful:

> Regarding search algorithms in gguf, yeah I feel they are quite complicated

Yes, they are. I would not recommend re-implementing them here (unless you really want to restructure the search algorithms to make them more PyTorch-friendly and fast enough (which will need to be done at some point for the gguf Python library in the llama.cpp repo)).

Packing, on the other hand, is more approachable.

> can you share how these are derived at the high level, and how we can understand them better. i.e. the high level motivation/direction/constraints that you are working with

These search algorithms, from my understanding, are weighted rounding algorithms, which search for the best integers which, when scaled (and/or offset, depending on the type), minimize the weighted squared error.

Scales make the norm of the quantized search space irrelevant because it can always be adjusted. The grid of integers can be traversed linearly to cumulatively find the best approximation (conceptually, by trying all pre-rounding scales which would result in a distinct rounding).

Offsets project the search space onto the hyperplane perpendicular to [1, 1, 1, ...] (i.e. the mean is zeroed), which makes it a bit more complicated to traverse.

> are all of them trying to minimize the rounding error? what about clamping error?

The search algorithms for the scales and mins are minimizing the weighted error. By extension, this also includes rounding error and clamping error.

> why don't you use data to improve quantization accuracy?

This is the purpose of imatrix, which basically allows making the search algorithms activation-aware through the weights of the error (with per-channel importance or similar).
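To make the "weighted rounding" description above concrete, here is a deliberately naive sketch of the objective being optimized for a single block; this is not ggml's make_qx_quants or make_qkx2_quants, just a brute-force illustration of the same idea (symmetric case, no offset), with made-up names:

```python
# Naive illustration of a weighted scale search for one block of weights.
# The objective: pick the scale d whose rounded integers q minimize the
# weighted squared error sum_i importance_i * (x_i - d * q_i)^2.
import torch

def search_scale(x: torch.Tensor, importance: torch.Tensor, nmax: int = 7,
                 n_candidates: int = 64):
    amax = x.abs().max()
    if amax == 0:
        return torch.tensor(0.0), torch.zeros_like(x)
    best_d, best_q, best_err = None, None, None
    for step in range(n_candidates):
        # candidate pre-rounding scales around the plain max-abs scale
        d = amax / nmax * (0.5 + step / (n_candidates - 1))
        q = (x / d).round().clamp(-nmax, nmax)
        err = (importance * (x - d * q) ** 2).sum()
        if best_err is None or err < best_err:
            best_d, best_q, best_err = d, q, err
    return best_d, best_q

# `importance` is where activation statistics (imatrix) would enter:
# per-element weights make the search activation-aware.
```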
thanks, I'll merge now since CI is happy; will add more docs next time
* Add gguf q4_k_s quantization
* fix
* test with phi4
* pre-commit run
* update
* run precommit
* format
Summary:
Didn't implement the algorithm to choose_qparams from gguf, since it's complicated, e.g. https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L744 and https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L827C14-L827C28, but implemented a simple choose_qparams that can fit the gguf format: Q4_K: w = q * block_scale(6-bit) + block_min(6-bit)
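A rough sketch of what such a simple choose_qparams can look like (not the exact code in this PR; the min is stored here as a non-negative offset m, i.e. block_min = -m in the formula above, and the input size is assumed to be a multiple of 256):

```python
# Minimal sketch of a "simple choose_qparams" that fits the Q4_K layout:
# min/max fit per 32-element block, then 6-bit quantization of the block
# scales and mins per 256-element super-block. Not the code in this PR.
import torch

def choose_qparams_q4_k_like(w: torch.Tensor, block: int = 32, super_block: int = 256):
    w = w.reshape(-1, super_block // block, block)        # (n_super, 8, 32)
    w_min = w.amin(dim=-1)
    w_max = w.amax(dim=-1)
    scale = (w_max - w_min) / 15.0                        # 4-bit codes: 0..15
    m = (-w_min).clamp(min=0)                             # non-negative min offset
    # second level: 6-bit codes (0..63) for the per-block scale and offset,
    # with one fp scale each per super-block
    d_super = scale.amax(dim=-1, keepdim=True) / 63.0
    m_super = m.amax(dim=-1, keepdim=True) / 63.0
    q_scale = (scale / d_super.clamp(min=1e-12)).round().clamp(0, 63)
    q_min = (m / m_super.clamp(min=1e-12)).round().clamp(0, 63)
    # quantize the weights against the *dequantized* block params so the
    # round-trip matches what inference will use: w ≈ q4 * d - m
    d = (q_scale * d_super).unsqueeze(-1)
    mm = (q_min * m_super).unsqueeze(-1)
    q4 = ((w + mm) / d.clamp(min=1e-12)).round().clamp(0, 15)
    return q4, q_scale, q_min, d_super, m_super
```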
The goal of porting the gguf format into torchao is to compose it with our existing accuracy-preserving techniques like GPTQ, AutoRound and QAT, to see if we can help improve accuracy.
Also produced https://huggingface.co/jerryzh168/phi4-mini-torchao-gguf-q4_k with this change and verified with lm-eval that it has good accuracy.
Test Plan:
python test/prototype/test_gguf_quant.py