Add gguf q4_k quantization #2001
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2001
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d4bb04d with merge base 3bbf42a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
haven't looked at the code, but does it also implement the super-block scale?
yeah, this is exactly what this PR is implementing :) Q4_K quant has two levels of quantization
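For readers new to the format, here is a minimal sketch of what the two levels look like on the dequantization side (illustrative only, not the PR's code; all names are made up): per 32-element block there is a 6-bit scale and a 6-bit min, and per 256-element super-block there is one floating-point scale for the block scales and one for the block mins.

```python
# Conceptual sketch of the two quantization levels in a Q4_K-like layout
# (dequantization view only; not the code in this PR, names are made up).
import torch

def dequant_q4_k_like(
    q4: torch.Tensor,       # (n_super, 8, 32) 4-bit weight codes in 0..15
    q_scale: torch.Tensor,  # (n_super, 8)     6-bit scale codes in 0..63
    q_min: torch.Tensor,    # (n_super, 8)     6-bit min codes in 0..63
    d_super: torch.Tensor,  # (n_super, 1)     fp scale for the block scales
    m_super: torch.Tensor,  # (n_super, 1)     fp scale for the block mins
) -> torch.Tensor:
    d = (q_scale * d_super).unsqueeze(-1)  # second level -> per-block scale
    m = (q_min * m_super).unsqueeze(-1)    # second level -> per-block min offset
    return q4 * d - m                      # first level -> dequantized weights
```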
```python
import torch

from torchao.prototype.quantization.gguf import (
```
validate this btw by actually creating a gguf file for a model and then running the resulting gguf file
haven't explored how to export yet, will do in next PR
```python
@dataclass
class GGUFWeightOnlyConfig(AOBaseConfig):
```
Have you checked that this is generic enough to capture all of their superblock affine schemes?
If it included the number of bits per sub-block scales and mins (both are usually the same), then it would be easier to adapt to more types.
Some types use 4-bit sub-block scales and mins (Q2_K), others 6-bit (Q3_K, Q4_K, Q5_K), and others 8-bit (Q6_K).
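As an illustration of this suggestion, a more generic config could carry those bit widths explicitly (hypothetical field names; this is not the config defined in the PR):

```python
from dataclasses import dataclass

# Hypothetical sketch of a more generic config; field names are made up and
# this is not the GGUFWeightOnlyConfig defined in this PR.
@dataclass
class GGUFWeightOnlyConfigSketch:
    weight_bits: int = 4               # 4 for Q4_K, 2 for Q2_K, 6 for Q6_K, ...
    block_size: int = 32               # elements per sub-block
    n_blocks_per_superblock: int = 8   # 8 x 32 = 256 elements per super-block
    sub_block_scale_bits: int = 6      # 4 for Q2_K; 6 for Q3_K/Q4_K/Q5_K; 8 for Q6_K
    sub_block_min_bits: int = 6        # usually the same as the scale bit width
```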
yeah, it will be easy to extend to other variations; I'm just starting with Q4_K for now
```python
@staticmethod
def __new__(
    cls,
```
Does GGUF have a utility that can construct their packed tensors from these values?
Not yet for k-quants, sorry. It is planned though, but it will take some time.
But it would not be a drop-in replacement here anyway since the gguf Python package uses Numpy for its calculations, not PyTorch (at least for now).
thanks for the context @compilade, I think maybe we can try to see if we can use the export script from autoround: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/convert.py#L1159-L1169
> I think maybe we can try to see if we can use the export script from autoround

This will likely be very, very slow (more than 20x the C version) since they re-implement make_qkx2_quants in Numpy: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L164.

Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.

I think the way you did it with separate tensors for the scales and mins does seem appropriate to avoid some of the complexity of the packing format (since from what I understand, this is intended to be some in-memory quantization used in QAT? Do correct me if I'm wrong). (The scales and mins in Q4_K are notoriously packed together in 12 bytes, but that's not relevant if this is intended as an in-memory quantization with different kernels than the ones in ggml.)

You only need to worry about the packing format if you are exporting to GGUF. (Dequantization is implemented in the upstream gguf Python package for most types (including k-quants and i-quants) already; it's only quantization which is currently limited to Q4_0, Q4_1, Q5_0, Q5_1, and Q8_0, because k-quants in Python were too slow to pack (although this could change after ggml-org/llama.cpp#12557, which might simplify the search for scales (and mins, if generalized)).)

It would be possible, though, to add an API to the upstream gguf Python package which would skip the search for the scales and mins but still allow quantizing to Q4_K and similar, but I'm not sure how it would/should be used.
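For reference, a hedged sketch of how that existing dequantization path could be used from the PyTorch side (this assumes a recent gguf-py exposing gguf.quants.dequantize and GGMLQuantizationType; the exact names and behavior are assumptions, not something verified in this PR):

```python
# Sketch only: assumes a recent gguf-py with gguf.quants.dequantize and
# GGMLQuantizationType; quantize() covers only the simple types mentioned
# above, while dequantize() also covers k-quants such as Q4_K. Numpy-based.
import numpy as np
import torch
from gguf import GGMLQuantizationType
from gguf.quants import dequantize

def load_q4_k_weight(packed: np.ndarray, shape: tuple[int, ...]) -> torch.Tensor:
    # packed: the raw Q4_K block bytes of one tensor read from a GGUF file
    deq = dequantize(packed, GGMLQuantizationType.Q4_K)  # float32 Numpy array
    return torch.from_numpy(deq).reshape(shape)
```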
> This will likely be very, very slow (more than 20x the C version) since they re-implement make_qkx2_quants in Numpy: intel/auto-round@eb79348/auto_round/export/export_to_gguf/quant.py#L164.
> Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.

yes, that's correct I think. We may have to adapt that code a bit for torchao quants to work, but we'll be providing scale/min from torchao (the current plan is to use GPTQ, AutoRound or QAT), so we won't need to run that path; we'll need to make sure this path is taken instead: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L326

> if this is intended as an in-memory quantization with different kernels than the ones in ggml

we do want to target ggml kernels in the end still, but the overall goal here is to leverage existing torchao post-training accuracy-preserving techniques like GPTQ, AutoRound etc. and quantization-aware training techniques, to see if we can help improve the accuracy of various gguf quantization schemes by composing with these existing techniques (and relying on user data).

Regarding the search algorithms in gguf, yeah, I feel they are quite complicated (make_qx_quant, make_qp_quant etc.; I also haven't looked at the imatrix stuff), and it might be error-prone for me to port them here. I also didn't see them anywhere else. @compilade, can you share how these are derived at a high level and how we can understand them better, i.e. the high-level motivation/direction/constraints that you are working with? Are all of them trying to minimize the rounding error? What about clamping error? Why don't you use data to improve quantization accuracy?
Late reply, but maybe still useful:

> Regarding search algorithms in gguf, yeah I feel they are quite complicated

Yes, they are. I would not recommend re-implementing them here (unless you really want to restructure the search algorithms to make them more PyTorch-friendly and fast enough (which will need to be done at some point for the gguf Python library in the llama.cpp repo)).

Packing, on the other hand, is more approachable.

> can you share how these are derived at the high level, and how we can understand them better. i.e. the high level motivation/direction/constraints that you are working with

These search algorithms, from my understanding, are weighted rounding algorithms, which search for the best integers which, when scaled (and/or offset, depending on the type), minimize the weighted squared error.

Scales make the norm of the quantized search space irrelevant because it can always be adjusted. The grid of integers can be traversed linearly to cumulatively find the best approximation (conceptually, by trying all pre-rounding scales which would result in a distinct rounding).

Offsets project the search space onto the hyperplane perpendicular to [1, 1, 1, ...] (i.e. the mean is zeroed), which makes it a bit more complicated to traverse.

> are all of them trying to minimize the rounding error? what about clamping error?

The search algorithms for the scales and mins are minimizing the weighted error. By extension, this also includes rounding error and clamping error.

> why don't you use data to improve quantization accuracy?

This is the purpose of imatrix, which basically allows making the search algorithms activation-aware through the weights of the error (with per-channel importance or similar).
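To make the "weighted rounding" description above concrete, here is a deliberately naive sketch of the objective being optimized for a single block; this is not ggml's make_qx_quants or make_qkx2_quants, just a brute-force illustration of the same idea (symmetric case, no offset), with made-up names:

```python
# Naive illustration of a weighted scale search for one block of weights.
# The objective: pick the scale d whose rounded integers q minimize the
# weighted squared error sum_i importance_i * (x_i - d * q_i)^2.
import torch

def search_scale(x: torch.Tensor, importance: torch.Tensor, nmax: int = 7,
                 n_candidates: int = 64):
    amax = x.abs().max()
    if amax == 0:
        return torch.tensor(0.0), torch.zeros_like(x)
    best_d, best_q, best_err = None, None, None
    for step in range(n_candidates):
        # candidate pre-rounding scales around the plain max-abs scale
        d = amax / nmax * (0.5 + step / (n_candidates - 1))
        q = (x / d).round().clamp(-nmax, nmax)
        err = (importance * (x - d * q) ** 2).sum()
        if best_err is None or err < best_err:
            best_d, best_q, best_err = d, q, err
    return best_d, best_q

# `importance` is where activation statistics (imatrix) would enter:
# per-element weights make the search activation-aware.
```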
thanks, I'll merge now since CI is happy; will add more docs next time
* Add gguf q4_k_s quantization
* fix
* test with phi4
* pre-commit run
* update
* run precommit
* format
Summary:
Didn't implement the algorithm to choose_qparams from gguf, since it's complicated, e.g. https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L744 and https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L827C14-L827C28, but implemented a simple choose_qparams that can fit the gguf format: Q4_K: w = q * block_scale(6-bit) + block_min(6-bit)
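A rough sketch of what such a simple choose_qparams can look like (not the exact code in this PR; the min is stored here as a non-negative offset m, i.e. block_min = -m in the formula above, and the input size is assumed to be a multiple of 256):

```python
# Minimal sketch of a "simple choose_qparams" that fits the Q4_K layout:
# min/max fit per 32-element block, then 6-bit quantization of the block
# scales and mins per 256-element super-block. Not the code in this PR.
import torch

def choose_qparams_q4_k_like(w: torch.Tensor, block: int = 32, super_block: int = 256):
    w = w.reshape(-1, super_block // block, block)        # (n_super, 8, 32)
    w_min = w.amin(dim=-1)
    w_max = w.amax(dim=-1)
    scale = (w_max - w_min) / 15.0                        # 4-bit codes: 0..15
    m = (-w_min).clamp(min=0)                             # non-negative min offset
    # second level: 6-bit codes (0..63) for the per-block scale and offset,
    # with one fp scale each per super-block
    d_super = scale.amax(dim=-1, keepdim=True) / 63.0
    m_super = m.amax(dim=-1, keepdim=True) / 63.0
    q_scale = (scale / d_super.clamp(min=1e-12)).round().clamp(0, 63)
    q_min = (m / m_super.clamp(min=1e-12)).round().clamp(0, 63)
    # quantize the weights against the *dequantized* block params so the
    # round-trip matches what inference will use: w ≈ q4 * d - m
    d = (q_scale * d_super).unsqueeze(-1)
    mm = (q_min * m_super).unsqueeze(-1)
    q4 = ((w + mm) / d.clamp(min=1e-12)).round().clamp(0, 15)
    return q4, q_scale, q_min, d_super, m_super
```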
The goal of porting the gguf format into torchao is to compose it with our existing accuracy-preserving techniques like GPTQ, AutoRound and QAT, to see if we can help improve accuracy.
Also produced https://huggingface.co/jerryzh168/phi4-mini-torchao-gguf-q4_k with this change and verified with lm-eval that it has good accuracy.
Test Plan:
python test/prototype/test_gguf_quant.py