Support AutoAWQ in awq-py
#4701
Comments
I'm adding GGUF compatibility in casper-hansen/AutoAWQ#285. This makes …
Ok, let us know if there is anything to assist with. When merging …
There are some models that need a special ScaledActivation, so some of the modifications made should be kept. This module is mainly applied to models that use the GELU function. I may also introduce another scaling feature, specifically related to MoE, which would lower perplexity.
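For context, the scaled activation described here can be pictured as a thin PyTorch wrapper around the original activation. The sketch below is illustrative and hedged, not the exact AutoAWQ implementation; the attribute names and broadcasting details are assumptions.

```python
import torch
import torch.nn as nn

class ScaledActivation(nn.Module):
    """Wrap an activation (e.g. GELU) and divide its output by per-channel
    scales, so the AWQ scales can be absorbed without changing the model's
    output. Sketch only; the real AutoAWQ module may differ in details."""

    def __init__(self, act: nn.Module, scales: torch.Tensor):
        super().__init__()
        self.act = act
        self.scales = nn.Parameter(scales)  # one scale per hidden channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcasts over the last (channel) dimension of x.
        return self.act(x) / self.scales.to(x.device)
```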
Yes, the scales are simply applied (by multiplying/dividing) to the FP16 model weights and then we use llama.cpp to quantize to the specified format.
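As a rough sketch of that scale application (assuming one scale per input channel of the target linear layer; this is not the exact awq-py code in llama.cpp):

```python
import torch

def scale_linear_weight(weight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Fold AWQ scales into an FP16 linear weight.

    weight: [out_features, in_features] FP16 tensor
    scales: [in_features] per-input-channel AWQ scales
    The op feeding this layer is divided by the same scales elsewhere,
    so the end-to-end model output stays essentially unchanged.
    """
    return (weight * scales.view(1, -1)).to(torch.float16)
```

The scaled FP16 checkpoint can then go through the usual llama.cpp convert and quantize steps, as described above.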
From my side, it will not be needed. It’s only needed if you wish to use the original repository.
Hi @casper-hansen, I am working on the same method using your AutoAWQ, but I noticed that you have made a new PR. After it is accepted, I will change the code in …
Thanks for helping out with this, @namtranase. I am planning to release the export functionality in AutoAWQ 0.1.9.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Description
To support the AutoAWQ models, the proposal is simple: load the model through AutoModelForCausalLM.from_pretrained() and convert the WQLinear modules.
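A minimal sketch of the first half of that proposal, loading an AWQ checkpoint and locating the packed layers that would need conversion (the model id is only an example, and the WQLinear class-name check is an assumption about how AutoAWQ names its quantized linears):

```python
import torch
from transformers import AutoModelForCausalLM

# Example AWQ checkpoint from the Hub; any AutoAWQ INT4 model should work.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    torch_dtype=torch.float16,
)

# Collect the packed AWQ linear layers that would have to be converted
# back to FP16 before the usual GGUF conversion can run.
wq_linears = {
    name: module
    for name, module in model.named_modules()
    if type(module).__name__.startswith("WQLinear")  # e.g. WQLinear_GEMM
}
print(f"Found {len(wq_linears)} WQLinear modules to convert")
```

Converting each of those modules back to a plain FP16 linear layer is what the implementation options further below are about.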
Motivation
AutoAWQ is an improved version of llm-awq. We have made quantizing and working with the quantized models much easier, resulting in integrations into vLLM, transformers, OpenNMT, and other frameworks. On Hugging Face, you can currently find ~1200 INT4 models made with AutoAWQ, primarily provided by TheBloke.

AutoAWQ does not store the scales because they are redundant for running inference. Instead, we store the real quantized model weights in a one-step process. This means the process will be much easier for llama.cpp users, since they can just grab a model from the hub and export it to GGUF, resulting in lower perplexity and better models to chat with.

Possible Implementation
Solution 1: One possible implementation is to unpack the weights to FP16 and convert them to GGUF. I am unsure if this will introduce any unpacking error.
- dequantize_weights_cuda: awq_ext.dequantize_weights_cuda(qweight, scales, qzeros, 1, 0, 0, False). This is quite simple to call; just install the kernels package. A hedged sketch of this route is shown after this list.
- unpack_awq: This feature is being introduced into AutoGPTQ in order to unpack the weights of AWQ. This may be another solution for unpacking.

Other solutions include directly converting the weights to GGUF. The main problem is that the packing for AWQ models is a bit complicated, and I am not sure you can directly convert it to another format.
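For illustration, here is a hedged sketch of the dequantize_weights_cuda route referenced above, unpacking a single WQLinear module into an FP16 nn.Linear. The attribute names (qweight, qzeros, scales, in_features, out_features, bias) and the orientation of the kernel's output are assumptions about AutoAWQ's packed layout, not verified facts.

```python
import torch
import torch.nn as nn
import awq_ext  # CUDA kernels shipped with AutoAWQ

def dequantize_wqlinear(wq) -> nn.Linear:
    """Unpack one packed INT4 WQLinear module into a plain FP16 nn.Linear
    using the kernel call quoted above. Sketch only; attribute names and
    weight orientation may need adjusting against the real AutoAWQ layout."""
    fp16_weight = awq_ext.dequantize_weights_cuda(
        wq.qweight, wq.scales, wq.qzeros, 1, 0, 0, False
    )
    linear = nn.Linear(wq.in_features, wq.out_features, bias=wq.bias is not None)
    # Assuming the kernel returns [in_features, out_features]; transpose for nn.Linear.
    linear.weight.data = fp16_weight.t().contiguous().to(torch.float16)
    if wq.bias is not None:
        linear.bias.data = wq.bias.to(torch.float16)
    return linear.half()
```

Each converted layer could then be swapped back into the model in place of the original WQLinear before running the normal llama.cpp convert and quantize steps.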