Description
Feature Description
To support AutoAWQ models, the proposal is simple: load the model through AutoModelForCausalLM.from_pretrained() and convert the WQLinear modules.
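As a minimal sketch of what that could look like (assuming the autoawq package is installed alongside transformers, a CUDA device is available, and using a TheBloke checkpoint purely as an example), the quantized linear layers can be located by class name after loading:

```python
# Minimal sketch: load an AWQ checkpoint with transformers and locate the
# quantized linear modules. The checkpoint name is only an example, and the
# qweight/scales/qzeros attribute names follow AutoAWQ's WQLinear modules.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-AWQ",   # example AWQ checkpoint from the hub
    torch_dtype=torch.float16,
)

# Walk the module tree and collect every quantized linear layer by class name,
# so we do not depend on a specific AutoAWQ import path.
wq_linears = {
    name: module
    for name, module in model.named_modules()
    if "WQLinear" in type(module).__name__
}

for name, module in wq_linears.items():
    # Each module carries the packed INT4 weights plus scales and zero points.
    print(name, module.qweight.shape, module.scales.shape, module.qzeros.shape)
```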
Motivation
AutoAWQ is an improved version of llm-awq. We have made quantizing and working with quantized models much easier, which has led to integrations into vLLM, transformers, OpenNMT, and other frameworks. On Hugging Face, you can currently find ~1200 INT4 models made with AutoAWQ, primarily provided by TheBloke.
AutoAWQ does not store the scales because they are redundant for running inference. Instead, we store the real quantized model weights in a one-step process. This means the process will be much easier for llama.cpp users, since they can just grab a model from the hub and export it to GGUF, resulting in lower perplexity and better models to chat with.
Possible Implementation
Solution 1: One possible implementation is to unpack the weights to FP16 and convert them to GGUF. I am unsure whether this will introduce any unpacking error. There are two candidates for doing the unpacking (see the sketch after this list):
- dequantize_weights_cuda: awq_ext.dequantize_weights_cuda(qweight, scales, qzeros, 1, 0, 0, False). This is quite simple to call; just install the kernels package.
- unpack_awq: this function is being introduced into AutoGPTQ in order to unpack AWQ weights, so it may be another option for unpacking.
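Below is a minimal sketch of Solution 1 for a single module, using the awq_ext call quoted above. It assumes the AutoAWQ kernels package (awq_ext) is installed and the tensors live on a CUDA device; the helper name and the note about transposing are assumptions, not verified details.

```python
# Sketch: recover FP16 weights from one packed WQLinear module via awq_ext.
import awq_ext  # provided by the AutoAWQ kernels package (assumed installed)

def dequantize_wqlinear_to_fp16(module):
    """Recover an FP16 weight tensor from a packed WQLinear module."""
    qweight = module.qweight.cuda()
    scales = module.scales.cuda()
    qzeros = module.qzeros.cuda()
    # Integer/boolean arguments are passed exactly as in the call quoted above.
    fp16_weight = awq_ext.dequantize_weights_cuda(
        qweight, scales, qzeros, 1, 0, 0, False
    )
    # The GGUF writer may expect the (out_features, in_features) layout used by
    # torch.nn.Linear, so a transpose might still be needed before export.
    return fp16_weight.cpu()
```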
Other solutions include converting the weights to GGUF directly. The main problem is that the packing of AWQ models is a bit complicated, and I am not sure it can be converted straight into another format.
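To illustrate why the packing makes a direct conversion awkward, here is a rough sketch of undoing just the bit packing in plain PyTorch. The interleaving order is an assumption based on AutoAWQ's GEMM packing code and may differ between kernel variants, and this does not apply scales or zero points.

```python
# AWQ's GEMM format packs eight 4-bit values into every int32 in an
# interleaved order, so a converter cannot simply reinterpret the buffer.
import torch

AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # assumed nibble interleaving

def unpack_awq_int32(packed: torch.Tensor) -> torch.Tensor:
    """Unpack a (rows, cols) int32 tensor into (rows, cols * 8) int4 values stored as int8."""
    shifts = torch.arange(0, 32, 4, device=packed.device)
    nibbles = (packed.unsqueeze(-1) >> shifts) & 0xF        # (rows, cols, 8)
    # Undo the interleaving so the columns come back in logical order.
    inverse = torch.argsort(torch.tensor(AWQ_PACK_ORDER, device=packed.device))
    nibbles = nibbles[..., inverse]
    return nibbles.reshape(packed.shape[0], -1).to(torch.int8)
```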