
Support AutoAWQ in awq-py #4701

Closed

@casper-hansen

Description

Feature Description

To support AutoAWQ models, the proposal is simple: load the model through AutoModelForCausalLM.from_pretrained() and convert the WQLinear modules.
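
A minimal sketch of what the loading step could look like. The class name WQLinear_GEMM, its import path, and the repo id are assumptions for illustration and may differ between AutoAWQ versions:

```python
from transformers import AutoModelForCausalLM
from awq.modules.linear import WQLinear_GEMM  # import path/class name may vary by AutoAWQ version

# Load an AutoAWQ checkpoint (repo id is only a placeholder); transformers
# picks up the AWQ quantization config and builds the packed linear layers.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ", device_map="auto"
)

# These packed linears are the modules a GGUF export path would need to convert.
wq_layers = [name for name, m in model.named_modules() if isinstance(m, WQLinear_GEMM)]
print(f"found {len(wq_layers)} WQLinear modules")
```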

Motivation

AutoAWQ is an improved version of llm-awq. We have made quantizing and working with quantized models much easier, resulting in integrations into vLLM, transformers, OpenNMT, and other frameworks. On Hugging Face, you can currently find ~1200 INT4 models made with AutoAWQ, primarily provided by TheBloke.

AutoAWQ does not store the scales because they are redundant for running inference. Instead, we store the real quantized model weights in a one-step process. This means the process will be much easier for llama.cpp users since they can just grab a model from the hub and export it to GGUF, resulting in lower perplexity and better models to chat with.

Possible Implementation

Solution 1: One possible implementation is to unpack the weights to FP16 and convert them to GGUF (a rough sketch follows the list below). I am unsure whether this will introduce any unpacking error. Two possible unpacking routes:

  • dequantize_weights_cuda: awq_ext.dequantize_weights_cuda(qweight, scales, qzeros, 1, 0, 0, False). This is quite simple to call; you only need the kernels package installed.
  • unpack_awq: this feature is being added to AutoGPTQ to unpack AWQ weights, and may be another route for unpacking.
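
A hedged sketch of Solution 1 built around the dequantize_weights_cuda call quoted above. The kernel call itself is taken from this issue; the module surgery, the weight layout (transpose), the dtype handling, and the import paths are assumptions that may need adjusting for a given AutoAWQ version:

```python
import torch
import awq_ext  # CUDA kernels package that ships alongside AutoAWQ
from transformers import AutoModelForCausalLM
from awq.modules.linear import WQLinear_GEMM  # import path is an assumption

def wqlinear_to_fp16(module) -> torch.nn.Linear:
    """Dequantize one packed WQLinear module into a plain FP16 nn.Linear."""
    # Call quoted from the issue; returns the dequantized weight on GPU.
    weight = awq_ext.dequantize_weights_cuda(
        module.qweight, module.scales, module.qzeros, 1, 0, 0, False
    )
    linear = torch.nn.Linear(
        module.in_features, module.out_features,
        bias=module.bias is not None,
        device=weight.device, dtype=torch.float16,
    )
    # Assumption: AWQ stores weights as (in_features, out_features), while
    # nn.Linear expects (out_features, in_features) -- verify the layout
    # against the AutoAWQ version you are using.
    linear.weight.data = weight.t().contiguous()
    if module.bias is not None:
        linear.bias.data = module.bias.to(torch.float16)
    return linear

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ", device_map="auto"
)

# Collect the packed linears first, then swap each for its FP16 equivalent.
replacements = []
for parent in model.modules():
    for attr, child in parent.named_children():
        if isinstance(child, WQLinear_GEMM):
            replacements.append((parent, attr, child))
for parent, attr, child in replacements:
    setattr(parent, attr, wqlinear_to_fp16(child))

# Save the dequantized FP16 model, then run the usual convert script on it
# to produce a GGUF file.
model.save_pretrained("model-dequantized-fp16")
```

Depending on the transformers version, the saved config may still carry the AWQ quantization_config, which would likely need to be removed before running the convert script.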

Other solutions include converting the packed weights directly to GGUF. The main problem is that the packing format for AWQ models is somewhat involved, and I am not sure it can be mapped directly to another format.
