Support AutoAWQ in awq-py #4701

Closed
casper-hansen opened this issue Dec 30, 2023 · 7 comments
Labels
enhancement (New feature or request), stale

Comments

@casper-hansen

Feature Description

To support AutoAWQ models, the proposal is simple: load the model through AutoModelForCausalLM.from_pretrained() and convert the WQLinear modules.
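A minimal sketch of that flow, assuming a transformers build that can load AWQ checkpoints; the repo id below is only a placeholder:

```python
# Minimal sketch, assuming a transformers version that can load AWQ checkpoints.
# The repo id below is a placeholder for any AutoAWQ model on the Hub.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ checkpoint
    torch_dtype=torch.float16,
)

# WQLinear modules carry packed INT4 tensors (qweight, qzeros) plus scales
# instead of a plain FP16 weight; these are what a converter would have to unpack.
for name, module in model.named_modules():
    if type(module).__name__.startswith("WQLinear"):
        print(name, tuple(module.qweight.shape), tuple(module.scales.shape))
```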

Motivation

AutoAWQ is an improved version of llm-awq. We have made quantizing and working with the quantized models much easier, resulting in integrations into vLLM, transformers, OpenNMT, and other frameworks. On Hugging Face, you can currently find ~1200 INT4 models made with AutoAWQ, primarily provided by TheBloke.

AutoAWQ does not store the scales because they are redundant for running inference. Instead, we store the real quantized model weights in a one-step process. This means the process will be much easier for llama.cpp users since they can just grab a model from the hub and export it to GGUF, resulting in lower perplexity and better models to chat with.

Possible Implementation

Solution 1: One possible implementation is to unpack the weights to FP16 and convert them to GGUF. I am unsure whether this will introduce any unpacking error.

  • dequantize_weights_cuda: awq_ext.dequantize_weights_cuda(qweight, scales, qzeros, 1, 0, 0, False). This is quite simple to call; just install the kernels package (a rough sketch follows this list).
  • unpack_awq: This feature is being introduced into AutoGPTQ to unpack AWQ weights, and it may be another solution for unpacking.
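A rough sketch of Solution 1 using the call quoted above; the helper name, the surrounding loop, and the layout note are assumptions, not AutoAWQ API:

```python
# Rough sketch of Solution 1: dequantize each WQLinear back to FP16 with the
# awq_ext CUDA kernels, then hand the FP16 tensors to the GGUF converter.
# The dequantize_weights_cuda call is the one described above; the helper name
# and the surrounding loop are illustrative only.
import awq_ext  # provided by the AutoAWQ kernels package

def wqlinear_to_fp16(module):
    # Returns an FP16 weight tensor; a transpose may be needed afterwards to
    # match the [out_features, in_features] layout GGUF conversion expects.
    return awq_ext.dequantize_weights_cuda(
        module.qweight, module.scales, module.qzeros, 1, 0, 0, False
    )

# `model` is the AutoAWQ checkpoint loaded as in the earlier sketch.
fp16_weights = {
    name: wqlinear_to_fp16(module)
    for name, module in model.named_modules()
    if type(module).__name__.startswith("WQLinear")
}
```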

Other solutions include directly converting the weights to GGUF. The main problem is that the packing for AWQ models is a bit complicated, and I am not sure you can directly convert it to another format.

casper-hansen added the enhancement (New feature or request) label Dec 30, 2023
@casper-hansen
Author

I'm adding GGUF compatibility in casper-hansen/AutoAWQ#285. This makes awq-py largely obsolete.

@ggerganov
Member

Ok, let us know if there is anything to assist with. When merging awq-py, I assumed we would need to support some extra scaling tensors in the computation graphs, but AFAICT these are not actually needed since the scalings are "embedded" into the weights. So in that case, we should probably remove awq-py when you add GGUF compatibility in AutoAWQ.

@casper-hansen
Author

Ok, let us know if there is anything to assist with. When merging awq-py, I assumed we would need to support some extra scaling tensors in the computation graphs,

There are some models that need a special ScaledActivation, so some of the modifications made should be kept. This is a module that is mainly applied to models that use the GELU activation function.
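For reference, a simplified sketch of what such a wrapper does (the actual AutoAWQ module may differ in details): it divides the activation output by the per-channel scales that were folded into the following layer's weights.

```python
import torch
import torch.nn as nn

# Simplified sketch of a ScaledActivation-style wrapper; the real AutoAWQ
# implementation may differ. It divides the activation output by per-channel
# scales so the scaling folded into the following layer's weights cancels out.
class ScaledActivation(nn.Module):
    def __init__(self, act: nn.Module, scales: torch.Tensor):
        super().__init__()
        self.act = act
        self.scales = nn.Parameter(scales)

    def forward(self, x):
        return self.act(x) / self.scales.view(1, 1, -1).to(x.device)
```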

I may also introduce another scaling feature specifically related to MoE models, which would lower perplexity.

but AFAICT these are not actually needed since the scalings are "embedded" into the weights.

Yes, the scales are simply applied (by multiplying/dividing) to the FP16 model weights and then we use llama.cpp to quantize to the specified format.
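A toy illustration of that fusion (shapes and names are illustrative, not AutoAWQ internals):

```python
import torch

# Toy illustration: fold per-input-channel AWQ scales into an FP16 weight so
# that no separate scale tensor is needed at inference time.
out_features, in_features = 128, 256
w = torch.randn(out_features, in_features, dtype=torch.float16)
s = torch.rand(in_features, dtype=torch.float16) + 0.5  # per-channel scales

w_fused = w * s  # weight columns are multiplied by the scales ...
# ... while the preceding op's output is divided by the same scales (folded
# into its own weights), so the end-to-end computation is unchanged and the
# fused FP16 weights can go straight to llama.cpp quantization.
```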

So in that case, we should probably remove awq-py when you add GGUF compatibility in AutoAWQ

From my side, it will not be needed. It’s only needed if you wish to use the original repository.

@namtranase
Contributor

Hi @casper-hansen, I am working on the same method using your AutoAWQ, but I noticed that you have made a new PR. After it is accepted, I will change the code in awq-py to be compatible with your original repo.
I'm also trying to convert the AWQ models to the GGUF format directly. Thanks for your suggestion; I will try to make it work.

@casper-hansen
Author

Hi @casper-hansen, I am working on the same method using your AutoAWQ, but I noticed that you have made a new PR. After it is accepted, I will change the code in awq-py to be compatible with your original repo. I'm also trying to convert the AWQ models to the GGUF format directly. Thanks for your suggestion; I will try to make it work.

Thanks for helping out with this @namtranase. I am planning to release the export functionality in 0.1.9 of AutoAWQ.

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 2, 2024