Hi guys, thank you for your efforts in implementing GPTQModel. I am the author of EoRA, a training-free method that introduces a low-rank residual path alongside the quantized weights to mitigate compression error. An overview of our approach is depicted in the figure below:
[Figure: EoRA overview, a low-rank residual path added alongside the quantized weights]
I have conducted experiments on compensating GPTQ-quantized models and obtained some very promising results: the effectiveness of our method becomes increasingly significant at lower bit-widths.
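To make the idea concrete, here is a minimal sketch of the kind of per-layer computation involved, assuming W is the original weight, W_q its quantized counterpart, and X a batch of calibration activations. The function name and the exact projection details are illustrative only, not the official implementation:

import torch

def eora_sketch(W, W_q, X, rank):
    # Approximate the compression error dW = W - W_q with a rank-r term,
    # weighted by the eigenspace of the calibration activations so the
    # directions that matter most for real inputs are recovered first.
    dW = W - W_q
    G = X.T @ X / X.shape[0]                      # activation Gram matrix
    eigvals, eigvecs = torch.linalg.eigh(G)       # eigenspace of activations
    S = eigvecs * eigvals.clamp_min(1e-6).sqrt()  # scaled projection matrix
    U, s, Vh = torch.linalg.svd(dW @ S, full_matrices=False)
    B = U[:, :rank] * s[:rank]                    # [out_features, rank]
    A = Vh[:rank] @ torch.linalg.inv(S)           # [rank, in_features]
    return A, B                                   # W ~= W_q + B @ A

# Toy usage: the rank-64 correction should shrink the quantization error
W = torch.randn(512, 512)
W_q = torch.round(W * 8) / 8                      # toy uniform "quantization"
X = torch.randn(1024, 512)                        # calibration activations
A, B = eora_sketch(W, W_q, X, rank=64)
print((W - (W_q + B @ A)).norm() / (W - W_q).norm())

Because everything reduces to an eigendecomposition and a truncated SVD, no gradients or backpropagation are needed, which is what keeps calibration down to minutes.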
Key Advantages of EoRA:
One of the key merits of EoRA is its ability to improve the accuracy of a general quantized model on specific downstream tasks with minimal effort, in terms of both the amount of calibration data and the time required. For example, if a user is given a GPTQ model quantized with the GPTQModel framework and wants to enhance its performance on a downstream task, EoRA can achieve this with a small calibration dataset in just a few minutes, without requiring training, gradient computation, or backpropagation. This is particularly beneficial for GPTQModel, as EoRA gives users an extra layer of flexibility in adjusting the trade-off between downstream-task accuracy and model size.
Currently, my EoRA implementation is based on the official GPTQ repository (https://github.com/IST-DASLab/gptq), which I assume could be integrated into the GPTQModel framework fairly easily. Here's an example of how it might look:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)
# Choose a calibration dataset for EoRA (any downstream dataset works)
eora_calibration_dataset = load_dataset("could be any downstream dataset")
# Run EoRA (proposed API; EoRA_calibration is the integration point to add)
model = EoRA_calibration(model, eora_calibration_dataset)
# Save the EoRA-compensated model
eora_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit-eora"  # illustrative path
model.save(eora_path)
# test post-quant inference
model = GPTQModel.load(eora_path)
result = model.generate("Uncovering deep insights begins with")[0]
Additional Benefits of EoRA:
Multi-LoRA Inference Support:
EoRA supports multi-LoRA inference, as the underlying GPTQ model remains unchanged. Only the low-rank matrices differ for each downstream task, enabling efficient task-specific adaptations.
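As a rough illustration of what this enables at serving time, the sketch below keeps one shared base layer fixed and swaps only per-task low-rank pairs. The adapter dictionary and helper names are hypothetical, not an existing GPTQModel API:

import torch

in_f, out_f, rank = 512, 512, 32
base = torch.nn.Linear(in_f, out_f, bias=False)  # stands in for a shared GPTQ layer

def make_adapter(r):
    # Illustrative low-rank pair; in practice A and B come from EoRA calibration.
    return torch.randn(r, in_f) * 0.01, torch.randn(out_f, r) * 0.01

# One quantized base model, several task-specific residuals
adapters = {"summarization": make_adapter(rank), "qa": make_adapter(rank)}

def forward(x, task):
    A, B = adapters[task]
    # Shared quantized path plus a cheap per-task low-rank correction
    return base(x) + (x @ A.T) @ B.T

y = forward(torch.randn(4, in_f), "qa")

Since the quantized base weights never change, the memory cost of supporting N tasks is just N small low-rank pairs rather than N full models.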
Optimized Kernels for Speed:
Our team has also implemented a kernel based on the GPTQ ExLlama framework, which accelerates inference for GPTQ + EoRA by up to 1.3x compared to directly running the GPTQ quantization kernel with native PyTorch low-rank matrix multiplication.
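For reference, the naive composition that the 1.3x speedup is measured against looks like the following. Here gptq_gemm is a stand-in for the quantized matmul kernel, and the fused kernel would compute both terms in a single launch:

import torch

def eora_forward_naive(x, gptq_gemm, A, B):
    # Baseline: a quantized kernel call followed by a separate PyTorch
    # low-rank matmul. The fused kernel merges these into one pass.
    return gptq_gemm(x) + (x @ A.T) @ B.T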
Let me know if you'd like more details or have feedback on integrating EoRA!
Considering that the EoRA code will take more time to release due to our company's internal review process, I want to start integrating EoRA into this framework first. If you are also interested, I would appreciate a quick meeting to discuss how we should proceed and how we can work together on this feature. Please don't hesitate to contact me via email: [email protected]. Looking forward to your feedback.
