Describe the bug
Quantizing a model to 3 bits using this repo leads to completely degraded performance: on MMLU it scores 22%. However, when I quantize the same model with https://github.com/AutoGPTQ/AutoGPTQ (which is where this repo was forked from?), I get 57%. This was with Llama-3.1-8B-Instruct.
Using this repo at 4 bits I get 66% on MMLU, which is in line with what AutoGPTQ gets at 4 bits.
Has anyone else noticed that 3-bit quantization doesn't work here but does work in AutoGPTQ? For reference, the equivalent 3-bit AutoGPTQ quantization can be sketched roughly as below (the model id, calibration slice, and output path are illustrative assumptions, not copied from my exact script).
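# Rough sketch of the AutoGPTQ side of the comparison; model id, calibration
# slice, and output path are assumptions, not a verbatim copy of my script.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

# AutoGPTQ expects pre-tokenized calibration examples (input_ids / attention_mask)
examples = [tokenizer(t) for t in texts]

quantize_config = BaseQuantizeConfig(bits=3, group_size=64)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(examples)
model.save_quantized("Llama-3.1-8B-Instruct-autogptq-3bit")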
Software Info
Operating System/Version + Python Version
Python 3.10
To Reproduce
Quantize model to 3 bits:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024 // 2))["text"]
# calibration_dataset = [" ".join(item.split()[:20]) for item in calibration_dataset]  # speedup

quantize_config = QuantizeConfig(
    bits=3,
    group_size=64,
)

model = GPTQModel.load(args.model_name, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(args.model_name, cache_dir=args.cache_dir)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=args.batch_size)
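Then save the quantized model and score it on MMLU. A minimal sketch of this step, assuming lm-evaluation-harness is installed; the output path and the exact result keys are assumptions, not part of the original repro:

# Hypothetical follow-up: persist the 3-bit model and evaluate it on MMLU.
import lm_eval

quant_path = "Llama-3.1-8B-Instruct-gptq-3bit"  # assumed output path
model.save(quant_path)

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={quant_path}",
    tasks=["mmlu"],
    batch_size=8,
)
# exact result keys depend on the lm-eval version; print everything to be safe
print(results["results"])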
Expected behavior
Performance should not break at 3 bits and should align with the AutoGPTQ library.