
[BUG] 3bit quant and/or inference regression vs AutoGPTQ #1278

@sidhantls

Description


Describe the bug
Quantizing a model to 3 bits using this repo leads to completely deteriorated performance. On MMLU, it gets 22%. However, when I quantize it using https://github.com/AutoGPTQ/AutoGPTQ, (which is where this repo was forked from?), I get 57%. This was using Llama-3.1-8B-Instruct.

Using this repo for 4-bit, I get 66% on MMLU, which is in line with what AutoGPTQ gets for 4 bits.

Has anyone else noticed that 3-bit doesn't work here but works in AutoGPTQ?
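
For reference, this is roughly the AutoGPTQ path that produces the 57% result. The exact script isn't included in this report, so the sketch below assumes AutoGPTQ's standard BaseQuantizeConfig API with the same bits/group_size settings; model_name and calibration_texts are placeholders:

    # Hedged sketch of the AutoGPTQ 3-bit path (assumed equivalent settings, not the exact script used)
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    quantize_config = BaseQuantizeConfig(bits=3, group_size=64)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

    # AutoGPTQ expects tokenized examples (dicts with input_ids / attention_mask)
    examples = [tokenizer(text) for text in calibration_texts]
    model.quantize(examples)
    model.save_quantized("llama-3.1-8b-instruct-3bit-autogptq")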

Software Info

Operating System/Version + Python Version
python 3.10

To Reproduce
Quantize model to 3 bits:

    from datasets import load_dataset
    from transformers import AutoTokenizer
    from gptqmodel import GPTQModel, QuantizeConfig

    # 512 calibration samples from the C4 English split
    calibration_dataset = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train"
    ).select(range(1024//2))["text"]

    # calibration_dataset = [" ".join(item.split()[:20]) for item in calibration_dataset] # speedup

    quantize_config = QuantizeConfig(
        bits=3,
        group_size=64,
    )

    model = GPTQModel.load(args.model_name, quantize_config)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, cache_dir=args.cache_dir)

    # increase `batch_size` to match gpu/vram specs to speed up quantization
    model.quantize(calibration_dataset, batch_size=args.batch_size)
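
The MMLU numbers above come from evaluating the quantized checkpoint; the harness isn't named in the report, so here is a minimal sketch assuming the model is saved and then scored with lm-evaluation-harness (the save path and batch size are placeholders, not from the original script):

    # Hedged sketch: save the quantized model, then score MMLU with lm-evaluation-harness (assumed workflow)
    model.save("llama-3.1-8b-instruct-3bit-gptq")

    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=llama-3.1-8b-instruct-3bit-gptq",
        tasks=["mmlu"],
        batch_size=8,
    )
    print(results["results"])  # per-task (and aggregate) MMLU scores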

Expected behavior

Performance should not break at 3 bits and should align with the AutoGPTQ library.
