
[BUG] 3bit quant and/or inference regression vs AutoGPTQ #1278

@sidhantls

Description


Describe the bug
Quantizing a model to 3 bits using this repo leads to completely deteriorated performance. On MMLU, it gets 22%. However, when I quantize it using https://github.com/AutoGPTQ/AutoGPTQ, (which is where this repo was forked from?), I get 57%. This was using Llama-3.1-8B-Instruct.

Using this repo for 4-bit, I get 66% on MMLU, which is in line with what AutoGPTQ gets for 4 bits.

Has anyone else noticed that 3-bit doesn't work here but works in AutoGPTQ?
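
For reference, this is roughly the AutoGPTQ path that produces the 57% result. The exact script isn't included in this report, so the sketch below assumes AutoGPTQ's standard BaseQuantizeConfig API with the same bits/group_size settings; model_name and calibration_texts are placeholders:

    # Hedged sketch of the AutoGPTQ 3-bit path (assumed equivalent settings, not the exact script used)
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    quantize_config = BaseQuantizeConfig(bits=3, group_size=64)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

    # AutoGPTQ expects tokenized examples (dicts with input_ids / attention_mask)
    examples = [tokenizer(text) for text in calibration_texts]
    model.quantize(examples)
    model.save_quantized("llama-3.1-8b-instruct-3bit-autogptq")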

Software Info

Operating System/Version + Python Version
python 3.10

To Reproduce
Quantize model to 3 bits:

    from datasets import load_dataset
    from transformers import AutoTokenizer
    from gptqmodel import GPTQModel, QuantizeConfig

    # 512 calibration samples from the C4 English split
    calibration_dataset = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train"
    ).select(range(1024//2))["text"]

    # calibration_dataset = [" ".join(item.split()[:20]) for item in calibration_dataset] # speedup

    quantize_config = QuantizeConfig(
        bits=3,
        group_size=64,
    )

    model = GPTQModel.load(args.model_name, quantize_config)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, cache_dir=args.cache_dir)

    # increase `batch_size` to match gpu/vram specs to speed up quantization
    model.quantize(calibration_dataset, batch_size=args.batch_size)
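
The MMLU numbers above come from evaluating the quantized checkpoint; the harness isn't named in the report, so here is a minimal sketch assuming the model is saved and then scored with lm-evaluation-harness (the save path and batch size are placeholders, not from the original script):

    # Hedged sketch: save the quantized model, then score MMLU with lm-evaluation-harness (assumed workflow)
    model.save("llama-3.1-8b-instruct-3bit-gptq")

    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=llama-3.1-8b-instruct-3bit-gptq",
        tasks=["mmlu"],
        batch_size=8,
    )
    print(results["results"])  # per-task (and aggregate) MMLU scores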

Expected behavior

Performance should not break at 3 bits and should align with the AutoGPTQ library.
