`--quantize` is doing something surprising #788

malfet · 2024-05-13T21:04:49Z

Either quantization is surprisingly fast or it is not quantizing all the layers: quantizing stories110M and llama2 takes almost the same time (0.10 vs. 0.20 seconds), even though llama2 takes far longer to load:

% python3 torchchat.py generate stories110M --dtype float16 --quantize '{"linear:int8": {"groupsize": 0}}' --prompt "Once upon a time," --device mps
Using device=mps 
Loading model...
Time to load model: 0.85 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 0}}
Time to quantize model: 0.10 seconds

% python3 torchchat.py generate llama2 --dtype float16 --quantize '{"linear:int8": {"groupsize": 0}}' --prompt "Once upon a time," --device mps
Using device=mps 
Loading model...
Time to load model: 31.72 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 0}}
Time to quantize model: 0.20 seconds


Test plan:
```
% python3 torchchat.py generate llama2 --dtype float16 --quantize '{"linear:int8": {"groupsize": 0}}' --prompt "Once upon a time," --device mps
Using device=mps
Loading model...
Time to load model: 29.03 seconds
Quantizing the model with: {'linear:int8': {'groupsize': 0}}
Time to quantize model: 14.37 seconds
```
Fixes #788

malfet self-assigned this May 13, 2024

malfet mentioned this issue May 13, 2024

Fix LinearInt8 recursive quantization #791

Merged

malfet closed this as completed in #791 May 14, 2024
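
For readers wondering how a quantization pass can finish in near-constant time regardless of model size, here is a toy sketch of one possible failure mode suggested by the fix title ("recursive quantization"): a pass that only swaps linear layers found among a module's direct children. This is not torchchat's code and the actual change in #791 is not reproduced here; `Module`, `quantize_shallow`, `quantize_recursive`, `count`, and `build_model` are hypothetical names that use plain Python objects instead of `torch.nn` modules.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Toy stand-in for a model layer (hypothetical, not torch.nn.Module)."""
    kind: str
    children: dict = field(default_factory=dict)


def quantize_shallow(module):
    """Swap only direct 'linear' children; nested linears are never visited."""
    for name, child in list(module.children.items()):
        if child.kind == "linear":
            module.children[name] = Module("linear_int8")


def quantize_recursive(module):
    """Swap 'linear' children and descend into every other child."""
    for name, child in list(module.children.items()):
        if child.kind == "linear":
            module.children[name] = Module("linear_int8")
        else:
            quantize_recursive(child)


def count(module, kind):
    """Count descendants of `module` whose kind equals `kind`."""
    return sum((c.kind == kind) + count(c, kind) for c in module.children.values())


def build_model(n_blocks):
    """Build a toy model: n_blocks blocks with two linears each, plus an output linear."""
    blocks = {
        f"block{i}": Module("block", {"wq": Module("linear"), "wo": Module("linear")})
        for i in range(n_blocks)
    }
    return Module("model", {"layers": Module("list", blocks), "output": Module("linear")})


if __name__ == "__main__":
    shallow = build_model(4)
    quantize_shallow(shallow)
    print("shallow:   linear left =", count(shallow, "linear"))

    deep = build_model(4)
    quantize_recursive(deep)
    print("recursive: linear left =", count(deep, "linear"))
```

In this sketch the shallow pass does the same amount of work no matter how many blocks the model has, which would be consistent with the near-identical quantization times reported above. Counting the remaining unquantized layers after the pass is a quick way to tell "fast" apart from "skipped".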
