
Commit 5de9a4f

Kaihui-intel, pre-commit-ci[bot], and changwangss authored
Support transformers-like api for woq quantization (#1987)
Signed-off-by: Kaihui-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wang, Chang <[email protected]>
1 parent 9c39b42 commit 5de9a4f

File tree

32 files changed: +73062 / -67 lines


.azure-pipelines/ut-basic.yml

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ pr:
     - neural_compressor/torch
     - neural_compressor/tensorflow
     - neural_compressor/onnxrt
+    - neural_compressor/transformers
+    - neural_compressor/evaluation
     - .azure-pipelines/scripts/ut/3x
 
 pool: ICX-16C

.pre-commit-config.yaml

Lines changed: 2 additions & 1 deletion
@@ -129,7 +129,8 @@ repos:
             examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static/prompt.json|
             examples/notebook/dynas/ResNet50_Quantiation_Search_Supernet_NAS.ipynb|
             examples/notebook/dynas/Transformer_LT_Supernet_NAS.ipynb|
-            neural_compressor/torch/algorithms/fp8_quant/internal/diffusion_evaluation/SR_evaluation/imagenet1000_clsidx_to_labels.txt
+            neural_compressor/torch/algorithms/fp8_quant/internal/diffusion_evaluation/SR_evaluation/imagenet1000_clsidx_to_labels.txt|
+            neural_compressor/evaluation/hf_eval/datasets/cnn_validation.json
         )$
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
@@ -0,0 +1,168 @@
# Step-by-Step

We provide a Transformers-like API for model compression using `WeightOnlyQuant` with the `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms; in addition, Intel Extension for PyTorch (IPEX) can be used to accelerate the model.

We provide the inference benchmarking script `run_generation.py` for large language models; the default search algorithm is beam search with `num_beams = 4`. [Here](./llm_quantization_recipes.md) are some validated models with well-optimized accuracy and performance; more models are in progress.
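For orientation, the snippet below is a minimal sketch of how a Transformers-like weight-only quantization flow typically looks in Python. The `neural_compressor.transformers` import path, the `RtnConfig` arguments, and the model name are assumptions for illustration only; please check `run_generate_cpu_woq.py` for the exact usage.

```python
# Minimal sketch of the Transformers-like WOQ API (assumed; see run_generate_cpu_woq.py for the exact usage).
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig  # assumed import path

model_name = "facebook/opt-125m"   # illustrative model
woq_config = RtnConfig(bits=4)     # 4-bit round-to-nearest weight-only config (assumed arguments)

tokenizer = AutoTokenizer.from_pretrained(model_name)
q_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

# Save the quantized model; "./saved_results" matches the script's default --output_dir.
q_model.save_pretrained("./saved_results")
tokenizer.save_pretrained("./saved_results")
```

The command-line flows below wrap this same pattern behind `run_generate_cpu_woq.py` flags.
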
# Quantization for CPU device

## Prerequisite
### Create Environment
Python 3.9 or higher is required due to a [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in `requirements_cpu_woq.txt`; we recommend creating the environment as follows.

```bash
pip install -r requirements_cpu_woq.txt
```

### Run
#### Performance
```shell
# fp32
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --batch_size 1 \
    --benchmark

# quantize and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --woq \
    --woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", and "AutoRound" are also provided.
    --output_dir <WOQ_MODEL_SAVE_PATH> \ # Default is "./saved_results"
    --batch_size <BATCH_SIZE> \
    --benchmark

# load the WOQ quantized model and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <WOQ_MODEL_SAVE_PATH> \
    --benchmark

# load a WOQ model from Hugging Face and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --benchmark

```
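
To make the "load the WOQ quantized model" step concrete, here is a minimal Python sketch of loading a previously saved model and generating text, which is roughly what the benchmark does per request. The import path, the loading behavior, and the saved path are assumptions for illustration; the script handles this for you.

```python
# Rough sketch (assumed API): load a saved WOQ model and run beam-search generation.
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM  # assumed import path

saved_dir = "./saved_results"  # illustrative WOQ_MODEL_SAVE_PATH
tokenizer = AutoTokenizer.from_pretrained(saved_dir)
model = AutoModelForCausalLM.from_pretrained(saved_dir)  # assumed: quantization settings are read from the checkpoint

inputs = tokenizer("Once upon a time,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)  # num_beams=4 matches this README's default
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
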
#### Accuracy
The accuracy validation is based on [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.3/lm_eval/__main__.py).
```shell
# fp32
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56

# quantize and evaluate accuracy.
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --woq \
    --woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", and "AutoRound" are also provided.
    --output_dir <WOQ_MODEL_SAVE_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy.
python run_generate_cpu_woq.py \
    --model <WOQ_MODEL_SAVE_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy with NeuralSpeed.
# only models quantized with the "Awq", "GPTQ", or "AutoRound" algorithms are supported.
python run_generate_cpu_woq.py \
    --model <WOQ_MODEL_SAVE_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56


# load a WOQ model from Hugging Face and evaluate accuracy.
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56

# load a WOQ model from Hugging Face and evaluate accuracy with NeuralSpeed.
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56

```
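
For reference, the `--accuracy` flow above is built on lm-evaluation-harness. The sketch below shows a roughly equivalent direct call, assuming lm-eval v0.4.x is installed and using an illustrative fp32 model; the script plugs the (quantized) model into the harness for you.

```python
# Rough sketch: direct lm-evaluation-harness call, comparable to
# "--accuracy --tasks lambada_openai,piqa,hellaswag --batch_size 56".
from lm_eval import simple_evaluate  # assumes lm-eval v0.4.x

results = simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=facebook/opt-125m",    # illustrative fp32 baseline model
    tasks=["lambada_openai", "piqa", "hellaswag"],
    batch_size=56,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```
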

# Quantization for GPU device
>**Note**:
> 1. The default search algorithm is beam search with `num_beams = 1`.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) supports optimized inference for the "gptj," "mistral," "qwen," and "llama" model types to achieve high performance and accuracy, and it also ensures accurate inference for other model types.
> 3. We provide the `WeightOnlyQuant` compression technology with the `Rtn/GPTQ/AutoRound` algorithms; `load_in_4bit` and `load_in_8bit` also work on the Intel GPU device.
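
To illustrate note 3, here is a minimal sketch of 4-bit loading on an Intel GPU with the Transformers-like API. The import path, the `device_map="xpu"` argument, and the model name are assumptions for illustration; see `run_generation_gpu_woq.py` for the exact usage.

```python
# Minimal sketch (assumed API): weight-only 4-bit load and generation on an Intel GPU (XPU).
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the XPU backend
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM  # assumed import path

model_name = "EleutherAI/gpt-j-6b"  # one of the validated model types from note 2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,   # from note 3; load_in_8bit=True is the 8-bit variant
    device_map="xpu",    # assumed way to target the Intel GPU
)

inputs = tokenizer("Once upon a time,", return_tensors="pt").to("xpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
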
## Prerequisite
### Dependencies
Intel-extension-for-pytorch dependencies are provided by the oneAPI package, so oneAPI must be installed before intel-extension-for-pytorch. Please refer to the [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.10%2Bxpu) to install oneAPI to the "/opt/intel" folder.

### Create Environment
PyTorch and Intel-extension-for-pytorch versions greater than 2.1 for Intel GPU are required, and Python 3.9 or higher is required due to a [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in requirements_GPU.txt; we recommend creating the environment as follows. For now, Intel-extension-for-pytorch must be installed from source; weight-only quantization will be added to Intel-extension-for-pytorch in the next release.

>**Note**: please install transformers==4.40.2.

```bash
pip install -r requirements_GPU.txt
pip install transformers==4.38.1 # llama uses 4.38.1
source /opt/intel/oneapi/setvars.sh
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
git submodule update --init --recursive
export USE_AOT_DEVLIST='pvc,ats-m150'
export BUILD_WITH_CPU=OFF

python setup.py install
```

## Run
The following commands show how to use it.

### 1. Performance
```bash
# fp16
python run_generation_gpu_woq.py \
    --model EleutherAI/gpt-j-6b \
    --benchmark

# weight-only quantization
python run_generation_gpu_woq.py \
    --model EleutherAI/gpt-j-6b \
    --woq \
    --woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ" and "AutoRound" are also provided.
    --benchmark
```
> Note: If your device memory is not enough, please quantize and save the model first, then rerun the example and load the model as shown below. If your device memory is enough, skip the two-step flow below and just quantize and run inference directly.
```bash
# Step 1: quantize and save the model
python run_generation_gpu_woq.py \
    --model EleutherAI/gpt-j-6b \
    --woq \ # default quantization algorithm is Rtn
    --woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ" and "AutoRound" are also provided.
    --output_dir "saved_dir"

# Step 2: load the model and run inference
python run_generation_gpu_woq.py \
    --model "saved_dir" \
    --benchmark
```

### 2. Accuracy
```bash
# evaluate the model quantized by the steps above
python run_generation_gpu_woq.py \
    --model "saved_dir" \
    --accuracy \
    --tasks "lambada_openai"
```
