Commit 95197d1

Authored by xin3he, Yantom1, RonBenMosheHabana, ulivne, and HolyFalafel
Cherry pick v1.17.0 (#1964)
* [SW-184941] INC CI, CD and Promotion (Change-Id: I60c420f9776e1bdab7bb9e02e5bcbdb6891bfe52)
* [SW-183320] Updated setup.py (Change-Id: I592af89486cb1d9e0b5197521c428920197a9103)
* [SW-177474] Add HQT FP8 porting code (Change-Id: I4676f13a5ed43c444f2ec68675cc41335e7234dd; Signed-off-by: Zhou Yuwen)
* [SW-189361] Fix white list extend (Change-Id: Ic2021c248798fce37710d28014a6d59259c868a3)
* [SW-191317] Raise exception according to hqt config object (Change-Id: I06ba8fa912c811c88912987c11e5c12ef328348a)
* [SW-184714] Port HQT code into INC: HQT lib content was copied as-is under fp8_quant; tests were copied to the 3.x torch location (Change-Id: Iec6e1fa7ac4bf1df1c95b429524c40e32bc13ac9)
* [SW-184714] Add internal folder to fp8_quant: a folder used for experiments, not intended for users (Change-Id: I9e221ae582794e304e95392c0f37638f7bce69bc)
* [SW-177468] Removed unused code and cleaned up (Change-Id: I4d27c067e87c1a30eb1da9df16a16c46d092c638)
* Fix errors in regression_detection (Change-Id: Iee5318bd5593ba349812516eb5641958ece3c438)
* [SW-187731] Save the original module as a member of the patched module; this allows direct use of the original module's methods and solves a torch compile issue (Change-Id: I464d8bd1bacdfc3cd1f128a67114e1e43f092632)
* [SW-190899] Install packages according to configuration (Change-Id: I570b490658f5d2c5399ba1db93f8f52f56449525)
* [SW-184689] Use finalize_calibration internally for the one-step flow (Change-Id: Ie0b8b426c951cf57ed7e6e678c86813fb2d05c89)
* [SW-191945] Align requirement_pt.txt in gerrit INC with GitHub INC (Change-Id: If5c0dbf21bf989af37a8e29246e4f8760cd215ef; Signed-off-by: xinhe3)
* [SW-192358] Remove HQT reference in INC (Change-Id: Ic25f9323486596fa2dc6d909cd568a37ab84dd5e)
* [SW-191415] Update fp8 maxAbs observer using torch.copy_ (Change-Id: I3923c832f9a8a2b14e392f3f4719d233a457702f)
* [SW-184943] Enhance INC WOQ model loading: support loading Hugging Face WOQ models, abstract a WeightOnlyLinear base class, add INCWeightOnlyLinear and HPUWeightOnlyLinear subclasses, load WOQ linear weights module by module, and save the HPU-format tensor so it can be reused on the next load (see the sketch after this list) (Change-Id: I679a42759b49e1f45f52bbb0bdae8580a23d0bcf)
* [SW-190303] Implement HPUWeightOnlyLinear class in INC (Change-Id: Ie05c8787e708e2c3559dce24ef0758d6c498ac41)
* [SW-192809] Fix json_file bug when instantiating the FP8Config class (Change-Id: I4a715d0a706efe20ccdb49033755cabbc729ccdc; Signed-off-by: Zhou Yuwen)
* [SW-192931] Align setup.py with GitHub INC and remove fp8_convert (Change-Id: Ibbc157646cfcfad64b323ecfd96b9bbda5ba9e2f; Signed-off-by: xinhe3)
* [SW-192917] Update all HQT logic files with pre-commit check (Change-Id: I119dc8578cb10932fd1a8a674a8bdbf61f978e42; Signed-off-by: xinhe3)
* Update docstring (Signed-off-by: yuwenzho)
* Add fp8 example and document (#1639) (Signed-off-by: xinhe3)
* Update settings to be compatible with gerrit
* Enhance UT (Signed-off-by: yuwenzho)
* Move fp8 sample to the helloworld folder (Signed-off-by: yuwenzho)
* Update torch version of the habana docker (Signed-off-by: xinhe3)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Update readme demo (Signed-off-by: xinhe3)
* Update WeightOnlyLinear to INCWeightOnlyLinear (Signed-off-by: xinhe3)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Add docstring for FP8Config (Signed-off-by: xinhe3)
* Fix pylint (Signed-off-by: xinhe3)
* Update fp8 test scripts (Signed-off-by: chensuyue)
* Delete deps (Signed-off-by: chensuyue)
* Update container to v1.17.0 (Signed-off-by: chensuyue)
* Update docker version (Signed-off-by: xinhe3)
* Update PT UT (Signed-off-by: chensuyue)
* Add lib path (Signed-off-by: chensuyue)
* Fix dir issue (Signed-off-by: xinhe3)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Update fp8 test scope (Signed-off-by: chensuyue)
* Fix typo (Signed-off-by: xinhe3)
* Update fp8 test scope (Signed-off-by: chensuyue)
* Update pre-commit-ci (Signed-off-by: chensuyue)
* Work around for hpu (Signed-off-by: xinhe3)
* Fix UT (Signed-off-by: xinhe3)
* Fix parameter (Signed-off-by: chensuyue)
* Omit some tests (Signed-off-by: chensuyue)
* Update main page example to LLM loading (Signed-off-by: xinhe3)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Fix autotune (Signed-off-by: xinhe3)

Signed-off-by: Zhou Yuwen, xinhe3, yuwenzho, chensuyue
Co-authored-by: yan tomsinsky, Ron Ben Moshe, Uri Livne, Danny Semiat, smarkovichgolan, Dudi Lester
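The WOQ-loading entries above describe a small class hierarchy. The following is only an illustrative sketch of that structure; the method names, signatures, and module placement are assumptions, not INC's actual API.

```python
# Illustrative sketch only: the real classes live in neural_compressor.torch and
# their interfaces may differ. It mirrors the structure described in the commit
# message: an abstract WeightOnlyLinear base class with INC and HPU subclasses.
from abc import ABC, abstractmethod

import torch


class WeightOnlyLinear(torch.nn.Module, ABC):
    """Abstract base class for weight-only quantized linear modules (sketch)."""

    @abstractmethod
    def unpack(self) -> torch.Tensor:
        """Return the dequantized weight recovered from the packed storage."""


class INCWeightOnlyLinear(WeightOnlyLinear):
    """Generic packed format used when loading a Hugging Face WOQ checkpoint (sketch)."""

    def unpack(self) -> torch.Tensor:
        raise NotImplementedError  # placeholder in this sketch


class HPUWeightOnlyLinear(WeightOnlyLinear):
    """HPU-format weights, saved after the first load so later loads can reuse them (sketch)."""

    def unpack(self) -> torch.Tensor:
        raise NotImplementedError  # placeholder in this sketch
```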
1 parent de0fa21 commit 95197d1

File tree: 123 files changed (+8905 additions, -5804 deletions)


.azure-pipelines/scripts/codeScan/pydocstyle/scan_path.txt

Lines changed: 1 addition & 0 deletions
@@ -25,4 +25,5 @@
 /neural-compressor/neural_compressor/torch/algorithms/static_quant
 /neural-compressor/neural_compressor/torch/algorithms/weight_only
 /neural-compressor/neural_compressor/torch/export
+/neural-compressor/neural_compressor/torch/quantization
 /neural-compressor/neural_compressor/torch/utils

.azure-pipelines/scripts/ut/3x/coverage.3x_pt

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ include =
     */neural_compressor/common/*
     */neural_compressor/torch/*
 omit =
-    */neural_compressor/torch/algorithms/habana_fp8/*
+    */neural_compressor/torch/algorithms/fp8_quant/*
     */neural_compressor/torch/amp/*
 exclude_lines =
     pragma: no cover

.azure-pipelines/scripts/ut/3x/coverage.3x_pt_fp8

Lines changed: 1 addition & 2 deletions
@@ -3,8 +3,7 @@ branch = True
 
 [report]
 include =
-    */neural_compressor/torch/algorithms/habana_fp8/*
-    */neural_compressor/torch/amp/*
+    */neural_compressor/torch/algorithms/fp8_quant/*
 exclude_lines =
     pragma: no cover
     raise NotImplementedError

.azure-pipelines/scripts/ut/3x/run_3x_pt.sh

Lines changed: 1 addition & 1 deletion
@@ -15,8 +15,8 @@ export COVERAGE_RCFILE=/neural-compressor/.azure-pipelines/scripts/ut/3x/coverag
 inc_path=$(python -c 'import neural_compressor; print(neural_compressor.__path__[0])')
 cd /neural-compressor/test/3x || exit 1
 rm -rf tensorflow
-rm -rf onnxrt
 rm -rf torch/algorithms/fp8_quant
+rm -rf torch/quantization/fp8_quant
 
 LOG_DIR=/neural-compressor/log_dir
 mkdir -p ${LOG_DIR}

.azure-pipelines/scripts/ut/3x/run_3x_pt_fp8.sh

Lines changed: 8 additions & 1 deletion
@@ -5,11 +5,13 @@ echo "${test_case}"
 
 # install requirements
 echo "set up UT env..."
+export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
 sed -i '/^intel_extension_for_pytorch/d' /neural-compressor/test/3x/torch/requirements.txt
 pip install -r /neural-compressor/test/3x/torch/requirements.txt
 pip install git+https://github.com/HabanaAI/[email protected]
 pip install pytest-cov
 pip install pytest-html
+pip install pytest-html-merger
 pip list
 
 export COVERAGE_RCFILE=/neural-compressor/.azure-pipelines/scripts/ut/3x/coverage.3x_pt_fp8
@@ -19,8 +21,13 @@ cd /neural-compressor/test/3x || exit 1
 LOG_DIR=/neural-compressor/log_dir
 mkdir -p ${LOG_DIR}
 ut_log_name=${LOG_DIR}/ut_3x_pt_fp8.log
-pytest --cov="${inc_path}" -vs --disable-warnings --html=report.html --self-contained-html torch/algorithms/fp8_quant 2>&1 | tee -a ${ut_log_name}
+pytest --cov="${inc_path}" -vs --disable-warnings --html=report_1.html --self-contained-html torch/quantization/weight_only/test_load.py 2>&1 | tee -a ${ut_log_name}
+pytest --cov="${inc_path}" -vs --disable-warnings --html=report_2.html --self-contained-html torch/quantization/weight_only/test_rtn.py 2>&1 | tee -a ${ut_log_name}
+# pytest --cov="${inc_path}" -vs --disable-warnings --html=report_3.html --self-contained-html torch/quantization/weight_only/test_autoround.py 2>&1 | tee -a ${ut_log_name}
+pytest --cov="${inc_path}" -vs --disable-warnings --html=report_4.html --self-contained-html torch/quantization/fp8_quant 2>&1 | tee -a ${ut_log_name}
 
+mkdir -p report && mv *.html report
+pytest_html_merger -i ./report -o ./report.html
 cp report.html ${LOG_DIR}/
 
 if [ $(grep -c '== FAILURES ==' ${ut_log_name}) != 0 ] || [ $(grep -c '== ERRORS ==' ${ut_log_name}) != 0 ] || [ $(grep -c ' passed' ${ut_log_name}) == 0 ]; then

.azure-pipelines/template/docker-template.yml

Lines changed: 2 additions & 2 deletions
@@ -74,7 +74,7 @@ steps:
 
   - ${{ if eq(parameters.imageSource, 'pull') }}:
     - script: |
-        docker pull vault.habana.ai/gaudi-docker/1.16.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
+        docker pull vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
       displayName: "Pull habana docker image"
 
   - script: |
@@ -95,7 +95,7 @@
       else
         docker run -dit --disable-content-trust --privileged --name=${{ parameters.containerName }} --shm-size="2g" \
           --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host \
-          -v ${BUILD_SOURCESDIRECTORY}:/neural-compressor vault.habana.ai/gaudi-docker/1.16.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
+          -v ${BUILD_SOURCESDIRECTORY}:/neural-compressor vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
       fi
       echo "Show the container list after docker run ... "
       docker ps -a

.azure-pipelines/ut-3x-pt-fp8.yml

Lines changed: 6 additions & 0 deletions
@@ -10,6 +10,12 @@ pr:
     include:
       - .azure-pipelines/scripts/ut/3x/run_3x_pt_fp8.sh
      - .azure-pipelines/ut-3x-pt-fp8.yml
+      - neural_compressor/common
+      - neural_compressor/torch
+      - test/3x/torch/algorithms/fp8_quant
+      - test/3x/torch/quantization/fp8_quant
+      - setup.py
+      - requirements_pt.txt
 
 pool: GAUDI
 
.pre-commit-config.yaml

Lines changed: 2 additions & 1 deletion
@@ -128,7 +128,8 @@ repos:
           examples/.*(txt|patch)|
           examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static/prompt.json|
           examples/notebook/dynas/ResNet50_Quantiation_Search_Supernet_NAS.ipynb|
-          examples/notebook/dynas/Transformer_LT_Supernet_NAS.ipynb
+          examples/notebook/dynas/Transformer_LT_Supernet_NAS.ipynb|
+          neural_compressor/torch/algorithms/fp8_quant/internal/diffusion_evaluation/SR_evaluation/imagenet1000_clsidx_to_labels.txt
         )$
 
   - repo: https://github.com/astral-sh/ruff-pre-commit

README.md

Lines changed: 32 additions & 47 deletions
@@ -71,66 +71,50 @@ pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
 ```
 After successfully installing these packages, try your first quantization program.
 
-### Weight-Only Quantization (LLMs)
-Following example code demonstrates Weight-Only Quantization on LLMs, it supports Intel CPU, Intel Gaudi2 AI Accelerator, Nvidia GPU, best device will be selected automatically.
+### [FP8 Quantization](./examples/3.x_api/pytorch/cv/fp8_quant/)
+Following example code demonstrates FP8 Quantization, it is supported by Intel Gaudi2 AI Accelerator.
 
 To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
 ```bash
 # Run a container with an interactive shell
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest
-
-# Install the optimum-habana
-pip install --upgrade-strategy eager optimum[habana]
-
-# Install INC/auto_round
-pip install neural-compressor auto_round
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
 ```
 Run the example:
 ```python
-from transformers import AutoModel, AutoTokenizer
-
-from neural_compressor.config import PostTrainingQuantConfig
-from neural_compressor.quantization import fit
-from neural_compressor.adaptor.torch_utils.auto_round import get_dataloader
-
-model_name = "EleutherAI/gpt-neo-125m"
-float_model = AutoModel.from_pretrained(model_name)
-tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
-dataloader = get_dataloader(tokenizer, seqlen=2048)
-
-woq_conf = PostTrainingQuantConfig(
-    approach="weight_only",
-    op_type_dict={
-        ".*": {  # match all ops
-            "weight": {
-                "dtype": "int",
-                "bits": 4,
-                "algorithm": "AUTOROUND",
-            },
-        }
-    },
+from neural_compressor.torch.quantization import (
+    FP8Config,
+    prepare,
+    convert,
 )
-quantized_model = fit(model=float_model, conf=woq_conf, calib_dataloader=dataloader)
+import torchvision.models as models
+
+model = models.resnet18()
+qconfig = FP8Config(fp8_config="E4M3")
+model = prepare(model, qconfig)
+# customer defined calibration
+calib_func(model)
+model = convert(model)
 ```
-**Note:**
 
-To try INT4 model inference, please directly use [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers), which leverages Intel Neural Compressor for model quantization.
+### Weight-Only Large Language Model Loading (LLMs)
 
-### Static Quantization (Non-LLMs)
+Following example code demonstrates weight-only large language model loading on Intel Gaudi2 AI Accelerator.
 
 ```python
-from torchvision import models
+from neural_compressor.torch.quantization import load
+
+model_name = "TheBloke/Llama-2-7B-GPTQ"
+model = load(
+    model_name_or_path=model_name,
+    format="huggingface",
+    device="hpu",
+    torch_dtype=torch.bfloat16,
+)
+```
 
-from neural_compressor.config import PostTrainingQuantConfig
-from neural_compressor.data import DataLoader, Datasets
-from neural_compressor.quantization import fit
+**Note:**
 
-float_model = models.resnet18()
-dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
-calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)
-static_quant_conf = PostTrainingQuantConfig()
-quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloader=calib_dataloader)
-```
+Intel Neural Compressor will convert the model format from auto-gptq to hpu format on the first load and save hpu_model.safetensors to the local cache directory for the next load. So it may take a while to load for the first time.
 
 ## Documentation
 
@@ -157,12 +141,13 @@ quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloade
 <tbody>
   <tr>
     <td colspan="2" align="center"><a href="./docs/source/3x/PyTorch.md">Overview</a></td>
-    <td colspan="2" align="center"><a href="./docs/source/3x/PT_StaticQuant.md">Static Quantization</a></td>
     <td colspan="2" align="center"><a href="./docs/source/3x/PT_DynamicQuant.md">Dynamic Quantization</a></td>
+    <td colspan="2" align="center"><a href="./docs/source/3x/PT_StaticQuant.md">Static Quantization</a></td>
     <td colspan="2" align="center"><a href="./docs/source/3x/PT_SmoothQuant.md">Smooth Quantization</a></td>
   </tr>
   <tr>
-    <td colspan="4" align="center"><a href="./docs/source/3x/PT_WeightOnlyQuant.md">Weight-Only Quantization</a></td>
+    <td colspan="2" align="center"><a href="./docs/source/3x/PT_WeightOnlyQuant.md">Weight-Only Quantization</a></td>
+    <td colspan="2" align="center"><a href="./docs/3x/PT_FP8Quant.md">FP8 Quantization</a></td>
     <td colspan="2" align="center"><a href="./docs/source/3x/PT_MXQuant.md">MX Quantization</a></td>
     <td colspan="2" align="center"><a href="./docs/source/3x/PT_MixedPrecision.md">Mixed Precision</a></td>
   </tr>
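The weight-only loading snippet in the hunk above uses torch.bfloat16 without importing torch; a fuller usage sketch might look like the following. It assumes the object returned by load() behaves like a standard transformers causal LM and that inputs can be placed on the "hpu" device; this is an illustration, not a documented guarantee.

```python
import torch
from transformers import AutoTokenizer
from neural_compressor.torch.quantization import load

model_name = "TheBloke/Llama-2-7B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the GPTQ checkpoint in huggingface format onto the Gaudi2 (HPU) device.
model = load(
    model_name_or_path=model_name,
    format="huggingface",
    device="hpu",
    torch_dtype=torch.bfloat16,
)

# Assumption: the returned model exposes the usual transformers generate() API.
inputs = tokenizer("What does FP8 quantization do?", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```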

docs/3x/PT_FP8Quant.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
FP8 Quantization
=======

1. [Introduction](#introduction)
2. [Supported Parameters](#supported-parameters)
3. [Get Started with FP8 Quantization](#get-started-with-fp8-quantization)
4. [Examples](#examples)

## Introduction

8-bit floating point (FP8) is a promising data type for low-precision quantization; it provides a data distribution that is completely different from INT8, as shown below.

<div align="center">
    <img src="./imgs/fp8_dtype.png" height="250"/>
</div>

Intel Gaudi2, also known as HPU, provides this data type capability for low-precision quantization, which includes `E4M3` and `E5M2`. For more information about these two data types, please refer to [link](https://arxiv.org/abs/2209.05433).

Intel Neural Compressor provides general quantization APIs to leverage HPU FP8 capability; with a simple API you obtain an 8-bit model with lower memory usage and lower compute cost.

## Supported Parameters

| Attribute | Description | Values |
|-----------|-------------|--------|
| fp8_config | The target data type of FP8 quantization. | E4M3 (default) - As Fig. 2<br>E5M2 - As Fig. 1. |
| hp_dtype | The high precision data type of non-FP8 operators. | bf16 (default) - torch.bfloat16<br>fp16 - torch.float16<br>fp32 - torch.float32 |
| observer | The observer to measure the statistics. | maxabs (default), saves all tensors to files. |
| allowlist | List of nn.Module names or types to quantize. When setting an empty list, all the supported modules will be quantized by default. See Supported Modules. Not setting the list at all is not recommended as it will set the allowlist to these modules only: torch.nn.Linear, torch.nn.Conv2d, and BMM. | Default = {'names': [], 'types': FP8_WHITE_LIST}, where FP8_WHITE_LIST = ["Matmul", "Linear", "FalconLinear", "KVCache", "Conv2d", "LoRACompatibleLinear", "LoRACompatibleConv", "Softmax", "ModuleFusedSDPA", "LinearLayer", "LinearAllreduce", "ScopedLinearAllReduce", "LmHeadLinearAllreduce"] |
| blocklist | List of nn.Module names or types not to quantize. Defaults to an empty list, so you may omit it from the config file. | Default = {'names': [], 'types': ()} |
| mode | The mode, measure or quantize, to run HQT with. | MEASURE - Measure statistics of all modules and emit the results to dump_stats_path.<br>QUANTIZE - Quantize and run the model according to the provided measurements.<br>AUTO (default) - Select from [MEASURE, QUANTIZE] automatically. |
| dump_stats_path | The path to save and load the measurements. The path is created up until the level before the last "/". The string after the last "/" is used as a prefix for all the measurement files that will be created. | Default = "./hqt_output/measure" |
| scale_method | The method for calculating the scale from the measurement. | without_scale - Convert to/from FP8 without scaling.<br>unit_scale - Always use a scale of 1.<br>maxabs_hw (default) - Scale is calculated to stretch/compress the maxabs measurement to the full scale of FP8 and then aligned to the corresponding HW accelerated scale.<br>maxabs_pow2 - Scale is calculated to stretch/compress the maxabs measurement to the full scale of FP8 and then rounded to a power of 2.<br>maxabs_hw_opt_weight - Scale of model params (weights) is chosen as the scale that provides the minimal mean-square error between quantized and non-quantized weights, from all possible HW accelerated scales. Scale of activations is calculated the same as maxabs_hw.<br>act_maxabs_pow2_weights_pcs_opt_pow2 - Scale of model params (weights) is calculated per channel of the params tensor. The per-channel scale is calculated the same as maxabs_hw_opt_weight. Scale of activations is calculated the same as maxabs_pow2.<br>act_maxabs_hw_weights_pcs_maxabs_pow2 - Scale of model params (weights) is calculated per channel of the params tensor. The per-channel scale is calculated the same as maxabs_pow2. Scale of activations is calculated the same as maxabs_hw. |
| measure_exclude | If this attribute is not defined, the default is OUTPUT. Since most models do not require measuring output tensors, you can exclude it to speed up the measurement process. | NONE - All tensors are measured.<br>OUTPUT (default) - Excludes measurement of output tensors. |
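As a rough sketch of how the attributes in the table map onto the config object used in the demo below, a more fully specified config might look like this. Whether FP8Config accepts all of these attributes as keyword arguments is an assumption; the commit message also mentions instantiating it from a json_file.

```python
from neural_compressor.torch.quantization import FP8Config

# Sketch only: attribute names come from the table above; treating them all as
# keyword arguments of FP8Config is an assumption, not a documented signature.
qconfig = FP8Config(
    fp8_config="E4M3",                      # target FP8 format
    hp_dtype="bf16",                        # high-precision dtype for non-FP8 ops
    observer="maxabs",                      # statistics observer
    scale_method="maxabs_hw",               # scale calculation method
    dump_stats_path="./hqt_output/measure", # where measurements are saved/loaded
    measure_exclude="OUTPUT",               # skip measuring output tensors
)
```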

## Get Started with FP8 Quantization

### Demo Usage

```python
from neural_compressor.torch.quantization import (
    FP8Config,
    prepare,
    convert,
)
import torchvision.models as models

model = models.resnet18()
qconfig = FP8Config(fp8_config="E4M3")
model = prepare(model, qconfig)
# customer defined calibration
calib_func(model)
model = convert(model)
```
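The demo leaves calib_func to the user. A minimal sketch of such a calibration routine, using random stand-in data (real usage would feed representative batches on the same device as the model), could be:

```python
import torch


def calib_func(model, num_batches=10):
    # Feed a few batches through the prepared model so the observers attached by
    # prepare() can record the statistics needed to compute FP8 scales.
    model.eval()
    with torch.no_grad():
        for _ in range(num_batches):
            images = torch.randn(1, 3, 224, 224)  # stand-in for real calibration data
            model(images)
```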

## Examples

| Task | Example |
|----------------------|---------|
| Computer Vision (CV) | [Link](../../examples/3.x_api/pytorch/cv/fp8_quant/) |
| Large Language Model (LLM) | [Link](https://github.com/HabanaAI/optimum-habana-fork/tree/habana-main/examples/text-generation#running-with-fp8) |

> Note: For LLM, Optimum-habana provides higher performance based on modified modeling files, so the LLM link above goes to Optimum-habana, which uses Intel Neural Compressor for FP8 quantization internally.
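To make the scale_method wording in the table above concrete, here is a rough numerical sketch of the maxabs_pow2 idea. The constant and formula are assumptions derived from the description, not the library's implementation.

```python
import math

FP8_E4M3_MAX = 448.0  # max value of OCP E4M3; used here only for illustration


def maxabs_pow2_scale(maxabs_measurement: float) -> float:
    """Stretch/compress the measured max-abs to the FP8 full scale, then round
    the resulting scale to a power of two (sketch of the table's description)."""
    if maxabs_measurement <= 0.0:
        return 1.0  # degenerate case: fall back to a unit scale
    raw_scale = maxabs_measurement / FP8_E4M3_MAX
    return 2.0 ** round(math.log2(raw_scale))
```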

examples/.config/model_params_pytorch_3x.json

Lines changed: 7 additions & 0 deletions
@@ -140,6 +140,13 @@
     "main_script": "main.py",
     "batch_size": 1
   },
+  "resnet18_fp8_static":{
+    "model_src_dir": "cv/fp8_quant",
+    "dataset_location": "/tf_dataset/pytorch/ImageNet/raw",
+    "input_model": "",
+    "main_script": "main.py",
+    "batch_size": 1
+  },
   "opt_125m_pt2e_static":{
     "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/static_quant/pt2e",
     "dataset_location": "",
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# ImageNet FP8 Quantization

This example implements FP8 quantization of popular model architectures, such as ResNet, on the ImageNet dataset; it is supported by the Intel Gaudi2 AI Accelerator.

## Requirements

To try this on Intel Gaudi2, a docker image with the Gaudi Software Stack is recommended; please refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
```bash
# Run a container with an interactive shell
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
```

- Install requirements
  - `pip install -r requirements.txt`
- Download the ImageNet dataset from http://www.image-net.org/
  - Then, move and extract the training and validation images to labeled subfolders, using [the following shell script](extract_ILSVRC.sh)

## Quantization

To quantize a model and validate its accuracy, run `main.py` with the desired model architecture and the path to the ImageNet dataset:

```bash
python main.py --pretrained -t -a resnet50 -b 30 /path/to/imagenet
```
or
```bash
bash run_quant.sh --input_model=resnet50 --dataset_location=/path/to/imagenet
```
