
Commit d254d50

Update doc for client-usage and LWQ (#1947)
Signed-off-by: yiliu30 <[email protected]>
1 parent f253d35 commit d254d50

4 files changed: +30 -15 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testi
 
 ## What's New
 * [2024/07] From 3.0 release, framework extension API is recommended to be used for quantization.
-* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).
 
 ## Installation
 
docs/source/3x/PT_WeightOnlyQuant.md

Lines changed: 26 additions & 1 deletion
@@ -15,6 +15,7 @@ PyTorch Weight Only Quantization
 - [HQQ](#hqq)
 - [Specify Quantization Rules](#specify-quantization-rules)
 - [Saving and Loading](#saving-and-loading)
+- [Layer Wise Quantization](#layer-wise-quantization)
 - [Efficient Usage on Client-Side](#efficient-usage-on-client-side)
 - [Examples](#examples)
 
@@ -277,9 +278,33 @@ loaded_model = load(
 ) # Please note that the original_model parameter passes the original model.
 ```
 
+## Layer Wise Quantization
+
+As the size of LLMs continues to grow, loading the entire model into a single GPU card or into the RAM of a client machine becomes impractical. To address this challenge, we introduce Layer-wise Quantization (LWQ), a method that quantizes LLMs layer by layer or block by block. This approach significantly reduces memory consumption. The diagram below illustrates the LWQ process.
+
+<img src="./imgs/lwq.png" width=780 height=429>
+
+*Figure 1: The process of layer-wise quantization for a PyTorch model. Grey indicates empty parameters, blue indicates parameters that need to be quantized, and every rectangle inside the model represents one layer.*
+
+Currently, we support LWQ for `RTN`, `AutoRound`, and `GPTQ`.
+
+Here, we take the `RTN` algorithm as an example to demonstrate the usage of LWQ.
+
+```python
+from neural_compressor.torch.quantization import RTNConfig, convert, prepare
+from neural_compressor.torch import load_empty_model
+
+model_state_dict_path = "/path/to/model/state/dict"
+float_model = load_empty_model(model_state_dict_path)
+quant_config = RTNConfig(use_layer_wise=True)
+prepared_model = prepare(float_model, quant_config)
+quantized_model = convert(prepared_model)
+```
+
 ## Efficient Usage on Client-Side
 
-For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).
 
 
 ## Examples

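The added documentation covers `RTN` only, while stating that LWQ also supports `AutoRound` and `GPTQ`. As a rough illustration of the same flow with `GPTQ`, a minimal sketch is shown below; it assumes `GPTQConfig` exposes the same `use_layer_wise` flag as `RTNConfig` and that a small calibration pass is run between `prepare` and `convert`. The calibration function and path are placeholders, not part of this commit.

```python
from neural_compressor.torch.quantization import GPTQConfig, convert, prepare
from neural_compressor.torch import load_empty_model


def run_calibration(model):
    # Placeholder: feed a handful of representative batches through `model`.
    # The real calibration data and loop are application-specific and not shown here.
    pass


model_state_dict_path = "/path/to/model/state/dict"  # placeholder path, as in the RTN example
float_model = load_empty_model(model_state_dict_path)

quant_config = GPTQConfig(use_layer_wise=True)  # assumed flag, mirroring RTNConfig
prepared_model = prepare(float_model, quant_config)
run_calibration(prepared_model)  # GPTQ needs calibration data before conversion
quantized_model = convert(prepared_model)
```
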
docs/3x/client_quant.md renamed to docs/source/3x/client_quant.md

Lines changed: 3 additions & 13 deletions
@@ -2,20 +2,15 @@ Quantization on Client
 ==========================================
 
 1. [Introduction](#introduction)
-2. [Get Started](#get-started) \
-2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\
-2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage)
-
+2. [Get Started](#get-started)
 
 ## Introduction
 
-For `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `sever`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
+For `RTN` and `GPTQ` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
 
 
 ## Get Started
 
-### Get Default Algorithm Configuration
-
 Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.
 
 ```python
@@ -42,9 +37,4 @@ python main.py
 > [!TIP]
 > For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the `OMP_NUM_THREADS` explicitly. For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
 
-### Optimal Performance and Peak Memory Usage
-
-Below are approximate performance and memory usage figures conducted on a client machine with 24 cores and 32GB of RAM. These figures provide a rough estimate for quick reference and may vary based on specific hardware and configurations.
-
-- 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB.
-- 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB.
+RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). However, for higher accuracy, the GPTQ algorithm is recommended, but be prepared for a longer quantization time.
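The "Get Started" code block referenced above is elided in this diff. As a minimal sketch of what requesting a client-tuned default configuration might look like, see below; the helper `get_default_rtn_config` and its `processor_type` argument are assumptions for illustration and are not confirmed by this commit.

```python
from neural_compressor.torch.quantization import convert, prepare
from neural_compressor.torch import load_empty_model

# Assumed helper: returns the lightweight RTN defaults tailored for client machines.
from neural_compressor.torch.quantization import get_default_rtn_config

model_state_dict_path = "/path/to/model/state/dict"  # placeholder path
float_model = load_empty_model(model_state_dict_path)

quant_config = get_default_rtn_config(processor_type="client")  # assumed signature
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```
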

docs/source/3x/imgs/lwq.png

58.4 KB
