README.md (1 addition, 1 deletion)

@@ -28,7 +28,7 @@ support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testing

## What's New

* [2024/07] From 3.0 release, framework extension API is recommended to be used for quantization.
- * [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+ * [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).

- [Efficient Usage on Client-Side](#efficient-usage-on-client-side)
- [Examples](#examples)

@@ -277,9 +278,33 @@ loaded_model = load(
) # Please note that the original_model parameter takes the original model.
```

## Layer Wise Quantization

As the size of LLMs continues to grow, loading the entire model into a single GPU card or the RAM of a client machine becomes impractical. To address this challenge, we introduce Layer-wise Quantization (LWQ), a method that quantizes LLMs layer by layer or block by block. This approach significantly reduces memory consumption. The diagram below illustrates the LWQ process.

<img src="./imgs/lwq.png" width=780 height=429>

*Figure 1: The process of layer-wise quantization for a PyTorch model. The color grey indicates empty parameters, the color blue indicates parameters that need to be quantized, and every rectangle inside the model represents one layer.*
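
To make the flow in Figure 1 concrete, the short sketch below walks a model one layer at a time: materialize one layer's float weights, quantize them with plain round-to-nearest, then drop the float copy before moving on. This is an illustration only, not the Neural Compressor implementation; the `quantize_tensor_rtn` helper and the `load_layer_weights` callback are hypothetical names introduced here.

```python
import torch


def quantize_tensor_rtn(weight: torch.Tensor, bits: int = 8):
    """Symmetric round-to-nearest quantization of a single weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().amax().clamp(min=1e-8) / qmax
    qweight = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
    return qweight, scale


def layer_wise_quantize(layer_names, load_layer_weights):
    """Quantize a model one layer at a time.

    load_layer_weights(name) is a caller-supplied function that reads a single
    layer's float weights (e.g., from a checkpoint shard), so the full float
    model never has to be resident in memory at once.
    """
    quantized = {}
    for name in layer_names:
        weight = load_layer_weights(name)  # materialize one layer only
        quantized[name] = quantize_tensor_rtn(weight)
        del weight  # drop the local reference to the float weights
    return quantized


# Toy usage: random tensors stand in for per-layer checkpoint shards.
shards = {f"layer_{i}.weight": torch.randn(4, 4) for i in range(2)}
result = layer_wise_quantize(shards, lambda name: shards[name])
print({name: (q.dtype, float(s)) for name, (q, s) in result.items()})
```

Because only one layer is resident at a time, peak memory tracks the largest layer rather than the whole model.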

Currently, we support LWQ for `RTN`, `AutoRound`, and `GPTQ`.

Here, we take the `RTN` algorithm as an example to demonstrate the usage of LWQ.

```python
from neural_compressor.torch.quantization import RTNConfig, convert, prepare
from neural_compressor.torch import load_empty_model
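# The original snippet ends here in this diff; the lines below sketch how the
# layer-wise RTN flow is typically driven. Assumptions to verify against the
# released neural_compressor.torch API: load_empty_model() builds a model shell
# with empty weights from a checkpoint path or model name, and RTNConfig
# accepts a use_layer_wise flag.
model_state_dict_path = "/path/to/model/state/dict"  # hypothetical placeholder path
float_model = load_empty_model(model_state_dict_path)

quant_config = RTNConfig(use_layer_wise=True)  # assumed flag enabling LWQ

# Standard prepare/convert flow of the 3.x framework extension API.
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```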

- For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
+ For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).

- 2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage)
2. [Get Started](#get-started)

## Introduction

- For `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
+ For `RTN` and `GPTQ` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.

## Get Started

- ### Get Default Algorithm Configuration

Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.

```python
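# The body of this example falls between the two hunks shown in this diff, so
# the lines below are only a sketch of fetching a client-tuned default
# configuration. Assumption to verify against the released API:
# get_default_rtn_config() exists and accepts processor_type="client"
# (lightweight defaults) or "server".
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

model_state_dict_path = "/path/to/model/state/dict"  # hypothetical placeholder path
float_model = load_empty_model(model_state_dict_path)

quant_config = get_default_rtn_config(processor_type="client")
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```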

@@ -42,9 +37,4 @@ python main.py

> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the `OMP_NUM_THREADS` explicitly. For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.

- ### Optimal Performance and Peak Memory Usage
-
- Below are approximate performance and memory usage figures measured on a client machine with 24 cores and 32GB of RAM. These figures provide a rough estimate for quick reference and may vary based on specific hardware and configurations.
-
- - 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB.
- - 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)): the quantization process takes about 20 seconds, with a peak memory usage of around 5GB.
+ RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). However, for higher accuracy, the GPTQ algorithm is recommended, though be prepared for a longer quantization time.
0 commit comments