Commit a4221df

move float8 inference README contents to prototype section (#901)
* move float8 inference README contents to prototype section
* Update README.md
1 parent bd264f9 commit a4221df

1 file changed: +20 -16 lines changed

torchao/quantization/README.md

@@ -121,22 +121,6 @@ from torchao.quantization.quant_api import change_linear_weights_to_int8_dqtensors
 change_linear_weights_to_int8_dqtensors(model)
 ```

-#### A16W8 Float8 WeightOnly Quantization
-
-```python
-# for torch 2.5+
-from torchao.quantization import quantize_, float8_weight_only
-quantize_(model, float8_weight_only())
-```
-
-#### A16W8 Float8 Dynamic Quantization with Rowwise Scaling
-
-```python
-# for torch 2.5+
-from torchao.quantization.quant_api import quantize_, PerRow, float8_dynamic_activation_float8_weight
-quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))
-```
-
 #### A16W6 Floating Point WeightOnly Quantization

 ```python
@@ -303,6 +287,26 @@ You can try out these APIs with the `quantize_` API as above alongside the const
 ### Automatic Inductor Configuration
 The `quantize_` and `autoquant` APIs now automatically use our recommended inductor configuration settings. You can mimic the same settings for your own experiments by calling `torchao.quantization.utils.recommended_inductor_config_setter`. Alternatively, if you wish to disable these recommended settings, pass the keyword argument `set_inductor_config=False` to the `quantize_` or `autoquant` APIs to prevent assignment of those configuration settings. You can also overwrite these settings after they are assigned, as long as you do so before passing any inputs to the torch.compiled model. This means that previous flows which referenced a variety of inductor configurations that needed to be set are now outdated, though continuing to set those same configurations manually is unlikely to cause any issues.

+### (prototype) A16W8 Float8 WeightOnly Quantization
+
+```python
+# for torch 2.5+
+from torchao.quantization import quantize_, float8_weight_only
+quantize_(model, float8_weight_only())
+```
+
+This API works today but has not been extensively tested and benchmarked yet. Hardware with CUDA compute capability 8.9 or greater is required.
+
+### (prototype) A16W8 Float8 Dynamic Quantization with Rowwise Scaling
+
+```python
+# for torch 2.5+
+from torchao.quantization.quant_api import quantize_, PerRow, float8_dynamic_activation_float8_weight
+quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))
+```
+
+This API works today but has not been extensively tested and benchmarked yet. Hardware with CUDA compute capability 8.9 or greater is required.
+
 ## (To be moved to prototype) A16W4 WeightOnly Quantization with GPTQ

 ```python

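Both prototype float8 notes in the diff above call out the same hardware floor: CUDA compute capability 8.9 or greater. A minimal sketch of guarding the prototype path at runtime (assuming `model` is an `nn.Module`, as in the snippets being moved):

```python
import torch
from torchao.quantization import quantize_, float8_weight_only

# Compute capability is reported as a (major, minor) tuple, so a plain
# tuple comparison covers 8.9 (Ada) and anything newer (e.g. Hopper's 9.0).
if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
    quantize_(model, float8_weight_only())
```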
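The Automatic Inductor Configuration paragraph in the diff names two controls: the `torchao.quantization.utils.recommended_inductor_config_setter` utility and the `set_inductor_config` keyword argument. A minimal sketch of using them together, reusing `float8_weight_only` from the diff and assuming `model` is already defined:

```python
from torchao.quantization import quantize_, float8_weight_only
from torchao.quantization.utils import recommended_inductor_config_setter

# Apply torchao's recommended inductor settings explicitly...
recommended_inductor_config_setter()

# ...then stop quantize_ from assigning them again on its own.
quantize_(model, float8_weight_only(), set_inductor_config=False)
```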