- [Sept 25] MXFP8 training achieved a [1.28x speedup on Crusoe B200 cluster](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/) with a loss curve virtually identical to bfloat16!
- [Sept 19] [TorchAO Quantized Models and Quantization Recipes Now Available on Hugging Face Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)!
- [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!

<details>
<summary>Older news</summary>

- [May 25] QAT is now integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) for fine-tuning ([docs](https://docs.axolotl.ai/docs/qat.html))!
- [Apr 25] Float8 rowwise training yielded a [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2k H100 GPU scale
- [Apr 25] TorchAO was added as a [quantization backend to vLLM](https://docs.vllm.ai/en/latest/features/quantization/torchao.html) ([docs](https://docs.vllm.ai/en/latest/features/quantization/torchao.html))!
- [Mar 25] Our [2:4 Sparsity paper](https://openreview.net/pdf?id=O5feVk7p6Y) was accepted to SLLM @ ICLR 2025!
- [Jan 25] Our [integration with GemLite and SGLang](https://pytorch.org/blog/accelerating-llm-inference/) yielded 1.1-2x faster inference with int4 and float8 quantization across different batch sizes and tensor parallel sizes
- [Jan 25] We added [1-8 bit ARM CPU kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for linear and embedding ops
</details>

## 🌅 Overview

TorchAO is an easy-to-use quantization library for native PyTorch. It works out of the box with `torch.compile()` and `FSDP2` across most Hugging Face PyTorch models.

### Stable Workflows

| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
| -------- | ------ | ---------- | ------------------ | --- | ------------------- | ------------------- |
| H100, B200 GPUs | float8 rowwise | float8 rowwise | 🟢 stable [(link)](torchao/float8) | 🟢 stable [(link)](torchao/quantization/qat) | ⚪ not supported | 🟢 stable [(link)](torchao/quantization#a8w8-float8-dynamic-quantization-with-rowwise-scaling) |
| H100 GPUs | int4 | float8 rowwise | ⚪ not supported | 🟢 stable [(link)](torchao/quantization/qat) | ⚪ planned | 🟢 stable [(link)](https://github.com/pytorch/ao/blob/257d18ae1b41e8bd8d85849dd2bd43ad3885678e/torchao/quantization/quant_api.py#L1296) |
| A100 GPUs | int4 | bfloat16 | ⚪ not supported | 🟢 stable [(link)](torchao/quantization/qat) | 🟡 prototype: [HQQ](torchao/prototype/hqq/README.md), [AWQ](torchao/prototype/awq), [GPTQ](torchao/quantization/GPTQ) | 🟢 stable [(link)](torchao/quantization#a16w4-weightonly-quantization) |
| A100 GPUs | int8 | bfloat16 | ⚪ not supported | 🟢 stable [(link)](torchao/quantization/qat) | ⚪ not supported | 🟢 stable [(link)](torchao/quantization#a16w8-int8-weightonly-quantization) |
| A100 GPUs | int8 | int8 | 🟡 prototype [(link)](torchao/prototype/quantized_training) | 🟢 stable [(link)](torchao/quantization/qat) | ⚪ not supported | 🟢 stable [(link)](https://github.com/pytorch/ao/tree/main/torchao/quantization#a8w8-int8-dynamic-quantization) |
| edge | intx (1..7) | bfloat16 | ⚪ not supported | 🟢 stable [(link)](torchao/quantization/qat) | ⚪ not supported | 🟢 stable [(link)](https://github.com/pytorch/ao/blob/257d18ae1b41e8bd8d85849dd2bd43ad3885678e/torchao/quantization/quant_api.py#L2267) |
| edge | intx (1..7) | bfloat16 | ⚪ not supported | 🟢 stable [(link)](torchao/quantization/qat) | ⚪ not supported | 🟢 stable [(link)](https://github.com/pytorch/ao/blob/257d18ae1b41e8bd8d85849dd2bd43ad3885678e/torchao/quantization/quant_api.py#L702) |
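
As a rough illustration of how the inference column above is driven, the stable workflows are applied with `quantize_` plus a per-workflow config. This is a minimal sketch: `quantize_`, `Float8DynamicActivationFloat8WeightConfig`, `PerRow`, and `Int4WeightOnlyConfig` are taken from the linked quantization docs, so check those pages for the exact, current signatures.

```python
import torch

from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    Int4WeightOnlyConfig,
    PerRow,
    quantize_,
)

# Toy stand-in for a real model (e.g. one loaded from Hugging Face).
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# H100/B200 row: float8 rowwise dynamic activations + float8 rowwise weights.
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# A100 row (alternative): int4 weight-only with bfloat16 activations.
# quantize_(model, Int4WeightOnlyConfig(group_size=128))

# The quantized model is expected to compose with torch.compile.
model = torch.compile(model)
```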

### Prototype Workflows

| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
| -------- | ------ | ---------- | ------------------ | --- | ------------------- | ------------------- |
| B200, MI350x GPUs | mxfp8 | mxfp8 | 🟡 prototype [(dense)](torchao/prototype/mx_formats#mx-training), [(moe)](torchao/prototype/moe_training) | ⚪ not supported | ⚪ not supported | 🟡 prototype [(link)](torchao/prototype/mx_formats#mx-inference) |
| B200 GPUs | nvfp4 | nvfp4 | ⚪ planned | 🟡 prototype [(link)](torchao/prototype/qat/nvfp4.py) | ⚪ planned | 🟡 prototype [(link)](torchao/prototype/mx_formats#mx-inference) |
| B200, MI350x GPUs | mxfp4 | mxfp4 | ⚪ not supported | ⚪ planned | ⚪ planned | 🟡 early prototype [(link)](torchao/prototype/mx_formats#mx-inference) |
| H100 GPUs | float8 128x128 (blockwise) | float8 1x128 | ⚪ planned | ⚪ not supported | ⚪ not supported | 🟡 early prototype |
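
For the mxfp8 rows, the general shape of the prototype training flow is sketched below. Since this lives under `torchao/prototype/mx_formats`, treat the `MXLinearConfig` name and its fields as assumptions based on that README; prototype APIs can change between releases.

```python
import torch

from torchao.quantization import quantize_

# Assumed prototype entry point from torchao/prototype/mx_formats; the exact
# name and fields may differ in your installed version.
from torchao.prototype.mx_formats import MXLinearConfig

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# Swap nn.Linear modules for MX linears that cast matmul inputs to mxfp8
# (fp8 e4m3 elements with a shared scale per 32-element block) during training.
quantize_(model, MXLinearConfig(elem_dtype=torch.float8_e4m3fn, block_size=32))

# Training then proceeds as usual (forward/backward/optimizer on the converted model).
```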

### Other

* [Quantization-Aware Training (QAT) README](torchao/quantization/qat/README.md) (see the sketch below)
* [Post-Training Quantization (PTQ) README](torchao/quantization/README.md)
* [Sparsity README](torchao/sparsity/README.md), covering techniques such as 2:4 sparsity and block sparsity
* See [the prototype folder](torchao/prototype) for other prototype features
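
For the QAT column referenced in the tables above, the overall flow is prepare → train → convert. The sketch below assumes the `Int8DynActInt4WeightQATQuantizer` prepare/convert entry point described in the QAT README; that README also documents config-based APIs, so verify the exact class name for your version.

```python
import torch

from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda()

# Prepare: insert fake quantization (int8 dynamic activations, int4 weights)
# so training observes quantization error.
qat_quantizer = Int8DynActInt4WeightQATQuantizer()
model = qat_quantizer.prepare(model)

# ... fine-tune / train `model` as usual ...

# Convert: replace the fake-quantized modules with actually quantized ones
# for inference.
model = qat_quantizer.convert(model)
```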

Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!