Applies int8 dynamic symmetric per-token activation and int8 per-channel weight
quantization + 2:4 sparsity to linear layers.
"""
warnings.warn("""int8_dynamic_activation_int8_semi_sparse_weight() will be deprecated at a later release. Please use the layout_type kwarg in int8_dynamic_activation_int8_weight instead.
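For reference, a minimal sketch of the replacement the warning points to, assuming a CUDA-resident `model`; it mirrors the `layout_type` usage shown in the sparsity README below rather than any code in this diff:

```py
from torchao.quantization.quant_api import quantize_, int8_dynamic_activation_int8_weight
from torchao.dtypes import SemiSparseLayoutType

model = model.cuda()
# Instead of the deprecated int8_dynamic_activation_int8_semi_sparse_weight(),
# pass the semi-sparse layout to the int8 dynamic-activation / int8-weight config.
quantize_(model, int8_dynamic_activation_int8_weight(layout_type=SemiSparseLayoutType()))
```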
torchao/sparsity/README.md (65 additions, 31 deletions)
Sparsity is the technique of removing parameters from a neural network in order to reduce its memory overhead or latency. By carefully choosing how the elements are pruned, one can achieve significant reduction in memory overhead and latency, while paying a reasonably low or no price in terms of model quality (accuracy / f1).
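As a concrete illustration of what pruning means here (a plain-PyTorch sketch, not a torchao API; the helper name is made up), the snippet below applies a 2:4 mask to a linear layer's weight, keeping the two largest-magnitude values in every group of four:

```py
import torch

# Illustrative 2:4 mask: keep the 2 largest-magnitude entries in each contiguous
# group of 4 along the last dimension, zero out the rest.
def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    groups = weight.abs().reshape(-1, 4)
    keep = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(weight.shape)

linear = torch.nn.Linear(128, 128)
with torch.no_grad():
    linear.weight.mul_(two_four_mask(linear.weight))  # 50% of the weights are now zero
```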
## Benchmarks

### segment-anything-fast

We applied 2:4 sparsity to accelerate segment-anything, as part of [segment-anything-fast](https://github.com/pytorch-labs/segment-anything-fast). We were able to provide a **1.16x (22.7 -> 26.5 img/s) speedup over our dense baseline, while maintaining 97.5% (0.581 -> 0.567) of the evaluation accuracy (mIOU)**.

Overall, we found that accelerating the MLP linear layers (`lin1`, `lin2`) provided the most speedup while mitigating accuracy loss.
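A rough sketch of how one might restrict sparsity to just those MLP projections; this assumes `sparsify_` accepts a module filter callback the way `quantize_` does, and the `filter_fn` name and `lin1`/`lin2` matching are illustrative rather than taken from the benchmark code:

```py
import torch
from torchao.sparsity.sparse_api import sparsify_, semi_sparse_weight

# Only 2:4-sparsify the MLP projections (lin1 / lin2); leave the other linears dense.
def mlp_linear_only(module: torch.nn.Module, fqn: str) -> bool:
    return isinstance(module, torch.nn.Linear) and ("lin1" in fqn or "lin2" in fqn)

model = model.cuda()
sparsify_(model, semi_sparse_weight(), filter_fn=mlp_linear_only)
```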
These benchmarks were run for sam ViT-h on an NVIDIA A100-80GB; to reproduce them, please follow these [instructions](/torchao/_models/sam/README.md).
### LLama3

On Meta LLama3, we observe a 25% tok/s increase (180 -> 226) compared to our existing int4-wo implementation when using the sparse marlin kernel @Diogo-V added.

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |

### BERT

We were able to accelerate BERT 1.23x on an A100 with a negligible accuracy drop on SQuAD. For more information about accelerating BERT with semi-structured sparsity, please see our [tutorial](https://pytorch.org/tutorials/advanced/semi_structured_sparse.html?highlight=beta).

| Metrics | fp16 | 2:4 sparse | delta / speedup |
| --- | --- | --- | --- |
| Exact Match (%) | 78.53 | 78.44 | -0.09 |
| F1 (%) | 86.93 | 86.49 | -0.44 |
| Time (bs=16) | 19.35 | 15.74 | 1.23x |
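As a sketch of how this might look through the same quantization API, assuming the sparse marlin kernel is exposed as a layout type analogous to `SemiSparseLayoutType` (the `MarlinSparseLayoutType` name here is an assumption, not confirmed by this diff):

```py
from torchao.quantization.quant_api import quantize_, int4_weight_only
from torchao.dtypes import MarlinSparseLayoutType  # assumed name for the sparse marlin layout

model = model.cuda()
# int4 weight-only quantization routed through the 2:4 sparse marlin kernel (assumed API).
quantize_(model, int4_weight_only(layout_type=MarlinSparseLayoutType()))
```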
We support composing int8 dynamic quantization with 2:4 sparsity. We fuse one of the scalar dequant multiplications into our cuSPARSELt sparse mm in order to remain performant.

```py
from torchao.quantization.quant_api import quantize_, int8_dynamic_activation_int8_weight
from torchao.sparsity.sparse_api import sparsify_, semi_sparse_weight
from torchao.dtypes import SemiSparseLayoutType

model = model.cuda()

# Compose int8 dynamic activation / int8 weight quantization with 2:4 sparsity by
# passing the semi-sparse layout to the quantization API.
quantize_(model, int8_dynamic_activation_int8_weight(layout_type=SemiSparseLayoutType()))

# Or apply 2:4 (semi-structured) sparsity on its own, without quantization:
# sparsify_(model, semi_sparse_weight())
```
### Block sparsity (prototype)
We offer prototype support for accelerating block sparsity with our triton kernels for bfloat16/float16 workloads.

```py
from torchao.sparsity.sparse_api import sparsify_
from torchao.sparsity.prototype.superblock.blocksparse import block_sparse_weight

model = model.cuda()
sparsify_(model, block_sparse_weight())
```
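As with the 2:4 path, the block-sparsified model can then be compiled and run as usual; a minimal usage sketch, assuming an existing `model` and `example_input`:

```py
import torch

# Optional: compile the sparsified model so graph-level optimizations are applied on top
# of the block-sparse kernels; bfloat16/float16 inputs match the supported workloads above.
model = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    out = model(example_input)
```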
# Goal
We feel that the main problem current sparsity researchers / users face is fragmentation. Researchers rightfully aim to show end-to-end results, but this means a lot of time is spent figuring out how to integrate with PyTorch and implementation questions like:

- *When should I mask?*
- *When/how should I store the compressed representation?*
- *Do I want in-place or out-of-place mask updates?*
- *How can I call sparse matmul instead of dense?*

We feel that the above problems can be solved once by `torchao`, letting researchers focus on what really matters - pushing sparse kernel performance or designing more accurate pruning algorithms.

More concretely, we hope to provide tutorials and APIs for both sparse kernels (tensor subclassing) and pruning algorithms (torch.ao.pruning.Sparsifier) that users can extend. We aim to provide modular building blocks that can be used to accelerate not only inference but training as well, and that compose nicely with `torchao` quantization workflows. Specifically, we want to enable users to:

1. Train sparse models from scratch with hardware acceleration, with minimal accuracy loss.
2. Recover the accuracy loss of a pruned model with a custom pruning algorithm.
3. Accelerate masked/pruned models on sparsity-supported hardware to realize performance improvements.
## Design
Sparsity, like quantization, is an accuracy/performance trade-off, where we care not only about the speedup but also about the accuracy degradation of our architecture optimization technique.