@@ -24,10 +24,10 @@ Welcome to software-delivered AI.
This guide explains how to deploy YOLOv5 with Neural Magic's DeepSparse.
- DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to ONNX Runtime's baseline, DeepSparse offers a 3.7x speed-up at batch size 1 and a 5.8x speed-up at batch size 64 for YOLOv5s!
+ DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to the ONNX Runtime baseline, DeepSparse offers a 5.8x speed-up for YOLOv5s, running on the same machine!
<p align="center">
- <img width="60%" src="cluster-bar.png">
+ <img width="60%" src="performance-chart-5.8x.png">
</p>
For the first time, your deep learning workloads can meet the performance demands of production without the complexity and costs of hardware accelerators.
@@ -77,111 +77,21 @@ DeepSparse accepts a model in the ONNX format, passed either as:
- A SparseZoo stub which identifies an ONNX file in the SparseZoo
- A local path to an ONNX model in a filesystem
- The examples below will use the standard dense YOLOv5s and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:
+ The examples below use the standard dense and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:
```bash
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
- zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni # < pruned for VNNI machines
- ```
-
- ### Benchmark Performance
-
- We will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s, using DeepSparse's benchmarking script.
-
- The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).
-
- #### Batch 1 Performance Comparison
-
- ONNX Runtime achieves 49 images/sec with dense YOLOv5s.
-
- ```bash
- deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime
-
- > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
- > Batch Size: 1
- > Scenario: sync
- > Throughput (items/sec): 48.8549
- > Latency Mean (ms/batch): 20.4613
- > Latency Median (ms/batch): 20.4192
- ```
-
- DeepSparse achieves 135 items/sec with a pruned-quantized YOLOv5s, **a 2.8x performance gain over ONNX Runtime!**
-
- ```bash
- deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1
-
- > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
- > Batch Size: 1
- > Scenario: sync
- > Throughput (items/sec): 135.0647
- > Latency Mean (ms/batch): 7.3895
- > Latency Median (ms/batch): 7.2398
- ```
-
- Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4. DeepSparse achieves 180 items/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime!**
-
- ```bash
- deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1
-
- > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
- > Batch Size: 1
- > Scenario: sync
- > Throughput (items/sec): 179.6016
- > Latency Mean (ms/batch): 5.5615
- > Latency Median (ms/batch): 5.5458
- ```
-
- #### Batch 64 Performance Comparison
-
- In latency-insensitive scenarios with large batch sizes, DeepSparse's performance relative to ONNX Runtime is even stronger.
-
- ONNX Runtime achieves 42 images/sec with dense YOLOv5s:
-
- ```bash
- deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 64 -nstreams 1 -e onnxruntime
-
- > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
- > Batch Size: 64
- > Scenario: sync
- > Throughput (items/sec): 41.5560
- > Latency Mean (ms/batch): 1538.6640
- > Latency Median (ms/batch): 1538.0362
- ```
-
- DeepSparse achieves 239 images/sec with pruned-quantized YOLOv5s, a **5.8x performance improvement over ORT**!
-
- ```bash
- deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 64 -nstreams 1
-
- > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
- > Batch Size: 64
- > Scenario: sync
- > Throughput (items/sec): 239.0854
- > Latency Mean (ms/batch): 267.6703
- > Latency Median (ms/batch): 267.3194
```
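Each stub above is just a colon-prefixed, slash-delimited path into the SparseZoo. As a small illustration (pure Python; the field labels here are my own guesses, not an official schema), one can be split apart like this:

```python
# Split a SparseZoo stub into its components (field labels are my own guesses).
stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none"
domain, task, architecture, framework, repo, dataset, optimization = (
    stub.split(":", 1)[1].split("/")
)

print(architecture)   # yolov5-s
print(optimization)   # pruned65_quant-none
```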
### Deploy a Model
DeepSparse offers convenient APIs for integrating your model into an application.
- To try the deployment examples below, pull down a sample image for the example and save as `basilica.jpg` with the following command:
+ To try the deployment examples below, pull down a sample image and save it as `basilica.jpg` with the following:
```bash
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
```
- #### Annotate CLI
- You can also use the annotate command to have the engine save an annotated photo on disk. Try --source 0 to annotate your live webcam feed!
- ```bash
- deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94 --source basilica.jpg
- ```
-
- Running the above command will create an `annotation-results` folder and save the annotated image inside.
-
- <p align="center">
- <img src="https://github.com/neuralmagic/deepsparse/blob/d31f02596ebff2ec62761d0bc9ca14c4663e8858/src/deepsparse/yolo/sample_images/basilica-annotated.jpg" alt="annotated" width="60%" />
- </p>
-
#### Python API
`Pipelines` wrap pre-processing and output post-processing around the runtime, providing a clean interface for adding DeepSparse to an application.
@@ -239,6 +149,110 @@ bounding_boxes = annotations["boxes"]
labels = annotations["labels"]
```
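The `boxes` and `labels` lists can then be consumed with ordinary Python. A minimal sketch of confidence-threshold filtering (the detection data here is made up for illustration and is not real Pipeline output):

```python
# Hypothetical detections as (box, label, score) triples; not real Pipeline output.
detections = [
    ([10, 20, 110, 220], "person", 0.92),
    ([300, 40, 380, 160], "dog", 0.31),
    ([50, 60, 95, 140], "bicycle", 0.77),
]

def filter_by_confidence(dets, conf_thres=0.5):
    """Keep only detections whose confidence score meets the threshold."""
    return [d for d in dets if d[2] >= conf_thres]

kept = filter_by_confidence(detections)
print([label for _, label, _ in kept])  # ['person', 'bicycle']
```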
+ #### Annotate CLI
+ You can also use the annotate command to have the engine save an annotated photo on disk. Try --source 0 to annotate your live webcam feed!
+ ```bash
+ deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none --source basilica.jpg
+ ```
+
+ Running the above command will create an `annotation-results` folder and save the annotated image inside.
+
+ <p align="center">
+ <img src="https://github.com/neuralmagic/deepsparse/blob/d31f02596ebff2ec62761d0bc9ca14c4663e8858/src/deepsparse/yolo/sample_images/basilica-annotated.jpg" alt="annotated" width="60%" />
+ </p>
+
+ ## Benchmarking Performance
+
+ We will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s, using DeepSparse's benchmarking script.
+
+ The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).
+
+ ### Batch 32 Performance Comparison
+
+ #### ONNX Runtime Baseline
+
+ At batch 32, ONNX Runtime achieves 42 images/sec with the standard dense YOLOv5s:
+
+ ```bash
+ deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1 -e onnxruntime
+
+ > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+ > Batch Size: 32
+ > Scenario: sync
+ > Throughput (items/sec): 41.9025
+ ```
+
+ #### DeepSparse Dense Performance
+
+ While DeepSparse offers its best performance with optimized sparse models, it also performs well with the standard dense YOLOv5s.
+
+ At batch 32, DeepSparse achieves 70 images/sec with the standard dense YOLOv5s, a **1.7x performance improvement over ORT**!
+
+ ```bash
+ deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1
+
+ > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+ > Batch Size: 32
+ > Scenario: sync
+ > Throughput (items/sec): 69.5546
+ ```
+
+ #### DeepSparse Sparse Performance
+
+ When sparsity is applied to the model, DeepSparse's performance gains over ONNX Runtime are even stronger.
+
+ At batch 32, DeepSparse achieves 241 images/sec with the pruned-quantized YOLOv5s, a **5.8x performance improvement over ORT**!
+
+ ```bash
+ deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 32 -nstreams 1
+
+ > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
+ > Batch Size: 32
+ > Scenario: sync
+ > Throughput (items/sec): 241.2452
+ ```
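As a quick sanity check, the claimed multipliers can be recomputed from the throughputs reported in the benchmark logs above:

```python
# Throughputs (items/sec) reported by the batch-32 benchmark runs above.
ort_dense = 41.9025           # ONNX Runtime, dense YOLOv5s
deepsparse_dense = 69.5546    # DeepSparse, dense YOLOv5s
deepsparse_sparse = 241.2452  # DeepSparse, pruned-quantized YOLOv5s

print(round(deepsparse_dense / ort_dense, 1))   # 1.7
print(round(deepsparse_sparse / ort_dense, 1))  # 5.8
```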
+
+ ### Batch 1 Performance Comparison
+
+ DeepSparse also gains a speed-up over ONNX Runtime in the latency-sensitive, batch 1 scenario.
+
+ #### ONNX Runtime Baseline
+
+ At batch 1, ONNX Runtime achieves 48 images/sec with the standard dense YOLOv5s.
+
+ ```bash
+ deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime
+
+ > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+ > Batch Size: 1
+ > Scenario: sync
+ > Throughput (items/sec): 48.0921
+ ```
+
+ #### DeepSparse Sparse Performance
+
+ At batch 1, DeepSparse achieves 135 items/sec with a pruned-quantized YOLOv5s, **a 2.8x performance gain over ONNX Runtime!**
+
+ ```bash
+ deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1
+
+ > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
+ > Batch Size: 1
+ > Scenario: sync
+ > Throughput (items/sec): 134.9468
+ ```
+
+ Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4.
+
+ At batch 1, DeepSparse achieves 180 items/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime!**
+
+ ```bash
+ deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1
+
+ > Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
+ > Batch Size: 1
+ > Scenario: sync
+ > Throughput (items/sec): 179.7375
+ ```
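To illustrate what 4-block pruning means, here is a toy numpy sketch (entirely illustrative, not DeepSparse's actual pruning code): weights are scored and zeroed in contiguous groups of four, the alignment granularity that VNNI instructions can exploit.

```python
import numpy as np

# Toy weight vector; real YOLOv5s layers are far larger. Illustrative only.
rng = np.random.default_rng(0)
weights = rng.standard_normal(16)

# Score each contiguous block of 4 weights by its L2 norm.
blocks = weights.reshape(-1, 4)
norms = np.linalg.norm(blocks, axis=1)

# Zero out the weakest half of the blocks (50% sparsity, in blocks of 4).
mask = np.ones(len(blocks), dtype=bool)
mask[np.argsort(norms)[: len(blocks) // 2]] = False
pruned = (blocks * mask[:, None]).reshape(-1)

print(f"{(pruned == 0).mean():.0%} of weights zeroed, in aligned blocks of 4")
```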
+
## Get Started With DeepSparse
**Research or Testing?** DeepSparse Community is free for research and testing. Get started with our [Documentation](https://docs.neuralmagic.com/).