@@ -24,10 +24,10 @@ Welcome to software-delivered AI.

This guide explains how to deploy YOLOv5 with Neural Magic's DeepSparse.

-DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to ONNX Runtime's baseline, DeepSparse offers a 3.7x speed-up at batch size 1 and a 5.8x speed-up at batch size 64 for YOLOv5s!
+DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to the ONNX Runtime baseline, DeepSparse offers a 5.8x speed-up for YOLOv5s, running on the same machine!

<p align="center">
-  <img width="60%" src="cluster-bar.png">
+  <img width="60%" src="performance-chart-5.8x.png">
</p>

For the first time, your deep learning workloads can meet the performance demands of production without the complexity and costs of hardware accelerators.
@@ -77,111 +77,21 @@ DeepSparse accepts a model in the ONNX format, passed either as:
- A SparseZoo stub which identifies an ONNX file in the SparseZoo
- A local path to an ONNX model in a filesystem

-The examples below will use the standard dense YOLOv5s and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:
+The examples below use the standard dense and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:
```bash
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
-zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni # <-- pruned for VNNI machines
-```
-
-### Benchmark Performance
-
-We will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s, using DeepSparse's benchmarking script.
-
-The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).
-
-#### Batch 1 Performance Comparison
-
-ONNX Runtime achieves 49 images/sec with dense YOLOv5s.
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
-> Batch Size: 1
-> Scenario: sync
-> Throughput (items/sec): 48.8549
-> Latency Mean (ms/batch): 20.4613
-> Latency Median (ms/batch): 20.4192
-```
-
-DeepSparse achieves 135 items/sec with a pruned-quantized YOLOv5s, **a 2.8x performance gain over ONNX Runtime!**
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
-> Batch Size: 1
-> Scenario: sync
-> Throughput (items/sec): 135.0647
-> Latency Mean (ms/batch): 7.3895
-> Latency Median (ms/batch): 7.2398
-```
-
-Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4. DeepSparse achieves 180 items/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime!**
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
-> Batch Size: 1
-> Scenario: sync
-> Throughput (items/sec): 179.6016
-> Latency Mean (ms/batch): 5.5615
-> Latency Median (ms/batch): 5.5458
-```
-
-#### Batch 64 Performance Comparison
-
-In latency-insensitive scenarios with large batch sizes, DeepSparse's performance relative to ONNX Runtime is even stronger.
-
-ONNX Runtime achieves 42 images/sec with dense YOLOv5s:
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 64 -nstreams 1 -e onnxruntime
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
-> Batch Size: 64
-> Scenario: sync
-> Throughput (items/sec): 41.5560
-> Latency Mean (ms/batch): 1538.6640
-> Latency Median (ms/batch): 1538.0362
-```
-
-DeepSparse achieves 239 images/sec with pruned-quantized YOLOv5s, a **5.8x performance improvement over ORT**!
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 64 -nstreams 1
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
-> Batch Size: 64
-> Scenario: sync
-> Throughput (items/sec): 239.0854
-> Latency Mean (ms/batch): 267.6703
-> Latency Median (ms/batch): 267.3194
```
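
The stubs above follow a fixed path layout (domain, task, model, framework, source repo, dataset, optimization). As a quick illustration of reading one apart, the field names below are informal labels of my own, not an official SparseZoo API:

```python
# Decompose a SparseZoo stub into its path segments.
# The field names are informal labels for illustration, not an official API.
stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none"
parts = stub.split(":", 1)[1].split("/")
fields = ["domain", "task", "model", "framework", "repo", "dataset", "optimization"]
info = dict(zip(fields, parts))
print(info["model"])         # yolov5-s
print(info["optimization"])  # pruned65_quant-none
```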

### Deploy a Model

DeepSparse offers convenient APIs for integrating your model into an application.

-To try the deployment examples below, pull down a sample image for the example and save as `basilica.jpg` with the following command:
+To try the deployment examples below, pull down a sample image and save it as `basilica.jpg` with the following:
```bash
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
```

-#### Annotate CLI
-You can also use the annotate command to have the engine save an annotated photo on disk. Try `--source 0` to annotate your live webcam feed!
-```bash
-deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94 --source basilica.jpg
-```
-
-Running the above command will create an `annotation-results` folder and save the annotated image inside.
-
-<p align="center">
-<img src="https://github.com/neuralmagic/deepsparse/blob/d31f02596ebff2ec62761d0bc9ca14c4663e8858/src/deepsparse/yolo/sample_images/basilica-annotated.jpg" alt="annotated" width="60%" />
-</p>
-
#### Python API

`Pipelines` wrap pre-processing and output post-processing around the runtime, providing a clean interface for adding DeepSparse to an application.
@@ -239,6 +149,110 @@ bounding_boxes = annotations["boxes"]
labels = annotations["labels"]
```

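As an illustrative sketch of consuming those outputs (the values below are made up, not real pipeline results), each image's boxes and labels come back as parallel lists that can be zipped per detection:

```python
# Hypothetical outputs shaped like the pipeline annotations above:
# one entry per image; each box is [x0, y0, x1, y1] in pixels.
bounding_boxes = [[[10.0, 20.0, 110.0, 220.0], [50.0, 60.0, 90.0, 180.0]]]
labels = [["person", "bicycle"]]

# Pair each box with its label for the first image.
detections = list(zip(bounding_boxes[0], labels[0]))
for box, label in detections:
    print(label, box)
```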
+#### Annotate CLI
+You can also use the annotate command to have the engine save an annotated photo on disk. Try `--source 0` to annotate your live webcam feed!
+```bash
+deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none --source basilica.jpg
+```
+
+Running the above command will create an `annotation-results` folder and save the annotated image inside.
+
+<p align="center">
+<img src="https://github.com/neuralmagic/deepsparse/blob/d31f02596ebff2ec62761d0bc9ca14c4663e8858/src/deepsparse/yolo/sample_images/basilica-annotated.jpg" alt="annotated" width="60%" />
+</p>
+
+## Benchmarking Performance
+
+We will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s, using DeepSparse's benchmarking script.
+
+The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).
+
+### Batch 32 Performance Comparison
+
+#### ONNX Runtime Baseline
+
+At batch 32, ONNX Runtime achieves 42 images/sec with the standard dense YOLOv5s:
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1 -e onnxruntime
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+> Batch Size: 32
+> Scenario: sync
+> Throughput (items/sec): 41.9025
+```
+
+#### DeepSparse Dense Performance
+
+While DeepSparse offers its best performance with optimized sparse models, it also performs well with the standard dense YOLOv5s.
+
+At batch 32, DeepSparse achieves 70 images/sec with the standard dense YOLOv5s, a **1.7x performance improvement over ORT**!
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+> Batch Size: 32
+> Scenario: sync
+> Throughput (items/sec): 69.5546
+```
+
+#### DeepSparse Sparse Performance
+
+When sparsity is applied to the model, DeepSparse's performance gains over ONNX Runtime are even stronger.
+
+At batch 32, DeepSparse achieves 241 images/sec with the pruned-quantized YOLOv5s, a **5.8x performance improvement over ORT**!
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 32 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
+> Batch Size: 32
+> Scenario: sync
+> Throughput (items/sec): 241.2452
+```
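
The quoted speed-ups are simply throughput ratios; a quick sanity check of the batch-32 numbers above:

```python
# Throughput figures (items/sec) copied from the benchmark output above.
ort_dense = 41.9025   # ONNX Runtime, dense YOLOv5s
ds_dense = 69.5546    # DeepSparse, dense YOLOv5s
ds_sparse = 241.2452  # DeepSparse, pruned-quantized YOLOv5s

dense_speedup = round(ds_dense / ort_dense, 1)
sparse_speedup = round(ds_sparse / ort_dense, 1)
print(dense_speedup)   # 1.7
print(sparse_speedup)  # 5.8
```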
+
+### Batch 1 Performance Comparison
+
+DeepSparse is also able to gain a speed-up over ONNX Runtime in the latency-sensitive, batch 1 scenario.
+
+#### ONNX Runtime Baseline
+
+At batch 1, ONNX Runtime achieves 48 images/sec with the standard dense YOLOv5s:
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+> Batch Size: 1
+> Scenario: sync
+> Throughput (items/sec): 48.0921
+```
+
+#### DeepSparse Sparse Performance
+
+At batch 1, DeepSparse achieves 135 items/sec with a pruned-quantized YOLOv5s, **a 2.8x performance gain over ONNX Runtime!**
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
+> Batch Size: 1
+> Scenario: sync
+> Throughput (items/sec): 134.9468
+```
+
+Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4.
+
+At batch 1, DeepSparse achieves 180 items/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime!**
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
+> Batch Size: 1
+> Scenario: sync
+> Throughput (items/sec): 179.7375
+```
+
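The batch-1 gains follow from the same throughput ratios, and since each batch is a single image, throughput also converts directly to per-image latency:

```python
# Batch-1 throughput figures (items/sec) from the runs above.
ort = 48.0921         # ONNX Runtime, dense YOLOv5s
ds_sparse = 134.9468  # DeepSparse, pruned-quantized
ds_vnni = 179.7375    # DeepSparse, 4-block pruned-quantized (VNNI)

print(round(ds_sparse / ort, 1))  # 2.8
print(round(ds_vnni / ort, 1))    # 3.7

# At batch 1, per-image latency (ms) is 1000 / throughput.
latency_ms = round(1000 / ds_vnni, 2)
print(latency_ms)                 # 5.56
```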
## Get Started With DeepSparse

**Research or Testing?** DeepSparse Community is free for research and testing. Get started with our [Documentation](https://docs.neuralmagic.com/).