diff --git a/examples/ultralytics-yolo/ultralytics-readmes/cluster-bar.png b/examples/ultralytics-yolo/ultralytics-readmes/cluster-bar.png
deleted file mode 100644
index d4c2ba69df..0000000000
Binary files a/examples/ultralytics-yolo/ultralytics-readmes/cluster-bar.png and /dev/null differ
diff --git a/examples/ultralytics-yolo/ultralytics-readmes/deepsparse-readme.md b/examples/ultralytics-yolo/ultralytics-readmes/deepsparse-readme.md
index 890eae88b5..1a69271fb5 100644
--- a/examples/ultralytics-yolo/ultralytics-readmes/deepsparse-readme.md
+++ b/examples/ultralytics-yolo/ultralytics-readmes/deepsparse-readme.md
@@ -24,10 +24,10 @@
 Welcome to software-delivered AI.
 
 This guide explains how to deploy YOLOv5 with Neural Magic's DeepSparse.
 
-DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to ONNX Runtime's baseline, DeepSparse offers a 3.7x speed-up at batch size 1 and a 5.8x speed-up at batch size 64 for YOLOv5s!
+DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to the ONNX Runtime baseline, DeepSparse offers a 5.8x speed-up for YOLOv5s, running on the same machine!

-
+

 For the first time, your deep learning workloads can meet the performance demands of production without the complexity and costs of hardware accelerators.
@@ -77,111 +77,21 @@
 DeepSparse accepts a model in the ONNX format, passed either as:
 - A SparseZoo stub which identifies an ONNX file in the SparseZoo
 - A local path to an ONNX model in a filesystem
 
-The examples below will use the standard dense YOLOv5s and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:
+The examples below use the standard dense and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:
 
 ```bash
 zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
 zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
-zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni # < pruned for VNNI machines
-```
-
-### Benchmark Performance
-
-We will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s, using DeepSparse's benchmarking script.
-
-The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).
-
-#### Batch 1 Performance Comparison
-
-ONNX Runtime achieves 49 images/sec with dense YOLOv5s.
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
-> Batch Size: 1
-> Scenario: sync
-> Throughput (items/sec): 48.8549
-> Latency Mean (ms/batch): 20.4613
-> Latency Median (ms/batch): 20.4192
-```
-
-DeepSparse achieves 135 items/sec with a pruned-quantized YOLOv5s, **a 2.8x performance gain over ONNX Runtime!**
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
-> Batch Size: 1
-> Scenario: sync
-> Throughput (items/sec): 135.0647
-> Latency Mean (ms/batch): 7.3895
-> Latency Median (ms/batch): 7.2398
-```
-
-Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4. DeepSparse achieves 180 items/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime!**
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
-> Batch Size: 1
-> Scenario: sync
-> Throughput (items/sec): 179.6016
-> Latency Mean (ms/batch): 5.5615
-> Latency Median (ms/batch): 5.5458
-```
-
-#### Batch 64 Performance Comparison
-
-In latency-insensitive scenarios with large batch sizes, DeepSparse's performance relative to ONNX Runtime is even stronger.
-
-ONNX Runtime achieves 42 images/sec with dense YOLOv5s:
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 64 -nstreams 1 -e onnxruntime
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
-> Batch Size: 64
-> Scenario: sync
-> Throughput (items/sec): 41.5560
-> Latency Mean (ms/batch): 1538.6640
-> Latency Median (ms/batch): 1538.0362
-```
-
-DeepSparse achieves 239 images/sec with pruned-quantized YOLOv5s, a **5.8x performance improvement over ORT**!
-
-```bash
-deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 64 -nstreams 1
-
-> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
-> Batch Size: 64
-> Scenario: sync
-> Throughput (items/sec): 239.0854
-> Latency Mean (ms/batch): 267.6703
-> Latency Median (ms/batch): 267.3194
 ```
 
 ### Deploy a Model
 
 DeepSparse offers convenient APIs for integrating your model into an application.
 
-To try the deployment examples below, pull down a sample image for the example and save as `basilica.jpg` with the following command:
+To try the deployment examples below, pull down a sample image and save it as `basilica.jpg` with the following command:
 
 ```bash
 wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
 ```
 
-#### Annotate CLI
-You can also use the annotate command to have the engine save an annotated photo on disk. Try --source 0 to annotate your live webcam feed!
-```bash
-deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94 --source basilica.jpg
-```
-
-Running the above command will create an `annotation-results` folder and save the annotated image inside.
-

-annotated -

-
 #### Python API
 
 `Pipelines` wrap pre-processing and output post-processing around the runtime, providing a clean interface for adding DeepSparse to an application.
@@ -239,6 +149,110 @@
 bounding_boxes = annotations["boxes"]
 labels = annotations["labels"]
 ```
 
+#### Annotate CLI
+
+You can also use the annotate command to have the engine save an annotated photo on disk. Try `--source 0` to annotate your live webcam feed!
+
+```bash
+deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none --source basilica.jpg
+```
+
+Running the above command will create an `annotation-results` folder and save the annotated image inside.
+

+annotated +

+
+## Benchmarking Performance
+
+We will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s, using DeepSparse's benchmarking script.
+
+The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).
+
+### Batch 32 Performance Comparison
+
+#### ONNX Runtime Baseline
+
+At batch 32, ONNX Runtime achieves 42 images/sec with the standard dense YOLOv5s:
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1 -e onnxruntime
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+> Batch Size: 32
+> Scenario: sync
+> Throughput (items/sec): 41.9025
+```
+
+#### DeepSparse Dense Performance
+
+While DeepSparse offers its best performance with optimized sparse models, it also performs well with the standard dense YOLOv5s.
+
+At batch 32, DeepSparse achieves 70 images/sec with the standard dense YOLOv5s, a **1.7x performance improvement over ORT**!
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+> Batch Size: 32
+> Scenario: sync
+> Throughput (items/sec): 69.5546
+```
+
+#### DeepSparse Sparse Performance
+
+When sparsity is applied to the model, DeepSparse's performance gains over ONNX Runtime are even stronger.
+
+At batch 32, DeepSparse achieves 241 images/sec with the pruned-quantized YOLOv5s, a **5.8x performance improvement over ORT**!
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 32 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
+> Batch Size: 32
+> Scenario: sync
+> Throughput (items/sec): 241.2452
+```
+
+### Batch 1 Performance Comparison
+
+DeepSparse also delivers a speed-up over ONNX Runtime in the latency-sensitive, batch 1 scenario.
+
+#### ONNX Runtime Baseline
+
+At batch 1, ONNX Runtime achieves 48 images/sec with the standard dense YOLOv5s:
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
+> Batch Size: 1
+> Scenario: sync
+> Throughput (items/sec): 48.0921
+```
+
+#### DeepSparse Sparse Performance
+
+At batch 1, DeepSparse achieves 135 items/sec with a pruned-quantized YOLOv5s, **a 2.8x performance gain over ONNX Runtime!**
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
+> Batch Size: 1
+> Scenario: sync
+> Throughput (items/sec): 134.9468
+```
+
+Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4.
+
+At batch 1, DeepSparse achieves 180 items/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime!**
+
+```bash
+deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1
+
+> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
+> Batch Size: 1
+> Scenario: sync
+> Throughput (items/sec): 179.7375
+```
+
 ## Get Started With DeepSparse
 
 **Research or Testing?** DeepSparse Community is free for research and testing. Get started with our [Documentation](https://docs.neuralmagic.com/).
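The multipliers quoted in the rewritten benchmark copy follow directly from the reported throughputs. A quick sanity-check sketch of that arithmetic, using the items/sec figures from the outputs above (the variable names are mine, for illustration only):

```python
# Throughput (items/sec) figures reported in the benchmark outputs above.
ort_b32, ds_dense_b32, ds_sparse_b32 = 41.9025, 69.5546, 241.2452
ort_b1, ds_sparse_b1, ds_vnni_b1 = 48.0921, 134.9468, 179.7375

# Each speed-up is simply DeepSparse throughput over the ONNX Runtime
# baseline at the same batch size.
speedups = {
    "dense vs ORT, batch 32": ds_dense_b32 / ort_b32,
    "sparse vs ORT, batch 32": ds_sparse_b32 / ort_b32,
    "sparse vs ORT, batch 1": ds_sparse_b1 / ort_b1,
    "4-block VNNI vs ORT, batch 1": ds_vnni_b1 / ort_b1,
}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")  # 1.7x, 5.8x, 2.8x, 3.7x respectively
```

This confirms the 1.7x, 5.8x, 2.8x, and 3.7x figures claimed in the README text.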
diff --git a/examples/ultralytics-yolo/ultralytics-readmes/performance-chart-5.8x.png b/examples/ultralytics-yolo/ultralytics-readmes/performance-chart-5.8x.png
new file mode 100644
index 0000000000..9ef5ebc1d4
Binary files /dev/null and b/examples/ultralytics-yolo/ultralytics-readmes/performance-chart-5.8x.png differ
diff --git a/examples/ultralytics-yolo/ultralytics-readmes/sparse-network.svg b/examples/ultralytics-yolo/ultralytics-readmes/sparse-network.svg
deleted file mode 100644
index c45f5bd433..0000000000
--- a/examples/ultralytics-yolo/ultralytics-readmes/sparse-network.svg
+++ /dev/null
@@ -1 +0,0 @@
-[single-line SVG, markup stripped; figure text: "figure2c — Hardware Accelerators (GPU, TPU, etc.) vs. CPU: Caches, Cores, Write/Read; Tensor Column Depth-wise Execution Algorithm"]
\ No newline at end of file
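If you want to collect the benchmark numbers shown in the hunks above programmatically, the `> Throughput (items/sec): …` lines printed by `deepsparse.benchmark` can be scraped with a small sketch like this. The `parse_throughput` helper is my own, not part of the DeepSparse CLI; it assumes only the output format shown above:

```python
import re

def parse_throughput(output: str) -> float:
    """Pull the items/sec figure out of deepsparse.benchmark console output."""
    match = re.search(r"Throughput \(items/sec\):\s*([\d.]+)", output)
    if match is None:
        raise ValueError("no throughput line found in benchmark output")
    return float(match.group(1))

# Sample taken verbatim from the batch 32 sparse run above.
sample = """\
> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
> Batch Size: 32
> Scenario: sync
> Throughput (items/sec): 241.2452
"""
print(parse_throughput(sample))  # → 241.2452
```

A helper like this makes it easy to diff throughput across runtimes or batch sizes without copying numbers by hand.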