YOLOv3 on a Laptop Example

Sparsifying YOLOv3 (or any other model) involves removing redundant information from a neural network using algorithms such as pruning and quantization, among others. This sparsification process results in many benefits for deployment environments, including faster inference and smaller file sizes. Unfortunately, many practitioners have not realized these benefits because the process is complicated and involves a large number of hyperparameters.
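
To make the two techniques concrete, here is a minimal sketch (not the actual YOLOv3 pipeline described below) that applies unstructured magnitude pruning and dynamic INT8 quantization to a toy network using PyTorch's built-in utilities; the layer sizes and the 80% sparsity level are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy two-layer network standing in for a real detection model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Pruning: zero out the 80% of weights with the smallest magnitude
# in each Linear layer (unstructured magnitude pruning).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # bake the zeros into the tensor

# Quantization: store weights as INT8 for smaller files and faster inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Layer-0 sparsity: {sparsity:.0%}")
```

Note that zeroed weights only turn into speedups on runtimes that actually exploit sparsity, which is the gap the CPU engine discussed below is built to close.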

To simplify the process, Neural Magic’s ML team created recipes encoding the hyperparameters and instructions needed to create highly accurate pruned and pruned-quantized YOLOv3 models. …
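
As a hedged illustration of what applying a recipe looks like in code, the sketch below follows SparseML's documented recipe-driven training pattern; the stand-in model, optimizer, "recipe.yaml" path, and steps_per_epoch value are placeholder assumptions, and in the actual YOLOv3 workflow the model and training loop come from the Ultralytics framework instead:

```python
import torch
import torch.nn as nn
from sparseml.pytorch.optim import ScheduledModifierManager

# Stand-in model and optimizer; in this post's workflow both come from
# the Ultralytics YOLOv3 training framework instead.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# "recipe.yaml" is a placeholder path for a downloaded sparsification
# recipe encoding the pruning/quantization hyperparameters and schedule.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=100)

# ... the ordinary training loop runs here; the recipe's modifiers
# prune (and optionally quantize) the model on schedule ...

manager.finalize(model)  # remove the manager's hooks once training completes
```

The appeal of this pattern is that the training code stays untouched: all sparsification decisions live in the recipe file, so swapping a pruned recipe for a pruned-quantized one requires no code changes.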


Use CPUs to decrease costs and increase deployment flexibility while still achieving GPU-class performance.

In this post, we elaborate on how we used state-of-the-art pruning and quantization techniques to improve the performance of YOLOv3 on CPUs. We’ll show that by leveraging the robust YOLO training framework from Ultralytics with SparseML’s sparsification recipes, it is easy to create highly pruned and INT8-quantized YOLO models that deliver more than a 6x increase in performance over state-of-the-art PyTorch and ONNX Runtime CPU implementations. Lastly, we’ll show that model sparsification (pruning and quantization) doesn’t have to be a hard and daunting task when using Neural Magic’s open-source tools and recipe-driven approaches.
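
For a sense of how a batch-size-1 CPU baseline like the ONNX Runtime numbers above can be measured, here is a hedged sketch; the "yolov3.onnx" path, the 640x640 input shape (the Ultralytics default), and the iteration counts are assumptions rather than the exact harness behind Figure 1:

```python
import time
import numpy as np
import onnxruntime as ort

# "yolov3.onnx" is a placeholder for an exported (pruned and/or
# quantized) model file.
session = ort.InferenceSession("yolov3.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Batch-size-1 input for real-time (latency-bound) inference.
x = np.random.rand(1, 3, 640, 640).astype(np.float32)

for _ in range(10):  # warm-up runs
    session.run(None, {input_name: x})

iters = 100
start = time.perf_counter()
for _ in range(iters):
    session.run(None, {input_name: x})
elapsed = time.perf_counter() - start

print(f"Average latency: {1000 * elapsed / iters:.1f} ms "
      f"({iters / elapsed:.1f} items/sec)")
```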

Figure 1: Real-time performance of YOLOv3 (batch size 1) for different CPU implementations compared to common GPU benchmarks.
* ONNX Runtime currently has a known performance regression for quantized YOLOv3 on larger core counts.


While mapping the neural connections in the brain at MIT, Neural Magic’s founders Nir Shavit and Alexander Matveev were frustrated with the many limitations imposed by GPUs. Along the way, they stopped to ask themselves a simple question: why is a GPU, or any specialized [and expensive] hardware, required for deep learning?

They knew there had to be a better way. After all, the human brain addresses the computational needs of neural networks by extensively using sparsity to reduce them instead of adding FLOPS to match them. If our brains processed information the same way today’s hardware accelerators consume computing…


In this post, we elaborate on how we measured, on commodity cloud hardware, the throughput and latency of five ResNet-50 v1 models optimized for CPU inference. By the end of the post, you should be able to reproduce these benchmarks using tools available in the Neural Magic GitHub repo.
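
As a rough sketch of such a measurement, the snippet below times batch-64 ResNet-50 inference with the DeepSparse engine; the compile_model/run calls follow the engine's early Python API (consult the Neural Magic GitHub repo for the current interface), and the model path, input shape, and iteration counts are placeholder assumptions:

```python
import time
import numpy as np
from deepsparse import compile_model

# "resnet50.onnx" is a placeholder path to a sparsified ResNet-50 v1 file.
batch_size = 64
engine = compile_model("resnet50.onnx", batch_size=batch_size)

# ResNet-50 expects 224x224 RGB inputs; the engine takes a list of arrays.
x = [np.random.rand(batch_size, 3, 224, 224).astype(np.float32)]

for _ in range(5):  # warm-up runs
    engine.run(x)

iters = 50
start = time.perf_counter()
for _ in range(iters):
    engine.run(x)
elapsed = time.perf_counter() - start

print(f"Throughput: {iters * batch_size / elapsed:.1f} images/sec")
```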

Figure: ResNet-50 v1 throughput at batch size 64 on an AWS c5.12xlarge CPU instance.
