Sparsifying YOLOv3 (or any other model) means removing redundant information from the neural network using algorithms such as pruning and quantization, among others. Sparsification brings several benefits in deployment, including faster inference and smaller file sizes. Unfortunately, many teams have not realized these benefits because the process is complicated and involves a large number of hyperparameters.
To simplify the process, Neural Magic’s ML team created recipes that encode the hyperparameters and instructions needed to create highly accurate pruned and pruned-quantized YOLOv3 models. …
In this post, we elaborate on how we used state-of-the-art pruning and quantization techniques to improve the performance of YOLOv3 on CPUs. We’ll show that by combining the robust YOLO training framework from Ultralytics with SparseML’s sparsification recipes, it is easy to create highly pruned and INT8-quantized YOLO models that deliver more than a 6x increase in performance over state-of-the-art PyTorch and ONNX Runtime CPU implementations. Lastly, we’ll show that model sparsification (pruning and quantization) doesn’t have to be a hard and daunting task when using Neural Magic’s open-source tools and recipe-driven approaches.
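To make the two techniques named above concrete, here is a minimal sketch of unstructured magnitude pruning followed by symmetric INT8 quantization on a plain list of weights. It is an illustration only, not SparseML's implementation or the recipe format: the function names (`magnitude_prune`, `quantize_int8`, `dequantize`) and the sample weights are hypothetical, and real recipes apply these steps gradually during training rather than in one shot.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 values back to approximate floating-point weights."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -0.75, 0.10, 1.27, -0.03, 0.40, -0.01, 0.55]
pruned = magnitude_prune(weights, sparsity=0.5)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)
```

Pruning removes the half of the weights closest to zero, and quantization stores the survivors as 8-bit integers plus a single scale factor. Zeros stay exactly zero after quantization, which is what lets a sparsity-aware runtime skip them.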
While mapping the neural connections in the brain at MIT, Neural Magic’s founders Nir Shavit and Alexander Matveev were frustrated with the many limitations imposed by GPUs. Along the way, they stopped to ask themselves a simple question: why is a GPU, or any specialized [and expensive] hardware, required for deep learning?
They knew there had to be a better way. After all, the human brain addresses the computational needs of neural networks by extensively using sparsity to reduce them instead of adding FLOPS to match them. If our brains processed information the same way today’s hardware accelerators consume computing…
In this post, we elaborate on how we measured, on commodity cloud hardware, the throughput and latency of five ResNet-50 v1 models optimized for CPU inference. By the end of the post, you should be able to reproduce these benchmarks using tools available in the Neural Magic GitHub repo.
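As a rough illustration of what measuring latency and throughput involves, the sketch below times repeated calls to an inference function after a few warm-up runs. It is not the benchmarking tool from the Neural Magic repo: `benchmark`, `fake_model`, and all parameter values are hypothetical stand-ins, and a real measurement would call an actual ResNet-50 engine in place of `fake_model`.

```python
import statistics
import time


def benchmark(fn, batch, batch_size, warmup=5, iterations=50):
    """Time repeated calls to fn(batch); report latency and throughput."""
    # Warm-up runs are excluded so one-time setup costs do not skew the results.
    for _ in range(warmup):
        fn(batch)

    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn(batch)
        latencies.append(time.perf_counter() - start)

    total = sum(latencies)
    return {
        "mean_latency_ms": statistics.mean(latencies) * 1000,
        "median_latency_ms": statistics.median(latencies) * 1000,
        # Items processed per second across all timed iterations.
        "throughput_items_per_s": (
            batch_size * iterations / total if total > 0 else float("inf")
        ),
    }


def fake_model(batch):
    """Hypothetical stand-in for a model: sums each input vector."""
    return [sum(x) for x in batch]


batch_size = 64
batch = [[float(i)] * 1000 for i in range(batch_size)]
results = benchmark(fake_model, batch, batch_size)
```

Latency is reported per call, so it grows with batch size, while throughput counts items per second. That is why benchmarks usually report both figures at several batch sizes.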