Paper Title
Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime
Paper Authors
Paper Abstract
Deep Learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models comes at the expense of high computational, storage, and power requirements. Sensing the immediate need to accelerate and compress these models for better on-device performance, we introduce Deeplite Neutrino for production-ready model optimization and Deeplite Runtime for deploying ultra-low bit quantized models on Arm-based platforms. We implement low-level quantization kernels for the Armv7 and Armv8 architectures, enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2x and 2.2x over TensorFlow Lite with the XNNPACK backend on classification and detection models, respectively. We also achieve significant speedups of up to 5x and 3.2x over ONNX Runtime for classification and detection models, respectively.
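The abstract does not show the kernels themselves, but the standard trick behind ultra-low bit inference on Arm is to pack 1-bit values into bytes and replace multiply-accumulate with XOR plus population count, vectorized with NEON. Below is a minimal, hypothetical sketch of such a binary dot-product micro-kernel in C with NEON intrinsics; `binary_dot_neon` and its packing convention are illustrative assumptions, not Deeplite's actual implementation. Every intrinsic used (`vcntq_u8`, `vpaddlq_u8`, `vpadalq_u16`, ...) is available on both Armv7 NEON and AArch64, matching the 32-/64-bit targets the paper names.

```c
/*
 * Hypothetical 1-bit (binary) dot-product micro-kernel of the kind an
 * ultra-low-bit Arm backend might use; a sketch, not Deeplite's code.
 * Elements are +1/-1, packed 8 per byte. Over K packed elements the
 * signed dot product is K - 2 * popcount(a XOR b), since the XOR
 * popcount counts the mismatching positions.
 * Build with NEON enabled, e.g. -mfpu=neon on Armv7 or any AArch64 gcc.
 */
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Dot product of two bit-packed +1/-1 vectors of n_bytes * 8 elements.
   For brevity, n_bytes is assumed to be a multiple of 16. */
int32_t binary_dot_neon(const uint8_t *a, const uint8_t *b, size_t n_bytes) {
    uint32x4_t acc = vdupq_n_u32(0);
    for (size_t i = 0; i < n_bytes; i += 16) {
        uint8x16_t va   = vld1q_u8(a + i);
        uint8x16_t vb   = vld1q_u8(b + i);
        /* per-byte popcount of the mismatching bits */
        uint8x16_t diff = vcntq_u8(veorq_u8(va, vb));
        /* widen and accumulate: u8 -> u16 (pairwise), then into u32 lanes */
        acc = vpadalq_u16(acc, vpaddlq_u8(diff));
    }
    uint32_t mismatches = vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1)
                        + vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
    int32_t k = (int32_t)(n_bytes * 8);
    return k - 2 * (int32_t)mismatches; /* matches minus mismatches */
}
```

A production kernel built around this inner loop would add the remaining optimizations the abstract lists: tiling the weight matrix into cache-resident blocks so each packed tile is reused across output rows, and parallelizing the independent output tiles across cores.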