Paper Title
An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs
Authors
Abstract
Deep Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in a wide range of applications. However, complex Artificial Intelligence (AI) tasks typically require deeper CNN models, which are computation-intensive. Although recent progress on network compression such as pruning has emerged as a promising direction for mitigating the computational burden, existing accelerators still cannot fully exploit the benefits of sparsity owing to the irregularity caused by pruning. Meanwhile, Field-Programmable Gate Arrays (FPGAs) are regarded as a promising hardware platform for CNN inference acceleration, yet most existing FPGA accelerators target dense CNNs and cannot address the irregularity problem. In this paper, we propose a sparse-wise dataflow that skips the cycles of processing Multiply-and-Accumulate (MAC) operations with zero weights and exploits data statistics to minimize energy through zero gating, avoiding unnecessary computations. The proposed sparse-wise dataflow leads to low bandwidth requirements and high data sharing. We then design an FPGA accelerator containing a Vector Generator Module (VGM), which matches the indices between sparse weights and input activations according to the proposed dataflow. Experimental results demonstrate that our implementation achieves 987 images/s for AlexNet and 48 images/s for VGG-16 on the Xilinx ZCU102, providing 1.5x to 6.7x speedup and 2.0x to 6.2x better energy efficiency over previous CNN FPGA accelerators.
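The core idea of the abstract — storing only nonzero weights with their indices, matching them to the corresponding input activations, and skipping MAC cycles for zero weights — can be illustrated with a minimal software sketch. This is a hypothetical illustration, not the paper's hardware design: the function names are invented, and the simple index gather below merely stands in for the role the Vector Generator Module (VGM) plays in the accelerator.

```python
# Illustrative sketch (assumption, not the paper's implementation) of a
# sparse-wise dataflow: weights are kept in a compressed (value, index)
# form, so only nonzero weights trigger a multiply-accumulate, and the
# index is used to fetch the matching input activation.

def compress_weights(weights):
    """Keep only nonzero weights together with their original positions."""
    return [(w, i) for i, w in enumerate(weights) if w != 0]

def sparse_dot(compressed, activations):
    """Accumulate products only for nonzero weights; zero-weight MAC
    cycles are skipped entirely because they never appear in the list."""
    acc = 0
    for w, i in compressed:
        acc += w * activations[i]   # index match: weight at i <-> activation at i
    return acc

dense_w = [0, 3, 0, 0, -2, 0, 1, 0]   # pruned (sparse) weight vector
acts    = [5, 1, 4, 2, 7, 9, 6, 8]    # input activations

cw = compress_weights(dense_w)        # [(3, 1), (-2, 4), (1, 6)]
result = sparse_dot(cw, acts)         # 3*1 + (-2)*7 + 1*6 = -5
```

With 5 of 8 weights pruned, this computes only 3 MACs instead of 8, which is the source of the cycle and energy savings the dataflow targets; the hardware additionally gates zero operands to save dynamic power.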