Paper Title
Low Latency CMOS Hardware Acceleration for Fully Connected Layers in Deep Neural Networks
Paper Authors
Paper Abstract
We present a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 128 High Bandwidth Memory (HBM) units for storing the pretrained weights. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent hardware accelerators for DNNs on AlexNet and VGG-16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 48.4 GOPS (with a 100 MHz clock), which improves on a recent FC8 layer accelerator quoted at 28.8 GOPS with a 150 MHz clock. We have achieved this considerable improvement by fully utilizing the HBM units for storing and reading out column-specific FC-layer weights in 1 cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath for processing these weights with the corresponding MAC and PE units. When up-scaled to 128 16x16 PEs, for 16x16 tiles of weights, the design can reduce latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG-16 when compared to an alternative EIE solution which uses compression.
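As a rough behavioral illustration of the column-wise scheduling idea described in the abstract, the Python sketch below models an FC layer in which, for each input activation x[j], a column of weights is read out and a bank of parallel MAC units accumulates x[j]*W[:, j] into the output vector. All names, sizes beyond those quoted in the abstract, and the cycle model are our assumptions, not the paper's design: the sketch abstracts away the 8x8/16x16 PE arrays and real HBM timing, so its cycle count is illustrative only and does not reproduce the reported GOPS figures.

```python
import numpy as np

# Sizes taken from the abstract's FC8 example (4096 inputs, 1000 outputs);
# the per-cycle MAC width of 128 mirrors the 128 MAC/HBM units. The loop
# structure below is a simplified stand-in for the column-row-column
# schedule, not the paper's exact micro-architecture.
N_IN, N_OUT, N_MAC = 4096, 1000, 128

def fc_column_schedule(W, x, n_mac=N_MAC):
    """Behavioral model of a column-major FC schedule.

    Each modeled 'cycle' is one memory read: an n_mac-wide slice of
    column j of W is multiplied by the scalar input x[j] and accumulated
    into the matching slice of y by n_mac MAC units firing in parallel.
    """
    n_out, n_in = W.shape
    y = np.zeros(n_out)
    cycles = 0
    for j in range(n_in):                  # one pass per input activation
        for base in range(0, n_out, n_mac):
            sl = slice(base, min(base + n_mac, n_out))
            y[sl] += W[sl, j] * x[j]       # parallel MACs, one slice/cycle
            cycles += 1
    return y, cycles

rng = np.random.default_rng(0)
W = rng.standard_normal((N_OUT, N_IN))
x = rng.standard_normal(N_IN)
y, cycles = fc_column_schedule(W, x)
assert np.allclose(y, W @ x)               # matches a dense matrix-vector product
print(f"simulated cycles: {cycles}")       # 4096 * ceil(1000/128) = 32768
```

Reading weights column by column lets every MAC unit reuse the same input scalar in a given cycle, which is what makes the fully parallel datapath possible; the paper's 1-cycle column readout from HBM would collapse the inner loop of this sketch further than the simple model shown here.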