Paper Title
Low Latency CMOS Hardware Acceleration for Fully Connected Layers in Deep Neural Networks
Paper Authors
Paper Abstract
We present a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 128 High Bandwidth Memory (HBM) units for storing the pretrained weights. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent hardware accelerators for DNNs on AlexNet and VGG-16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 48.4 GOPS (with a 100 MHz clock), which improves on a recent FC8 layer accelerator quoted at 28.8 GOPS with a 150 MHz clock. We have achieved this considerable improvement by fully utilizing the HBM units for storing and reading out column-specific FC-layer weights in 1 cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath for processing these weights with the corresponding MAC and PE units. When up-scaled to 128 16x16 PEs, for 16x16 tiles of weights, the design can reduce latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG-16 when compared to an alternative EIE solution which uses compression.
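As a rough behavioral illustration of the column-wise scheduling idea described in the abstract, the Python sketch below models an FC layer in which, for each input activation x[j], a column of weights is read out and a bank of parallel MAC units accumulates x[j]*W[:, j] into the output vector. All names, sizes beyond those quoted in the abstract, and the cycle model are our assumptions, not the paper's design: the sketch abstracts away the 8x8/16x16 PE arrays and real HBM timing, so its cycle count is illustrative only and does not reproduce the reported GOPS figures.

```python
import numpy as np

# Sizes taken from the abstract's FC8 example (4096 inputs, 1000 outputs);
# the per-cycle MAC width of 128 mirrors the 128 MAC/HBM units. The loop
# structure below is a simplified stand-in for the column-row-column
# schedule, not the paper's exact micro-architecture.
N_IN, N_OUT, N_MAC = 4096, 1000, 128

def fc_column_schedule(W, x, n_mac=N_MAC):
    """Behavioral model of a column-major FC schedule.

    Each modeled 'cycle' is one memory read: an n_mac-wide slice of
    column j of W is multiplied by the scalar input x[j] and accumulated
    into the matching slice of y by n_mac MAC units firing in parallel.
    """
    n_out, n_in = W.shape
    y = np.zeros(n_out)
    cycles = 0
    for j in range(n_in):                  # one pass per input activation
        for base in range(0, n_out, n_mac):
            sl = slice(base, min(base + n_mac, n_out))
            y[sl] += W[sl, j] * x[j]       # parallel MACs, one slice/cycle
            cycles += 1
    return y, cycles

rng = np.random.default_rng(0)
W = rng.standard_normal((N_OUT, N_IN))
x = rng.standard_normal(N_IN)
y, cycles = fc_column_schedule(W, x)
assert np.allclose(y, W @ x)               # matches a dense matrix-vector product
print(f"simulated cycles: {cycles}")       # 4096 * ceil(1000/128) = 32768
```

Reading weights column by column lets every MAC unit reuse the same input scalar in a given cycle, which is what makes the fully parallel datapath possible; the paper's 1-cycle column readout from HBM would collapse the inner loop of this sketch further than the simple model shown here.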