Paper Title

HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs

Authors

Mathew Hall, Vaughn Betz

Abstract

We present both a novel Convolutional Neural Network (CNN) accelerator architecture and a network compiler for FPGAs that outperforms all prior work. Instead of having generic processing elements that together process one layer at a time, our network compiler statically partitions the available device resources and builds custom-tailored hardware for each layer of a CNN. By building hardware for each layer, we can pack our controllers into fewer lookup tables and use dedicated routing. These efficiencies enable our accelerator to utilize 2x the DSPs and operate at more than 2x the frequency of prior work on sparse CNN acceleration on FPGAs. We evaluate the performance of our architecture on both sparse ResNet-50 and dense MobileNet ImageNet classifiers on a Stratix 10 2800 FPGA. We find that the sparse ResNet-50 model achieves a throughput of 4550 images/s at a batch size of 1, which is nearly 4x the throughput of NVIDIA's fastest machine-learning-targeted GPU, the V100, and outperforms all prior work on FPGAs.
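
The abstract does not describe how the compiler divides the chip, so the sketch below is only a guess at the general idea of static per-layer partitioning: splitting a fixed DSP budget across layers in proportion to each layer's multiply-accumulate (MAC) workload, so the per-layer pipeline stages are roughly rate-balanced. The proportional heuristic, function names, and MAC counts are illustrative assumptions, not the authors' actual method.

```python
# Illustrative sketch only: statically partition a fixed DSP budget across
# CNN layers in proportion to each layer's MAC count, so that per-layer
# pipeline stages take roughly equal time. The heuristic and all names are
# assumptions; this is not HPIPE's actual compiler.

def partition_dsps(layer_macs, total_dsps):
    """Return a per-layer DSP allocation that sums exactly to total_dsps."""
    total_macs = sum(layer_macs)
    # Give every layer at least one DSP, then allocate the rest
    # proportionally to its share of the network's total MACs.
    alloc = [max(1, round(total_dsps * m / total_macs)) for m in layer_macs]
    # Correct rounding drift one DSP at a time, always adjusting the
    # largest allocation so the relative imbalance stays small.
    drift = sum(alloc) - total_dsps
    while drift != 0:
        step = 1 if drift > 0 else -1
        i = max(range(len(alloc)), key=lambda j: alloc[j])
        alloc[i] -= step
        drift -= step
    return alloc


if __name__ == "__main__":
    # Hypothetical MAC counts for a small four-layer CNN.
    macs = [118_013_952, 924_844_032, 462_422_016, 51_380_224]
    # The Stratix 10 GX 2800 provides 5760 DSP blocks.
    print(partition_dsps(macs, total_dsps=5760))
```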
