Title
A Compilation Flow for the Generation of CNN Inference Accelerators on FPGAs
Authors
Abstract
We present a compilation flow for the generation of CNN inference accelerators on FPGAs. The flow translates a frozen model into OpenCL kernels with the TVM compiler and uses the Intel OpenCL SDK to compile the kernels to an FPGA bitstream. We improve the quality of the generated hardware with optimizations applied to the base OpenCL kernels generated by TVM. These optimizations increase parallelism, reduce memory access latency, increase concurrency, and save on-chip resources. We automate these optimizations in TVM and evaluate them by generating accelerators for LeNet-5, MobileNetV1, and ResNet-34 on an Intel Stratix 10 SX. We show that the optimizations improve the performance of the generated accelerators by up to 846X over the base accelerators. The optimized accelerators perform up to 4.57X better than TensorFlow on a CPU and 3.83X better than single-threaded TVM, but reach only 0.34X the performance of TVM with 56 threads. Our optimized kernels also outperform those generated by a similar approach that likewise uses high-level synthesis, while providing more functionality and flexibility. However, they underperform an approach that relies on hand-optimized designs. We therefore view our approach as useful in pre-production environments that benefit from increased performance and fast prototyping, realizing the benefits of FPGAs without hardware design expertise.
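The front half of the flow is driven through TVM's Python API. The snippet below is a minimal sketch of the import-and-build step under stated assumptions: the model path, input name, and input shape are hypothetical, the generic "opencl" target stands in for an Intel-FPGA-specific backend, and exact APIs vary across TVM versions.

```python
# Minimal sketch of the front half of the flow: import a frozen TensorFlow
# model into TVM and lower it to OpenCL kernels. Paths, input names, and
# shapes below are hypothetical; TVM APIs vary across versions.
import tensorflow as tf
import tvm
from tvm import relay

# Load the frozen GraphDef (hypothetical path).
with tf.io.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

# Import into Relay, TVM's high-level IR (hypothetical input name/shape).
mod, params = relay.frontend.from_tensorflow(
    graph_def, shape={"input": (1, 28, 28, 1)})

# Build for an OpenCL target. The kernels emitted here are the "base"
# kernels that the optimizations rewrite before the Intel OpenCL SDK
# compiles them to an FPGA bitstream.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=tvm.target.Target("opencl"), params=params)
```

Keeping the optimizations as automated rewrites inside TVM, rather than hand-editing the emitted kernels, is what preserves the fast-prototyping property the abstract emphasizes.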