Paper Title
LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition
Paper Authors
Paper Abstract
Deep neural networks (DNNs) have inspired new studies in myriad edge applications involving robots, autonomous agents, and Internet-of-Things (IoT) devices. However, performing DNN inference at the edge remains a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in many edge domains. Furthermore, because communication is costly, exploiting other available edge devices through data- or model-parallelism methods is not an effective solution. To benefit from available compute resources with low communication overhead, we propose the first DNN parallelization method aimed at reducing communication overhead in a distributed system. Our low-communication parallelization (LCP) method produces models that consist of several almost-independent, narrow branches. LCP offers close-to-minimum communication overhead with better distribution and parallelization opportunities, while significantly reducing the memory footprint and computation compared to data- and model-parallelism methods. We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also evaluate the performance of LCP models on customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16 mW, 0.107 mm² ASIC at 7 nm. LCP models achieve maximum and average speedups of 56x and 7x over the original models, which rise to an average speedup of up to 33x when combined with common optimizations such as pruning and quantization.
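The branch structure described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch of an LCP-style model, assuming several illustrative choices the abstract does not specify (branch width, depth, the number of branches, and averaging of per-branch logits as the merge rule); it is not the paper's exact architecture. Each narrow branch processes the input independently, so in a distributed deployment only the small per-branch output vectors would need to be communicated.

```python
# Minimal sketch of the branched structure the abstract describes: several
# almost-independent, narrow branches that share no activations until a
# single merge at the end. All widths/depths and the merge rule here are
# illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class NarrowBranch(nn.Module):
    """One narrow, self-contained branch; it could run on its own device."""
    def __init__(self, in_ch=3, width=16, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.head(f)  # per-branch class scores

class LCPStyleModel(nn.Module):
    """Branches run independently; the only cross-branch communication
    is the final merge of their small output vectors."""
    def __init__(self, num_branches=4, num_classes=10):
        super().__init__()
        self.branches = nn.ModuleList(
            NarrowBranch(num_classes=num_classes) for _ in range(num_branches)
        )

    def forward(self, x):
        # In a distributed deployment, each branch's forward pass would run
        # on a different device; only the num_classes-sized logits are
        # transferred and averaged here.
        return torch.stack([b(x) for b in self.branches]).mean(0)

logits = LCPStyleModel()(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```

Because the branches exchange no intermediate activations, each one can be mapped to a separate device, pruned, or quantized independently, which is consistent with the deployment scenarios and optimizations the abstract mentions.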