Paper Title

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Paper Authors

Zhekai Zhang, Hanrui Wang, Song Han, William J. Dally

Paper Abstract

Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner-product-based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer-product-based approach suffers from poor output locality due to numerous partial product matrices. Inefficient reuse of either input or output data leads to extensive and expensive DRAM accesses. To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality of both the input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stages of the partial matrices, so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces DRAM access by another 1.8x. We also resolve the increased input matrix reads induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing DRAM access by 1.5x. Evaluated on 20 benchmarks, SpArch reduces total DRAM access by 2.8x over the previous state of the art. On average, SpArch achieves speedups of 4x, 19x, 18x, 17x, and 1285x and energy savings of 6x, 164x, 435x, 307x, and 62x over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
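
To make the contrast between the two dataflows in the abstract concrete, below is a minimal Python sketch of inner-product versus outer-product SpGEMM on a toy coordinate-dictionary sparse format. All names in it (inner_product_spgemm, outer_product_spgemm, the example matrices A and B) are illustrative assumptions for this sketch; it models only the dataflows, not SpArch's streaming merger hardware.

# A minimal sketch (not SpArch's implementation): the two SpGEMM dataflows
# compared in the abstract, on a toy sparse format {(row, col): value}.

def inner_product_spgemm(A, B, n):
    """Inner-product dataflow: each output C[i, j] intersects row i of A with
    column j of B, so mismatched nonzeros cause redundant input fetches."""
    C = {}
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                a = A.get((i, k), 0)   # both operands are looked up even when
                b = B.get((k, j), 0)   # only one of them is actually nonzero
                if a and b:
                    acc += a * b
            if acc:
                C[(i, j)] = acc
    return C

def outer_product_spgemm(A, B, n):
    """Outer-product dataflow: column k of A times row k of B forms one partial
    product matrix; all partial matrices are then merged (summed) into C.
    Inputs are streamed once, but the many partial matrices hurt output locality."""
    partial_matrices = []
    for k in range(n):
        col_k = {i: v for (i, kk), v in A.items() if kk == k}
        row_k = {j: v for (kk, j), v in B.items() if kk == k}
        partial = {(i, j): a * b
                   for i, a in col_k.items()
                   for j, b in row_k.items()}
        if partial:
            partial_matrices.append(partial)
    # Merge phase: sum all partial product matrices into the final result.
    C = {}
    for p in partial_matrices:
        for key, v in p.items():
            C[key] = C.get(key, 0) + v
    return C

if __name__ == "__main__":
    # Hypothetical 3x3 inputs; both dataflows must produce the same product.
    A = {(0, 0): 1.0, (0, 2): 2.0, (1, 1): 3.0}
    B = {(0, 1): 4.0, (2, 2): 5.0, (1, 0): 6.0}
    assert inner_product_spgemm(A, B, 3) == outer_product_spgemm(A, B, 3)
    print(outer_product_spgemm(A, B, 3))  # {(0, 1): 4.0, (1, 0): 18.0, (0, 2): 10.0}

The final loop of outer_product_spgemm is the merge phase the abstract targets: instead of spilling partial matrices to DRAM, SpArch pipelines this merge on chip right after the multiplies and shrinks the number of partial matrices with the condensed representation and the Huffman tree scheduler.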
