Paper Title

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Authors

Shaohuai Shi, Qiang Wang, Xiaowen Chu

Abstract

Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas, such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures like GPUs. Sparse storage data structures keep sparse matrices in memory-saving formats, but their irregular data access patterns make it difficult to optimize SpDM performance on modern GPUs, resulting in low resource utilization and poor performance. In this paper, we use the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared-memory reuse, and more operations per byte of global memory traffic. We evaluate the algorithm on three Nvidia GPUs (GTX 980, GTX Titan X Pascal, and Tesla P100) with CUDA-8.0, using a large number of matrices that include a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves a 1.5-8$\times$ speedup over Nvidia's cuSPARSE library on many matrices. We also analyze instruction-level operations on a particular GPU to understand the performance gap between GCOOSpDM and cuSPARSE. The profiled instructions confirm that cuSPARSE spends much of its time on slow memory access (including DRAM and L2 cache access), whereas GCOOSpDM shifts such slow memory access to faster shared memory, which is the main source of the performance gain. The results also show that GCOOSpDM outperforms the dense algorithm (cuBLAS) at lower sparsity than cuSPARSE does on GPUs.
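For readers unfamiliar with SpDM, the computation the abstract describes can be sketched with a plain COO (coordinate) sparse layout. This is a minimal CPU-side illustration only; it is not the paper's customized GCOO format or its GPU kernel, and the function name `spdm_coo` is hypothetical.

```python
import numpy as np

def spdm_coo(rows, cols, vals, B, n_rows):
    """Compute C = A @ B where the sparse matrix A is given as COO
    triplets (rows[i], cols[i], vals[i]) and B is dense.

    Illustrative sketch only -- the paper's GCOOSpDM algorithm performs
    this computation on the GPU with coalesced global memory access and
    shared-memory reuse.
    """
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for r, c, v in zip(rows, cols, vals):
        # Each nonzero A[r, c] contributes v * B[c, :] to row r of C.
        C[r] += v * B[c]
    return C
```

The performance problem the paper targets is visible even in this sketch: the nonzeros dictate which rows of B are touched, so memory access is irregular and data-dependent, unlike the regular tiled access of a dense GEMM.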
