Paper Title
Scalable Graph Convolutional Network Training on Distributed-Memory Systems
Paper Authors
Paper Abstract
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs. The large data sizes of graphs and their vertex features make scalable training algorithms and distributed-memory systems necessary. Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges. We propose a highly parallel training algorithm that scales to large processor counts. In our solution, the large adjacency and vertex-feature matrices are partitioned among processors. We exploit the vertex partitioning of the graph to use non-blocking point-to-point communication operations between processors for better scalability. To further reduce parallelization overhead, we introduce a sparse-matrix partitioning scheme based on a hypergraph partitioning model for full-batch training. We also propose a novel stochastic hypergraph model to encode the expected communication volume in mini-batch training. We show the merits of the hypergraph model, previously unexplored for GCN training, over the standard graph partitioning model, which does not accurately encode communication costs. Experiments performed on real-world graph datasets demonstrate that the proposed algorithms achieve considerable speedups over alternative solutions. The communication-cost optimizations become even more pronounced at scale, with large processor counts. The performance benefits are preserved in deeper GCNs with more layers as well as on billion-scale graphs.
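The abstract describes the approach only at a high level. As a purely illustrative sketch of the kind of computation it refers to, the snippet below shows one 1D row-partitioned GCN layer, sigma(A H W), implemented with mpi4py: each rank owns a block of rows of the sparse adjacency matrix and the matching rows of the dense feature matrix, and remote feature rows are fetched with non-blocking point-to-point messages so that a rank communicates only with ranks whose vertices its local edges touch. This is not the paper's implementation; all function and parameter names (gcn_layer, row_ranges, etc.) are assumptions made for illustration.

```python
# Illustrative sketch (not the paper's code): one distributed GCN layer
# with 1D row partitioning of the adjacency and feature matrices.
import numpy as np
import scipy.sparse as sp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()


def gcn_layer(A_local, H_local, W, row_ranges):
    """A_local: csr_matrix holding this rank's rows of A (global column indices).
    H_local: this rank's rows of the dense feature matrix H.
    W: layer weight matrix, replicated on every rank.
    row_ranges: (start, end) global vertex range owned by each rank."""
    n_local, f = H_local.shape
    s_own, e_own = row_ranges[rank]

    # Global column indices touched by the local adjacency block.
    needed = np.unique(A_local.indices)

    # Phase 1: tell each owner which of its feature rows we need
    # (non-blocking sends of small index lists, then blocking receives).
    idx_reqs, want = [], {}
    for p, (s, e) in enumerate(row_ranges):
        if p == rank:
            continue
        want[p] = needed[(needed >= s) & (needed < e)]
        idx_reqs.append(comm.isend(want[p] - s, dest=p, tag=0))
    asked = {p: comm.recv(source=p, tag=0) for p in range(nprocs) if p != rank}
    MPI.Request.waitall(idx_reqs)

    # Phase 2: exchange the actual feature rows with non-blocking
    # point-to-point Isend/Irecv on contiguous NumPy buffers.
    reqs, send_buf, recv_buf = [], {}, {}
    for p, idx in asked.items():
        if len(idx):
            send_buf[p] = np.ascontiguousarray(H_local[idx])
            reqs.append(comm.Isend(send_buf[p], dest=p, tag=1))
    for p, idx in want.items():
        if len(idx):
            recv_buf[p] = np.empty((len(idx), f), dtype=H_local.dtype)
            reqs.append(comm.Irecv(recv_buf[p], source=p, tag=1))
    MPI.Request.Waitall(reqs)

    # Assemble the compacted slice of H needed by the local SpMM.
    H_needed = np.zeros((len(needed), f), dtype=H_local.dtype)
    own = needed[(needed >= s_own) & (needed < e_own)]
    H_needed[np.searchsorted(needed, own)] = H_local[own - s_own]
    for p, idx in want.items():
        if len(idx):
            H_needed[np.searchsorted(needed, idx)] = recv_buf[p]

    # Local sparse-dense multiply with columns remapped to the compact
    # index space, then the dense weight multiply and a ReLU activation.
    A_compact = sp.csr_matrix(
        (A_local.data, np.searchsorted(needed, A_local.indices), A_local.indptr),
        shape=(n_local, len(needed)),
    )
    return np.maximum(A_compact @ H_needed @ W, 0.0)
```

In a sketch like this, the volume of data moved in Phase 2 is exactly the number of feature rows that cross partition boundaries, which is the quantity a (hyper)graph partitioner is asked to minimize; this is why the choice of partitioning model directly drives communication cost in the abstract's argument.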