Paper Title


A Distributed Multi-GPU System for Large-Scale Node Embedding at Tencent

Paper Authors

Wanjing Wei, Yangzihao Wang, Pin Gao, Shijie Sun, Donghai Yu

Paper Abstract


Real-world node embedding applications often contain hundreds of billions of edges with high-dimensional node features. Scaling node embedding systems to efficiently support these applications remains a challenging problem. In this paper, we present a high-performance multi-GPU node embedding system. It uses model parallelism to split node embeddings onto each GPU's local parameter server, and data parallelism to train these embeddings on different edge samples in parallel. We propose a hierarchical data partitioning strategy and an embedding training pipeline to optimize both communication and memory usage on a GPU cluster. With the decoupled design of CPU tasks (random walk) and GPU tasks (embedding training), our system is highly flexible and can fully utilize all computing resources on a GPU cluster. Compared with the current state-of-the-art multi-GPU single-node embedding system, our system achieves a 5.9x-14.4x speedup on average with competitive or better accuracy on open datasets. Using 40 NVIDIA V100 GPUs on a network with almost three hundred billion edges and more than one billion nodes, our implementation requires only 3 minutes to finish one training epoch.
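To make the parallelism scheme in the abstract concrete, below is a minimal CPU-only sketch, not the paper's actual implementation: embeddings are hash-partitioned into per-device shards (model parallelism), and each worker applies SGD-style updates to its own edge samples (data parallelism). All names here (`partition`, `lookup`, `sgns_step`, `num_gpus`) are hypothetical illustrations.

```python
# Illustrative sketch (our own, hypothetical) of model-parallel embedding
# shards plus data-parallel training on edge samples, as described in the
# abstract. Runs on CPU with numpy for clarity.
import numpy as np

num_gpus = 4            # number of embedding shards (one per "GPU")
num_nodes = 10_000
dim = 64
rng = np.random.default_rng(0)

# Model parallelism: each device owns the embeddings hashed to it,
# acting as a local parameter server for that shard.
def partition(node_id: int) -> int:
    return node_id % num_gpus

shards = [
    {v: rng.normal(scale=0.1, size=dim)
     for v in range(num_nodes) if partition(v) == g}
    for g in range(num_gpus)
]

def lookup(v: int) -> np.ndarray:
    # Returns a view into the owning shard, so updates below are in place.
    return shards[partition(v)][v]

# Data parallelism: each worker trains on its own edge samples. In the real
# system, samples would come from CPU-side random walks running concurrently
# with GPU training (the decoupled design mentioned in the abstract).
def sgns_step(src: int, dst: int, label: float, lr: float = 0.025) -> None:
    """One skip-gram-with-negative-sampling update for an edge sample."""
    u, v = lookup(src), lookup(dst)
    score = 1.0 / (1.0 + np.exp(-(u @ v)))   # sigmoid of the dot product
    grad = score - label
    gu, gv = grad * v, grad * u               # gradients w.r.t. old values
    u -= lr * gu                              # in-place shard updates
    v -= lr * gv

for _ in range(100):
    s, d = rng.integers(num_nodes, size=2)
    sgns_step(s, d, label=1.0)                # positive (observed) edge
    n = rng.integers(num_nodes)
    sgns_step(s, n, label=0.0)                # negative sample
```

The point of the hash partitioning is that no device ever holds the full embedding table; each lookup is routed to exactly one shard, which is what lets a system of this shape scale to billion-node graphs that cannot fit in a single GPU's memory.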
