Paper Title

Multi-node Bert-pretraining: Cost-efficient Approach

Paper Authors

Jiahuang Lin, Xin Li, Gennady Pekhimenko

Abstract

Recently, large scale Transformer-based language models such as BERT, GPT-2, and XLNet have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks. One of the common trends in these recent models is a significant increase in model complexity, which introduces both more weights and computation. Moreover, with the advent of large-scale unsupervised datasets, training time is further extended due to the increased amount of data samples within a single training epoch. As a result, to train these models within a reasonable time, machine learning (ML) programmers often require advanced hardware setups such as the premium GPU-enabled NVIDIA DGX workstations or specialized accelerators such as Google's TPU Pods. Our work addresses this limitation and demonstrates that the BERT pre-trained model can be trained within 2 weeks on an academic-size cluster of widely available GPUs through careful algorithmic and software optimizations. In this paper, we present these optimizations on how to improve single device training throughput, distribute the training workload over multiple nodes and GPUs, and overcome the communication bottleneck introduced by the large data exchanges over the network. We show that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an academic setting, but with a much less expensive and less aggressive hardware resource requirement than in previously demonstrated industrial settings based on NVIDIA DGX machines or Google's TPU Pods.
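To make the training setup described above more concrete, the sketch below illustrates one common way to run multi-node, multi-GPU data-parallel pre-training and to reduce gradient traffic over the network. It is not the authors' code: it assumes PyTorch with DistributedDataParallel launched via torchrun, and build_bert_model, build_pretraining_dataset, and compute_mlm_nsp_loss are hypothetical placeholders. Gradient accumulation with no_sync() stands in, for illustration only, for the communication-reducing optimizations the paper discusses.

```python
# Minimal sketch of multi-node data-parallel pre-training (assumptions noted above).
import os
from contextlib import nullcontext

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def main():
    # One process per GPU; rank, world size, and LOCAL_RANK are provided by
    # the launcher (e.g. torchrun) through environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_bert_model().cuda(local_rank)      # hypothetical model constructor
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    dataset = build_pretraining_dataset()            # hypothetical dataset
    sampler = DistributedSampler(dataset)            # shards samples across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    accumulation_steps = 4                           # all-reduce once per 4 micro-batches
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        # Assumes each batch is a dict of tensors; move it to this rank's GPU.
        batch = {k: v.cuda(local_rank) for k, v in batch.items()}
        is_sync_step = (step + 1) % accumulation_steps == 0
        # model.no_sync() skips the gradient all-reduce on intermediate
        # micro-batches, reducing how often gradients cross the network.
        with (nullcontext() if is_sync_step else model.no_sync()):
            loss = compute_mlm_nsp_loss(model, batch)    # hypothetical loss helper
            (loss / accumulation_steps).backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```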
