Paper Title

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Paper Authors

Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He

Paper Abstract

Recent advances on deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost (\$3.7K if rent on Azure), while still maintaining 95% of model quality compared to baseline with full data and cost (\$46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.
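
To make the two techniques named in the abstract concrete, the sketch below shows the core ideas in plain PyTorch. This is an illustrative sketch, not the DeepSpeed Data Efficiency API: the function names, the linear pacing schedule, and the keep_ratio parameter are assumptions for exposition. Curriculum-style sampling draws batches from a growing pool of difficulty-sorted examples, and random layerwise token dropping routes only a random subset of token positions through a given layer while passing the rest through unchanged.

import torch

def curriculum_sample(sorted_dataset, step, total_steps, batch_size, min_fraction=0.1):
    # Illustrative curriculum-learning sampler (assumed linear pacing schedule):
    # early in training, draw only from the easiest `min_fraction` of a
    # difficulty-sorted dataset, then grow the eligible pool until all data is used.
    fraction = min(1.0, min_fraction + (1.0 - min_fraction) * step / total_steps)
    pool_size = max(batch_size, int(fraction * len(sorted_dataset)))
    idx = torch.randint(0, pool_size, (batch_size,))
    return [sorted_dataset[i] for i in idx]

def random_layerwise_token_drop(hidden_states, layer, keep_ratio=0.5):
    # Illustrative random layerwise token dropping: route only a random subset of
    # token positions through `layer`; the skipped positions are copied through
    # unchanged, so the layer's compute shrinks roughly in proportion to keep_ratio.
    batch, seq_len, hidden = hidden_states.shape
    num_keep = max(1, int(seq_len * keep_ratio))
    keep_idx = torch.randperm(seq_len)[:num_keep]
    output = hidden_states.clone()
    output[:, keep_idx, :] = layer(hidden_states[:, keep_idx, :])
    return output

# Tiny usage example with a stand-in "layer" (a linear map over the hidden dimension).
layer = torch.nn.Linear(16, 16)
x = torch.randn(2, 8, 16)
y = random_layerwise_token_drop(x, layer, keep_ratio=0.5)

In the actual framework, both pieces are combined: the sampler decides which data enters each batch, and the token-dropping routing decides how much of each sequence every layer computes on, which is where the reported data/time/cost savings come from.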
