Paper Title
OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
Paper Authors
Paper Abstract
Large-scale deep learning models contribute to significant performance improvements on a variety of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency due to complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of both data and model parallelism. Given the model description and the device information, OSDP trades off memory consumption against hardware utilization, automatically generating the distributed computation graph that maximizes overall system throughput. In addition, OSDP introduces operator splitting to further reduce the peak memory footprint during training with negligible overhead, enabling the training of larger models as well as higher throughput. Extensive experiments on multiple kinds of large-scale models demonstrate that OSDP outperforms the state-of-the-art in multiple respects. Our code is available at https://github.com/Youhe-Jiang/OptimalShardedDataParallel.
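To make the memory/throughput trade-off described in the abstract concrete, the following is a minimal illustrative sketch, not the OSDP implementation or its API. It assumes a toy cost model (hypothetical operator names, memory sizes, and step times) and brute-forces, for each operator, whether to replicate its states (data-parallel style, faster but memory-hungry) or shard them across devices (extra communication, but lighter on memory), picking the fastest plan that fits a per-device memory budget.

```python
# Illustrative sketch only -- not the OSDP system or its API.
# Toy cost model: each operator has per-device memory and per-step time
# under "replicate" vs. "shard"; we search for the fastest feasible plan.
from dataclasses import dataclass
from itertools import product


@dataclass
class Op:
    name: str
    replicate_mem: float   # GB per device if states are fully replicated
    shard_mem: float       # GB per device if states are sharded
    replicate_time: float  # ms per step if replicated
    shard_time: float      # ms per step if sharded (includes gather/scatter)


def best_plan(ops, mem_budget_gb):
    """Brute-force the per-operator shard/replicate choices (hypothetical cost model)."""
    best = None
    for choices in product(("replicate", "shard"), repeat=len(ops)):
        mem = sum(o.replicate_mem if c == "replicate" else o.shard_mem
                  for o, c in zip(ops, choices))
        if mem > mem_budget_gb:
            continue  # plan does not fit in device memory
        time = sum(o.replicate_time if c == "replicate" else o.shard_time
                   for o, c in zip(ops, choices))
        if best is None or time < best[0]:
            best = (time, dict(zip((o.name for o in ops), choices)))
    return best


# Hypothetical three-operator model and a 12 GB per-device budget.
ops = [Op("embed", 4.0, 1.0, 10.0, 14.0),
       Op("attn",  6.0, 1.5, 20.0, 27.0),
       Op("mlp",   8.0, 2.0, 25.0, 33.0)]
print(best_plan(ops, mem_budget_gb=12.0))
```

The actual system additionally considers operator splitting and uses the model description and device information to build the distributed computation graph; this sketch only conveys the flavor of the per-operator decision problem.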