Paper Title

An Analysis of Collocation on GPUs for Deep Learning Training

Paper Authors

Robroek, Ties; Yousefzadeh-Asl-Miandoab, Ehsan; Tözün, Pınar

Paper Abstract

Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads that do not require all the memory and compute resources of a full GPU. In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models. We contrast the benefits of MIG to older workload collocation methods on GPUs: naïvely submitting multiple processes on the same GPU and utilizing Multi-Process Service (MPS). Our results demonstrate that collocating multiple model training runs may yield significant benefits. In certain cases, it can lead to up to four times the training throughput despite increased epoch time. On the other hand, the aggregate memory footprint and compute needs of the models trained in parallel must fit within the available memory and compute resources of the GPU. MIG can be beneficial thanks to its interference-free partitioning, especially when the sizes of the models align with the MIG partitioning options. MIG's rigid partitioning, however, may create sub-optimal GPU utilization for more dynamic mixed workloads. In general, we recommend MPS as the best-performing and most flexible form of collocation for model training when a single user submits training jobs.
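For readers who want to try the three collocation modes compared above, the sketch below shows how they are typically driven from a launcher script. It is a minimal illustration, not the authors' experimental harness: `train.py` and its arguments are placeholders for your own training command, and the MIG instance UUIDs are placeholders for the real ones reported by `nvidia-smi -L`. MPS additionally requires the control daemon to be started beforehand with `nvidia-cuda-mps-control -d`.

```python
import os
import subprocess

# Placeholder training command; substitute your own script and flags.
TRAIN_CMD = ["python", "train.py"]

def launch(cuda_visible_devices):
    """Start one training process restricted to the given device string."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = cuda_visible_devices
    return subprocess.Popen(TRAIN_CMD, env=env)

# 1) Naive collocation: two processes share GPU 0 and contend for it freely.
procs = [launch("0") for _ in range(2)]

# 2) MPS: launched exactly as in (1), but with the MPS control daemon running
#    (`nvidia-cuda-mps-control -d`), so CUDA funnels both processes through a
#    single shared GPU context instead of time-slicing separate ones.

# 3) MIG: each process is pinned to its own hardware partition by naming a
#    MIG instance UUID (placeholders below; list real ones with `nvidia-smi -L`).
# procs = [launch(uuid) for uuid in ("MIG-<uuid-0>", "MIG-<uuid-1>")]

for p in procs:
    p.wait()
```

The three modes differ only in how processes are mapped to the GPU, which is why a single launcher like this can cover all of them: naive sharing and MPS address the whole GPU, while MIG addresses fixed partitions of it.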
