Paper Title
Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures
Paper Authors
Paper Abstract
During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on Recommender Systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we are able to achieve more than two orders of magnitude improvement in performance (110x) on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets. This paper discusses the optimization techniques for the various operators in DLRM and which components of the system are stressed by these different operators. The presented techniques are applicable to a broader set of DL workloads that pose the same scaling challenges/characteristics as DLRM.
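For readers unfamiliar with the workload, the sketch below illustrates the operator mix the abstract refers to: memory-bound EmbeddingBag lookups for the sparse categorical features, compute-bound MLPs for the dense features, and a pairwise dot-product feature interaction in between. This is a minimal PyTorch rendering of the publicly described DLRM structure, not the paper's implementation; the table count, table size, and layer widths are illustrative assumptions.

```python
# Minimal sketch of the DLRM operator structure (assumed shapes/sizes,
# chosen only for illustration).
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, num_tables=4, rows_per_table=1000, emb_dim=16, dense_dim=13):
        super().__init__()
        # Sparse features: one EmbeddingBag per categorical feature; the
        # sum-reduced lookups are the memory-bandwidth-bound operator.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(rows_per_table, emb_dim, mode="sum")
            for _ in range(num_tables)
        )
        # Dense features: the bottom MLP projects to the embedding
        # dimension; the MLPs are the compute-bound operators.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(dense_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim), nn.ReLU()
        )
        # Top MLP consumes the dense vector plus all pairwise dot products.
        num_vecs = num_tables + 1
        num_pairs = num_vecs * (num_vecs - 1) // 2
        self.top_mlp = nn.Sequential(
            nn.Linear(emb_dim + num_pairs, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense_x, sparse_ids):
        # sparse_ids: list of (batch,) index tensors, one per table.
        d = self.bottom_mlp(dense_x)
        embs = [t(ids.unsqueeze(1)) for t, ids in zip(self.tables, sparse_ids)]
        vecs = torch.stack([d] + embs, dim=1)          # (batch, num_vecs, emb_dim)
        inter = torch.bmm(vecs, vecs.transpose(1, 2))  # pairwise dot products
        iu, ju = torch.triu_indices(vecs.size(1), vecs.size(1), offset=1)
        z = torch.cat([d, inter[:, iu, ju]], dim=1)
        return torch.sigmoid(self.top_mlp(z))

# Toy forward pass: batch of 8, 13 dense features, 4 sparse features.
model = TinyDLRM()
out = model(torch.randn(8, 13), [torch.randint(0, 1000, (8,)) for _ in range(4)])
```

The split matters for the abstract's claim: the embedding lookups stress memory capacity and bandwidth (hence "fitting ultra-large datasets" across sockets), while the MLPs stress compute, so each operator class calls for different optimizations and stresses a different system component.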