Paper Title

Straggler-aware Distributed Learning: Communication Computation Latency Trade-off

Paper Authors

Emre Ozfatura, Sennur Ulukus, Deniz Gunduz

Paper Abstract

When gradient descent (GD) is scaled to many parallel workers for large-scale machine learning problems, its per-iteration computation time is limited by the straggling workers. Straggling workers can be tolerated by assigning redundant computations and coding across data and computations, but in most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations. Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as straggler/non-straggler and discarding partial computations carried out by stragglers. In this paper, to overcome these drawbacks, we consider multi-message communication (MMC) by allowing multiple computations to be conveyed from each worker per iteration, and design straggler avoidance techniques accordingly. Then, we analyze how the proposed designs can be employed efficiently to seek a balance between the computation and communication latency to minimize the overall latency. Furthermore, through extensive simulations, both model-based and real implementation on Amazon EC2 servers, we identify the advantages and disadvantages of these designs in different settings, and demonstrate that MMC can help improve upon existing straggler avoidance schemes.
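The abstract contrasts conventional single-message schemes, where each worker reports only after finishing all assigned computations, with multi-message communication (MMC), where every finished partial gradient is sent to the PS immediately so that stragglers' partial work can still be used. The sketch below is a minimal, model-based latency simulation of that idea only, not the paper's actual coding or scheduling schemes: the cyclic replication, the exponential slowdown model, the fixed per-message delay, and all names (e.g. `iteration_latency`, `COMM_DELAY`) are assumptions made for illustration.

```python
# Minimal sketch: compare per-iteration latency of single-message reporting
# vs. multi-message communication (MMC) under an assumed straggling model.
import numpy as np

rng = np.random.default_rng(0)

N_WORKERS = 10      # number of workers
N_PARTITIONS = 10   # data partitions; the full gradient needs all of them
REDUNDANCY = 2      # each partition is replicated on this many workers (assumed)
COMM_DELAY = 0.01   # assumed fixed per-message communication latency


def partition_assignment():
    """Cyclically replicate each partition on REDUNDANCY consecutive workers."""
    assignment = [[] for _ in range(N_WORKERS)]
    for p in range(N_PARTITIONS):
        for r in range(REDUNDANCY):
            assignment[(p + r) % N_WORKERS].append(p)
    return assignment


def sample_compute_times():
    """Per-partition computation time of each worker, with exponential straggling."""
    base = 1.0 / N_PARTITIONS
    slowdown = 1.0 + rng.exponential(2.0, size=N_WORKERS)  # stragglers are slow
    return base * slowdown


def iteration_latency(per_task, mmc):
    """Time until the PS holds every partition's gradient from at least one worker."""
    assignment = partition_assignment()
    first_arrival = np.full(N_PARTITIONS, np.inf)
    for w, parts in enumerate(assignment):
        for k, p in enumerate(parts):  # workers process their partitions in order
            if mmc:
                # MMC: each partial gradient is sent as soon as it is computed
                arrival = per_task[w] * (k + 1) + COMM_DELAY
            else:
                # single message: nothing arrives before the worker finishes everything
                arrival = per_task[w] * len(parts) + COMM_DELAY
            first_arrival[p] = min(first_arrival[p], arrival)
    return float(first_arrival.max())


if __name__ == "__main__":
    trials = 2000
    single, multi = [], []
    for _ in range(trials):
        per_task = sample_compute_times()          # same realization for both schemes
        single.append(iteration_latency(per_task, mmc=False))
        multi.append(iteration_latency(per_task, mmc=True))
    print(f"avg per-iteration latency, one message per worker: {np.mean(single):.3f}")
    print(f"avg per-iteration latency, MMC:                    {np.mean(multi):.3f}")
```

Under this toy model, MMC lowers the average iteration latency because a slow worker's first partial gradients can still reach the PS before a replica on another worker finishes; the paper's designs additionally weigh the extra communication cost of sending many small messages.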
