Paper Title
Anytime MiniBatch: Exploiting Stragglers in Online Distributed Optimization
Paper Authors
Paper Abstract
Distributed optimization is vital in solving large-scale machine learning problems. A widely-shared feature of distributed optimization techniques is the requirement that all nodes complete their assigned tasks in each computational epoch before the system can proceed to the next epoch. In such settings, slow nodes, called stragglers, can greatly slow progress. To mitigate the impact of stragglers, we propose an online distributed optimization method called Anytime Minibatch. In this approach, all nodes are given a fixed time to compute the gradients of as many data samples as possible. The result is a variable per-node minibatch size. Workers then get a fixed communication time to average their minibatch gradients via several rounds of consensus, which are then used to update primal variables via dual averaging. Anytime Minibatch prevents stragglers from holding up the system without wasting the work that stragglers can complete. We present a convergence analysis and analyze the wall time performance. Our numerical results show that our approach is up to 1.5 times faster on Amazon EC2, and up to five times faster when there is greater variability in compute node performance.
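The abstract describes a three-phase epoch: a fixed-duration compute phase yielding variable per-node minibatch sizes, a fixed-duration communication phase of consensus averaging, and a dual-averaging primal update. The following is a minimal single-machine simulation sketch of that idea, not the authors' reference implementation; the worker count, time budget, consensus rounds, ring-topology Metropolis mixing matrix, quadratic objective, and constant step size are all illustrative assumptions.

```python
# Illustrative sketch of the Anytime Minibatch epoch structure (assumed parameters).
import time
import numpy as np

rng = np.random.default_rng(0)

n_workers = 4          # number of compute nodes (assumed)
dim = 10               # dimension of the primal variable
compute_budget = 0.01  # fixed wall-clock time (seconds) each worker gets per epoch
consensus_rounds = 5   # fixed number of gossip rounds in the communication phase
epochs = 50
alpha = 0.1            # dual-averaging step size (assumed constant for simplicity)

# Synthetic least-squares objective: f(x) = E[(a^T x - b)^2]; one-sample gradient.
x_true = rng.normal(size=dim)
def sample_gradient(x):
    a = rng.normal(size=dim)
    b = a @ x_true + 0.1 * rng.normal()
    return 2.0 * (a @ x - b) * a

# Doubly stochastic mixing matrix for a ring of workers (Metropolis-style weights).
W = np.zeros((n_workers, n_workers))
for i in range(n_workers):
    W[i, i] = 0.5
    W[i, (i - 1) % n_workers] = 0.25
    W[i, (i + 1) % n_workers] = 0.25

x = np.zeros((n_workers, dim))  # per-worker primal variables
z = np.zeros((n_workers, dim))  # per-worker dual (accumulated-gradient) variables

for epoch in range(epochs):
    # Compute phase: each worker accumulates sample gradients until its time is up.
    # A slower worker simply contributes fewer samples (variable minibatch size),
    # so its partial work is still used rather than discarded.
    g = np.zeros((n_workers, dim))
    counts = np.zeros(n_workers)
    for i in range(n_workers):
        deadline = time.perf_counter() + compute_budget
        while time.perf_counter() < deadline:
            g[i] += sample_gradient(x[i])
            counts[i] += 1

    # Communication phase: a fixed number of consensus (gossip) rounds that
    # approximately average both the gradient sums and the sample counts.
    for _ in range(consensus_rounds):
        g = W @ g
        counts = W @ counts

    # Dual-averaging update with the (approximately) network-averaged minibatch
    # gradient; with psi(x) = ||x||^2 / 2 the proximal step reduces to a scaling.
    minibatch_grad = g / np.maximum(counts[:, None], 1.0)
    z += minibatch_grad
    x = -alpha * z

print("distance to optimum per worker:", np.linalg.norm(x - x_true, axis=1))
```

In this sketch the epoch length is set by the fixed time budget rather than by the slowest node, which is the mechanism the abstract credits for mitigating stragglers without wasting their partial work.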