Paper Title
Random Reshuffling: Simple Analysis with Vast Improvements
Paper Authors
Paper Abstract
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention recently and, for strongly convex and smooth functions, it was shown to converge faster than SGD if 1) the stepsize is small, 2) the gradients are bounded, and 3) the number of epochs is large. We remove these 3 assumptions, improve the dependence on the condition number from $\kappa^2$ to $\kappa$ (resp. from $\kappa$ to $\sqrt{\kappa}$) and, in addition, show that RR has a different type of variance. We argue through theory and experiments that the new variance type gives an additional justification of the superior performance of RR. To go beyond strong convexity, we present several results for non-strongly convex and non-convex objectives. We show that in all cases, our theory improves upon existing literature. Finally, we prove fast convergence of the Shuffle-Once (SO) algorithm, which shuffles the data only once, at the beginning of the optimization process. Our theory for strongly-convex objectives tightly matches the known lower bounds for both RR and SO and substantiates the common practical heuristic of shuffling once or only a few times. As a byproduct of our analysis, we also get new results for the Incremental Gradient algorithm (IG), which does not shuffle the data at all.
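To make the three data orderings discussed in the abstract concrete, below is a minimal Python sketch (not the authors' code) of how RR, SO, and IG differ only in how the pass order over the $n$ component functions is chosen per epoch. The function name `run_epochs`, the `grads` list of per-component gradient callables, and the constant stepsize are illustrative assumptions, not part of the paper.

```python
import numpy as np

def run_epochs(grads, x0, stepsize, n_epochs, scheme="RR", seed=0):
    """Illustrative sketch of the three data-ordering schemes.

    grads    -- list of gradient callables grad_i(x) for f(x) = (1/n) * sum_i f_i(x)
    scheme   -- "RR": reshuffle every epoch; "SO": shuffle once; "IG": fixed order
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    x = np.asarray(x0, dtype=float)
    # SO draws a single random permutation before the first epoch; IG keeps the natural order.
    perm = rng.permutation(n) if scheme == "SO" else np.arange(n)
    for _ in range(n_epochs):
        if scheme == "RR":
            perm = rng.permutation(n)        # fresh permutation at the start of each epoch
        for i in perm:                        # one full pass over all n components
            x = x - stepsize * grads[i](x)    # incremental gradient step
    return x
```

All three schemes perform the same number of gradient evaluations per epoch; they differ only in whether the permutation is redrawn every epoch (RR), drawn once (SO), or never drawn (IG).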