Paper Title
Two-Tailed Averaging: Anytime, Adaptive, Once-in-a-While Optimal Weight Averaging for Better Generalization
Paper Authors
Paper Abstract
Tail Averaging improves on Polyak averaging's non-asymptotic behaviour by excluding a number of leading iterates of stochastic optimization from its calculations. In practice, with a finite number of optimization steps and a learning rate that cannot be annealed to zero, Tail Averaging can get much closer to a local minimum point of the training loss than either the individual iterates or the Polyak average. However, the number of leading iterates to ignore is an important hyperparameter, and starting averaging too early or too late leads to inefficient use of resources or suboptimal solutions. Our work focuses on improving generalization, which makes setting this hyperparameter even more difficult, especially in the presence of other hyperparameters and overfitting. Furthermore, before averaging starts, the loss is only weakly informative of the final performance, which makes early stopping unreliable. To alleviate these problems, we propose an anytime variant of Tail Averaging intended for improving generalization, not pure optimization, that has no hyperparameters and approximates the optimal tail at all optimization steps. Our algorithm is based on two running averages with adaptive lengths bounded in terms of the optimal tail length, one of which achieves approximate optimality with some regularity. Requiring only the additional storage for two sets of weights and periodic evaluation of the loss, the proposed Two-Tailed Averaging algorithm is a practical and widely applicable method for improving generalization.
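The abstract's description of Two-Tailed Averaging can be sketched as follows. This is a minimal illustration consistent with the text (two running averages of the iterates, periodic loss evaluation, the shorter average replacing the longer one when it evaluates at least as well, at which point a fresh short average is started); the exact reset rule and comparison are assumptions here, and the paper's precise algorithm may differ.

```python
class TwoTailedAverager:
    """Sketch of Two-Tailed Averaging per the abstract: maintain two
    running averages of the optimizer's iterates with adaptive lengths,
    storing only two extra sets of weights plus their counts."""

    def __init__(self, dim):
        self.sum_long = [0.0] * dim   # sum of iterates in the longer tail
        self.n_long = 0
        self.sum_short = [0.0] * dim  # sum of iterates in the shorter tail
        self.n_short = 0

    def update(self, weights):
        # Called after each optimization step: fold the current iterate
        # into both running averages.
        for i, w in enumerate(weights):
            self.sum_long[i] += w
            self.sum_short[i] += w
        self.n_long += 1
        self.n_short += 1

    @staticmethod
    def _avg(total, n):
        return [x / n for x in total]

    def evaluate_and_adapt(self, loss_fn):
        # Called periodically: compare the two averaged weight vectors.
        # If the shorter tail is at least as good, the longer tail began
        # too early, so replace it and restart the short tail (this
        # specific reset rule is an assumption, not taken verbatim from
        # the paper). Returns the current best averaged weights.
        avg_long = self._avg(self.sum_long, self.n_long)
        if self.n_short == 0:
            return avg_long
        avg_short = self._avg(self.sum_short, self.n_short)
        if loss_fn(avg_short) <= loss_fn(avg_long):
            self.sum_long = list(self.sum_short)
            self.n_long = self.n_short
            self.sum_short = [0.0] * len(self.sum_short)
            self.n_short = 0
            return self._avg(self.sum_long, self.n_long)
        return avg_long
```

In a training loop, `update` would be called with the model weights after every optimizer step, and `evaluate_and_adapt` at each periodic (e.g. validation) loss evaluation, matching the "anytime" usage described in the abstract.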