Paper Title
Disentangling Adaptive Gradient Methods from Learning Rates
Paper Authors
Paper Abstract
We investigate several confounding factors in the evaluation of optimization algorithms for deep learning. Primarily, we take a deeper look at how adaptive gradient methods interact with the learning rate schedule, a notoriously difficult-to-tune hyperparameter which has dramatic effects on the convergence and generalization of neural network training. We introduce a "grafting" experiment which decouples an update's magnitude from its direction, finding that many existing beliefs in the literature may have arisen from insufficient isolation of the implicit schedule of step sizes. Alongside this contribution, we present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods, aimed at bringing more clarity to this space.
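As a rough illustration of the grafting idea described in the abstract, the sketch below combines the step magnitude of one optimizer with the step direction of another. This is a minimal sketch, assuming a global-norm rescaling and a toy setup (SGD supplying the magnitude, AdaGrad supplying the direction, a simple quadratic objective); the paper's actual experimental configuration may differ, e.g. in which optimizers are paired and at what granularity norms are taken.

```python
import numpy as np

def sgd_step(grad, lr=0.1):
    # Step that plain SGD would take (this quantity is subtracted from w).
    return lr * grad

def adagrad_step(grad, accum, lr=0.1, eps=1e-8):
    # Step that AdaGrad would take; `accum` is the running sum of squared gradients.
    accum += grad ** 2
    return lr * grad / (np.sqrt(accum) + eps), accum

def grafted_step(step_m, step_d, eps=1e-16):
    # Grafting: take the direction of step_d, rescaled to the magnitude of step_m.
    norm_m = np.linalg.norm(step_m)
    norm_d = np.linalg.norm(step_d)
    if norm_d < eps:
        return step_m  # degenerate direction: fall back to the magnitude optimizer's step
    return (norm_m / norm_d) * step_d

# Toy usage: minimize f(w) = 0.5 * w^T A w with grafted updates.
A = np.diag([10.0, 1.0])
w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
for t in range(100):
    grad = A @ w
    m = sgd_step(grad)                    # magnitude: SGD's implicit step-size schedule
    d, accum = adagrad_step(grad, accum)  # direction: the adaptive method's update
    w -= grafted_step(m, d)
print("final iterate:", w, "loss:", 0.5 * w @ A @ w)
```

The point of the construction is the decoupling itself: by swapping which optimizer supplies the magnitude and which supplies the direction, one can test whether an optimizer's behavior is driven by its implicit step-size schedule or by the geometry of its updates.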