Paper Title

Adaptive Gradient Method with Resilience and Momentum

Authors

Jie Liu, Chen Lin, Chuming Li, Lu Sheng, Ming Sun, Junjie Yan, Wanli Ouyang

Abstract


Several variants of stochastic gradient descent (SGD) have been proposed to improve learning effectiveness and efficiency when training deep neural networks, among which some recent influential attempts adaptively control the parameter-wise learning rate (e.g., Adam and RMSProp). Although they show a large improvement in convergence speed, most adaptive learning rate methods suffer from compromised generalization compared with SGD. In this paper, we propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem), motivated by the observation that oscillations of network parameters slow down training, and we give a theoretical proof of convergence. For each parameter, AdaRem adjusts the parameter-wise learning rate according to whether the direction of that parameter's past changes is aligned with the direction of the current gradient, thus encouraging long-term consistent parameter updates with far fewer oscillations. Comprehensive experiments have been conducted to verify the effectiveness of AdaRem when training various models on a large-scale image recognition dataset such as ImageNet; they also demonstrate that our method outperforms previous adaptive learning-rate-based algorithms in terms of both training speed and test error.
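To make the abstract's description concrete, the following is a minimal sketch of the general idea, not the paper's exact algorithm: track an exponential moving average of each parameter's past updates and scale that parameter's step up when the current gradient agrees with the averaged past direction, and down when it opposes it (an oscillation). All names (`adarem_step`, `beta`, `eta`) and the specific scaling rule are illustrative assumptions rather than the published AdaRem update.

```python
# Hypothetical resilience-style per-parameter learning-rate adjustment.
import numpy as np

def adarem_step(param, grad, avg_update, lr=0.1, beta=0.9, eta=0.5):
    """One sketch update: scale the step by sign agreement with past changes.

    avg_update: running average of previous parameter changes (same shape as param).
    Returns the updated parameter and the updated running average.
    """
    # +1 where the current descent direction (-grad) agrees with the averaged
    # past update direction, -1 where it opposes it, 0 where either is zero.
    agreement = np.sign(avg_update) * np.sign(-grad)
    # Enlarge the per-parameter step on agreement, shrink it on oscillation.
    scaled_lr = lr * (1.0 + eta * agreement)
    update = -scaled_lr * grad
    new_param = param + update
    new_avg = beta * avg_update + (1.0 - beta) * update
    return new_param, new_avg

# Toy usage on the quadratic loss 0.5 * ||x||^2, whose gradient is x.
x = np.array([3.0, -2.0])
avg = np.zeros_like(x)
for _ in range(50):
    g = x                       # gradient of the toy loss
    x, avg = adarem_step(x, g, avg)
print(x)                        # close to the minimum at the origin
```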
