Paper Title

Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search)

Paper Authors

Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

Paper Abstract

Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data. In this setting, we prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate. When interpolation is only approximately satisfied, constant step-size AMSGrad converges to a neighbourhood of the solution at the same rate, while AdaGrad is robust to the violation of interpolation. However, even for simple convex problems satisfying interpolation, the empirical performance of both methods heavily depends on the step-size and requires tuning, questioning their adaptivity. We alleviate this problem by automatically determining the step-size using stochastic line-search or Polyak step-sizes. With these techniques, we prove that both AdaGrad and AMSGrad retain their convergence guarantees, without needing to know problem-dependent constants. Empirically, we demonstrate that these techniques improve the convergence and generalization of adaptive gradient methods across tasks, from binary classification with kernel mappings to multi-class classification with deep networks.
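
The abstract combines an AdaGrad-style diagonal preconditioner with a stochastic Armijo line-search to remove step-size tuning. The sketch below is a minimal illustration of that idea on an over-parameterized least-squares problem where interpolation holds exactly. It is not the authors' implementation; the problem setup, the constants (c, eta_max, the backtracking factor), and all names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: AdaGrad-preconditioned SGD with a stochastic Armijo
# line-search on an interpolating least-squares problem. Illustrative only.

rng = np.random.default_rng(0)

# Over-parameterized problem: more features than samples, so a
# zero-loss interpolating solution exists.
n, d = 50, 200
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true                      # noiseless targets => interpolation holds

def loss_i(w, i):
    r = X[i] @ w - y[i]
    return 0.5 * r ** 2

def grad_i(w, i):
    r = X[i] @ w - y[i]
    return r * X[i]

w = np.zeros(d)
G = np.zeros(d)                     # running sum of squared gradients (AdaGrad)
eps, c, eta_max = 1e-8, 0.5, 10.0   # assumed constants, chosen for illustration

for t in range(2000):
    i = rng.integers(n)
    g = grad_i(w, i)
    G += g ** 2
    precond = 1.0 / (np.sqrt(G) + eps)   # diagonal AdaGrad preconditioner
    direction = precond * g

    # Stochastic Armijo line-search: backtrack on the *same* sample
    # until the sufficient-decrease condition holds for its loss.
    eta = eta_max
    f_i = loss_i(w, i)
    while loss_i(w - eta * direction, i) > f_i - c * eta * (g @ direction):
        eta *= 0.5
        if eta < 1e-10:
            break
    w = w - eta * direction

print("final training loss:", np.mean([loss_i(w, i) for i in range(n)]))
```

Under exact interpolation every per-sample loss is zero at a shared minimizer, which is why backtracking on the sampled loss alone can stand in for a hand-tuned global step-size. The Polyak alternative mentioned in the abstract would instead set the step-size in closed form from the sampled loss and gradient (roughly f_i(w) / ||grad_i(w)||^2 when the per-sample minimum is zero), avoiding the backtracking loop.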
