Paper Title

On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning

Authors

Mahsa Yousefi, Angeles Martinez

Abstract

While first-order methods are popular for solving optimization problems that arise in large-scale deep learning, they come with some acute deficiencies. To diminish such shortcomings, there has been recent interest in applying second-order methods such as quasi-Newton-based methods, which construct Hessian approximations using only gradient information. The main focus of our work is to study the behaviour of stochastic quasi-Newton algorithms for training deep neural networks. We have analyzed the performance of two well-known quasi-Newton updates, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the Symmetric Rank-One (SR1). This study fills a gap concerning the real performance of both updates and analyzes whether more efficient training is obtained when using the more robust BFGS update or the cheaper SR1 formula, which allows for indefinite Hessian approximations and thus can potentially help to better navigate the pathological saddle points present in the non-convex loss functions found in deep learning. We present and discuss the results of an extensive experimental study which includes the effect of batch normalization and network architecture, the limited memory parameter, the batch size, and the type of sampling strategy. We show that stochastic quasi-Newton optimizers are efficient and able, in some instances, to outperform the well-known first-order Adam optimizer run with the optimal combination of its numerous hyperparameters.
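For readers who want the updates the abstract contrasts, the following is a minimal reference sketch (not part of the paper's abstract) of the standard BFGS and SR1 secant formulas in the usual notation, where $s_k$ denotes the iterate step and $y_k$ the corresponding gradient difference:

```latex
% Standard quasi-Newton secant updates (reference sketch, standard notation;
% the limited-memory variants studied in the paper store only a few recent pairs):
%   s_k = w_{k+1} - w_k,   y_k = \nabla f(w_{k+1}) - \nabla f(w_k)
\[
  \text{BFGS:}\quad
  B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k}
                + \frac{y_k y_k^{\top}}{y_k^{\top} s_k},
  \qquad
  \text{SR1:}\quad
  B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^{\top}}
                       {(y_k - B_k s_k)^{\top} s_k}.
\]
```

The BFGS update preserves positive definiteness of $B_{k+1}$ whenever $y_k^{\top} s_k > 0$, whereas SR1 can produce indefinite approximations; this is the trade-off between robustness and the ability to model saddle-point curvature that the abstract refers to.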
