Paper title
On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs
Paper authors
Paper abstract
Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training DNNs. SGN is a second-order optimization method with efficient iterations that, as we demonstrate, often requires substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix and, in order to compute an approximate search direction, relies on the conjugate gradient method combined with forward and reverse automatic differentiation. Despite the success of SGD and its first-order variants, and despite Hessian-free methods based on the Gauss-Newton Hessian approximation having already been proposed as practical methods for training DNNs, we believe that SGN has considerable and not yet fully demonstrated potential in big mini-batch scenarios. For this setting, we demonstrate that SGN not only substantially improves over SGD in terms of the number of iterations, but also in terms of runtime. This is made possible by an efficient, easy-to-use, and flexible implementation of SGN that we provide in the Theano deep learning platform, which, unlike Tensorflow and Pytorch, supports forward automatic differentiation. This enables researchers to further study and improve this promising optimization technique and hopefully reconsider stochastic second-order methods as competitive optimization techniques for training DNNs; we also hope that the promise of SGN may lead to forward automatic differentiation being added to Tensorflow or Pytorch. Our results also show that in big mini-batch scenarios SGN is more robust than SGD with respect to its hyperparameters (we never had to tune its step-size for our benchmarks!), which eases the expensive process of hyperparameter tuning that is otherwise crucial for the performance of first-order methods.
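The paper's implementation is in Theano; as a rough illustration of the mechanics the abstract describes (a Gauss-Newton approximation of the Hessian, a conjugate-gradient solve, and forward plus reverse automatic differentiation), here is a minimal sketch written in JAX, whose jax.jvp and jax.vjp supply the forward- and reverse-mode primitives. This is not the authors' code: every name below (net, loss, ggn_vec, cg_solve, the damping lam) is an illustrative assumption, not something taken from the paper.

import jax
import jax.numpy as jnp

def net(w, x):
    # Toy one-layer network standing in for an arbitrary DNN f(w, x).
    return jnp.tanh(x @ w)

def loss(z, y):
    # Loss as a function of the network output z (mean squared error here).
    return 0.5 * jnp.mean((z - y) ** 2)

def ggn_vec(w, x, y, v):
    # Generalized Gauss-Newton product G v = J^T H_L (J v):
    # forward AD (jvp) gives J v, reverse AD (vjp) gives u -> J^T u.
    f = lambda w_: net(w_, x)
    z, jv = jax.jvp(f, (w,), (v,))
    _, h_jv = jax.jvp(lambda z_: jax.grad(loss)(z_, y), (z,), (jv,))
    _, vjp_fn = jax.vjp(f, w)
    return vjp_fn(h_jv)[0]

def cg_solve(matvec, b, iters=10):
    # Plain conjugate gradient for the linear system matvec(p) = b.
    p = jnp.zeros_like(b)
    r = b - matvec(p)
    d = r
    for _ in range(iters):
        Ad = matvec(d)
        alpha = jnp.vdot(r, r) / jnp.vdot(d, Ad)
        p = p + alpha * d
        r_new = r - alpha * Ad
        d = r_new + (jnp.vdot(r_new, r_new) / jnp.vdot(r, r)) * d
        r = r_new
    return p

# One illustrative SGN-style step on a mini-batch (x, y):
# solve (G + lam * I) p = -gradient with CG, then update the weights.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
w = jax.random.normal(k1, (3, 2))
x = jax.random.normal(k2, (8, 3))
y = jax.random.normal(k3, (8, 2))
g = jax.grad(lambda w_: loss(net(w_, x), y))(w)
lam = 1e-3  # damping value chosen arbitrarily for this sketch
p = cg_solve(lambda v: ggn_vec(w, x, y, v) + lam * v, -g)
w = w + p

The point of the sketch is that each CG iteration needs only one Gauss-Newton matrix-vector product, which costs one forward-mode and one reverse-mode sweep; the Gauss-Newton matrix itself is never formed.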