Paper Title

Understanding Sparse Feature Updates in Deep Networks using Iterative Linearisation

Paper Authors

Adrian Goldwaser, Hong Ge

Paper Abstract

Larger and deeper networks generalise well despite their increased capacity to overfit. Understanding why this happens is theoretically and practically important. One recent approach looks at the infinitely wide limits of such networks and their corresponding kernels. However, these theoretical tools cannot fully explain finite networks as the empirical kernel changes significantly during gradient-descent-based training in contrast to infinite networks. In this work, we derive an iterative linearised training method as a novel empirical tool to further investigate this distinction, allowing us to control for sparse (i.e. infrequent) feature updates and quantify the frequency of feature learning needed to achieve comparable performance. We justify iterative linearisation as an interpolation between a finite analog of the infinite width regime, which does not learn features, and standard gradient descent training, which does. Informally, we also show that it is analogous to a damped version of the Gauss-Newton algorithm -- a second-order method. We show that in a variety of cases, iterative linearised training surprisingly performs on par with standard training, noting in particular how much less frequent feature learning is required to achieve comparable performance. We also show that feature learning is essential for good performance. Since such feature learning inevitably causes changes in the NTK kernel, we provide direct negative evidence for the NTK theory, which states the NTK kernel remains constant during training.
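To make the idea concrete, below is a minimal sketch of iterative linearised training, not the authors' reference implementation. It assumes a JAX setting where `f(params, x)` is the network forward pass on a pytree of parameters; the helper names (`linearise`, `iterative_linearised_training`), the squared-error loss, the learning rate `lr`, and the re-linearisation interval `K` are illustrative choices. Between feature updates, the model trained is the first-order Taylor expansion of the network around the most recent linearisation point, so its empirical kernel is fixed; `K = 1` recovers standard gradient descent, while never re-linearising keeps the initial kernel fixed, mimicking the lazy infinite-width regime.

```python
# Sketch of iterative linearisation (illustrative, hypothetical setup).
import jax
import jax.numpy as jnp


def linearise(f, params0):
    """First-order Taylor expansion of f around the parameters params0."""
    def f_lin(params, x):
        # Direction of the expansion: current parameters minus the anchor.
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        # jax.jvp returns f(params0, x) and the Jacobian-vector product J @ delta.
        y0, jvp_out = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return y0 + jvp_out
    return f_lin


def iterative_linearised_training(f, params, data, lr=1e-2, K=100, steps=1000):
    """Train the linearised model, re-linearising (updating features) every K steps."""
    f_lin = linearise(f, params)                  # features frozen at initialisation
    for t in range(steps):
        if t > 0 and t % K == 0:
            f_lin = linearise(f, params)          # sparse feature update
        x, y = data[t % len(data)]
        loss_fn = lambda p: jnp.mean((f_lin(p, x) - y) ** 2)
        grads = jax.grad(loss_fn)(params)
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params
```

Larger `K` means sparser feature updates: the same gradient descent dynamics are run, but the features (and hence the empirical NTK) are only refreshed at the re-linearisation points.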
