Paper Title

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Paper Authors

Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Paper Abstract

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques including sparse and linear attention and hashing tricks, efficient transformers have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrates that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.
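
To make the abstract's core idea concrete, below is a minimal sketch of causal linear attention whose key-value state update carries a heavy-ball-style momentum term. It is an illustration of the idea described above (momentum added to the linear-attention recurrence), not the paper's exact algorithm; the feature map, the momentum coefficient `beta`, and the function names are assumptions for this sketch, and the paper's adaptive-momentum variant would replace the fixed `beta` with a value derived from the optimal momentum for quadratic optimization.

```python
import numpy as np

def elu_feature_map(x):
    # Positive feature map often used in linear attention, elu(x) + 1
    # (assumption: the paper may use a different feature map).
    return np.where(x > 0.0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention_with_momentum(Q, K, V, beta=0.6, eps=1e-6):
    """Causal linear attention with a momentum term on the state update.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    beta is a hypothetical fixed momentum coefficient; with beta = 0 the
    recurrence reduces to standard causal linear attention.
    """
    L, d_k = Q.shape
    d_v = V.shape[1]
    phi_q, phi_k = elu_feature_map(Q), elu_feature_map(K)

    S = np.zeros((d_k, d_v))   # running key-value state
    M = np.zeros((d_k, d_v))   # momentum of the state update
    z = np.zeros(d_k)          # running normalizer
    out = np.zeros((L, d_v))
    for i in range(L):
        grad = np.outer(phi_k[i], V[i])  # increment, viewed as a gradient step
        M = beta * M + grad              # heavy-ball momentum accumulation
        S = S + M                        # state update driven by momentum
        z = z + phi_k[i]
        out[i] = (phi_q[i] @ S) / (phi_q[i] @ z + eps)
    return out
```

The loop keeps only fixed-size state (S, M, z) and touches each position once, so memory and compute stay linear in the sequence length, consistent with the complexity claim in the abstract.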
