Paper Title
Reformer: The Efficient Transformer
Paper Authors
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Paper Abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
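The abstract's first technique replaces full dot-product attention with locality-sensitive hashing, so that each position only attends within its hash bucket. Below is a minimal NumPy sketch of the angular-LSH bucketing step only; the function name `lsh_bucket_ids` and all shapes are illustrative assumptions, not the paper's actual implementation, which additionally shares hashes between queries and keys, sorts and chunks by bucket, and uses multiple hash rounds.

```python
import numpy as np

def lsh_bucket_ids(vectors, n_buckets, seed=0):
    # Angular LSH sketch: project onto a random matrix R and take the argmax
    # over the concatenation [xR, -xR], assigning each vector to one of
    # n_buckets buckets. Vectors that are close in angle are likely to share
    # a bucket, so attention can be restricted to within-bucket pairs,
    # reducing the O(L^2) cost of comparing all positions.
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))   # n_buckets assumed even
    rotated = vectors @ R                          # shape (L, n_buckets // 2)
    rotated = np.concatenate([rotated, -rotated], axis=-1)
    return np.argmax(rotated, axis=-1)             # one bucket id per position

# Toy usage: 16 positions with model dimension 8, hashed into 4 buckets.
x = np.random.default_rng(1).standard_normal((16, 8))
print(lsh_bucket_ids(x, n_buckets=4))
```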
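The second technique, reversible residual layers, lets the backward pass recompute layer inputs from layer outputs instead of storing activations for all N layers. A minimal sketch of a RevNet-style reversible block follows; the function names and the stand-in `f`/`g` functions are illustrative assumptions (in the Reformer, `f` is the attention layer and `g` the feed-forward layer).

```python
import numpy as np

def rev_block_forward(x1, x2, f, g):
    # Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1).
    # Because the inputs can be recovered exactly from the outputs,
    # activations do not need to be stored for backpropagation.
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_block_inverse(y1, y2, f, g):
    # Recover the block inputs from its outputs by undoing the two residuals.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# Toy check with simple stand-ins for attention (f) and feed-forward (g).
f = lambda v: np.tanh(v)
g = lambda v: 0.5 * v
x1, x2 = np.ones(4), np.arange(4.0)
y1, y2 = rev_block_forward(x1, x2, f, g)
r1, r2 = rev_block_inverse(y1, y2, f, g)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```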