Paper Title
Recurrent Memory Transformer
Paper Authors
Paper Abstract
Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (RMT). Memory allows the model to store and process local and global information, and to pass information between segments of a long sequence with the help of recurrence. We implement the memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The model is then trained to control both memory operations and sequence representation processing. Experimental results show that RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Tr-XL improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and general-purpose processing in memory, such as algorithmic tasks and reasoning.
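The memory mechanism described in the abstract can be illustrated with a short sketch: memory tokens are concatenated to each segment, an unmodified Transformer encoder processes the augmented segment, and the updated memory tokens read from the output are carried over to the next segment. The following is a minimal PyTorch sketch under assumed settings; the names (RecurrentMemoryBlock, num_mem, d_model) are illustrative and this is not the paper's reference implementation.

```python
# Minimal sketch of segment-level recurrence with memory tokens (illustrative, not the
# authors' code): memory tokens are added around each segment, a plain Transformer
# encoder processes it, and the updated memory is passed to the next segment.
import torch
import torch.nn as nn


class RecurrentMemoryBlock(nn.Module):
    def __init__(self, d_model: int = 128, num_mem: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Learned initial memory tokens, shared across sequences.
        self.mem_init = nn.Parameter(torch.randn(1, num_mem, d_model))
        self.num_mem = num_mem

    def forward(self, segments: list[torch.Tensor]) -> list[torch.Tensor]:
        """Process a list of [batch, seg_len, d_model] segments with memory recurrence."""
        batch = segments[0].size(0)
        memory = self.mem_init.expand(batch, -1, -1)
        outputs = []
        for seg in segments:
            # Concatenate read memory before and write memory after the segment;
            # the Transformer itself is unchanged.
            x = torch.cat([memory, seg, memory], dim=1)
            y = self.encoder(x)
            # Updated memory comes from the write positions and is recurrently
            # passed to the next segment.
            memory = y[:, -self.num_mem:, :]
            outputs.append(y[:, self.num_mem:-self.num_mem, :])
        return outputs


# Usage: split a long sequence of embeddings into segments and process them recurrently.
if __name__ == "__main__":
    model = RecurrentMemoryBlock()
    long_seq = torch.randn(2, 96, 128)          # [batch, total_len, d_model]
    segments = list(long_seq.split(32, dim=1))  # three segments of length 32
    outs = model(segments)
    print([o.shape for o in outs])
```

Because the memory is the only state carried between segments, the effective context grows with the number of segments while the per-segment attention cost stays fixed; training across many segments would additionally require backpropagation through time or detaching the memory, which this sketch omits.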