Paper Title
TensorCoder: Dimension-Wise Attention via Tensor Representation for Natural Language Modeling
Paper Authors
Paper Abstract
Transformer has been widely used in many Natural Language Processing (NLP) tasks, and the scaled dot-product attention between tokens is a core module of Transformer. This attention is a token-wise design whose complexity is quadratic in the sequence length, which limits its applicability to long-sequence tasks. In this paper, we propose a dimension-wise attention mechanism, based on which a novel language modeling approach (namely TensorCoder) can be developed. The dimension-wise attention reduces the attention complexity from the original $O(N^2d)$ to $O(Nd^2)$, where $N$ is the length of the sequence and $d$ is the dimensionality of each head. We verify TensorCoder on two tasks: masked language modeling and neural machine translation. Compared with the original Transformer, TensorCoder not only greatly reduces the computation of the original model but also obtains improved performance on the masked language modeling task (on the PTB dataset) and comparable performance on the machine translation task.
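To make the complexity claim concrete, the following is a minimal NumPy sketch contrasting token-wise scaled dot-product attention (an $N \times N$ score matrix, hence $O(N^2d)$) with a dimension-wise variant that forms a $d \times d$ score matrix over feature dimensions, hence $O(Nd^2)$. The `dimension_wise_attention` function is an illustrative assumption based on the stated complexity, not necessarily the exact TensorCoder formulation described in the paper.

```python
# Sketch comparing the cost of token-wise vs. dimension-wise attention.
# Assumption: dimension-wise attention is modeled here as softmax over a
# d x d (Q^T K) score matrix; the actual TensorCoder design may differ.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_wise_attention(Q, K, V):
    """Scaled dot-product attention: scores are N x N, so cost is O(N^2 d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (N, N)
    return softmax(scores, axis=-1) @ V  # (N, d)

def dimension_wise_attention(Q, K, V):
    """Attention over feature dimensions: scores are d x d, so cost is O(N d^2)."""
    N = Q.shape[0]
    scores = Q.T @ K / np.sqrt(N)        # (d, d)
    return V @ softmax(scores, axis=-1)  # (N, d)

if __name__ == "__main__":
    N, d = 512, 64  # sequence length, per-head dimensionality
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    print(token_wise_attention(Q, K, V).shape)      # (512, 64)
    print(dimension_wise_attention(Q, K, V).shape)  # (512, 64)
```

Both functions return an $N \times d$ output, but the dimension-wise version never materializes an $N \times N$ matrix, which is why its cost grows linearly rather than quadratically with sequence length.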