Paper Title
N-Grammer: Augmenting Transformers with latent n-grams
Paper Authors
Paper Abstract
Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct, there has been significant recent interest and investment in scaling them. However, the training and inference costs of these large Transformer language models are prohibitive, necessitating further research into more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture, inspired by the statistical language modeling literature: we augment the model with n-grams constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer. We open-source our model in JAX for reproducibility.
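
The following is a minimal, hypothetical JAX sketch of the idea the abstract describes: token embeddings are mapped to discrete latent IDs by nearest-centroid quantization, consecutive IDs are combined into bigram IDs, bigram embeddings are looked up in a hashed table, and the result is fused with the token embeddings. All names, shapes, the hash function, and the fusion scheme below are illustrative assumptions, not the paper's exact formulation (the paper builds its discrete latents with product quantization over multiple heads).

import jax
import jax.numpy as jnp

def latent_ngram_augment(token_emb, centroids, ngram_table, table_size):
    # token_emb: [seq, d], centroids: [k, d], ngram_table: [table_size, d_ngram]
    # (1) Nearest-centroid quantization -> one discrete latent ID per token.
    dists = jnp.sum((token_emb[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    ids = jnp.argmin(dists, axis=-1)  # [seq]
    # (2) Bigram IDs from each token's latent ID and its predecessor's
    #     (position 0 reuses its own ID, an arbitrary padding choice).
    prev_ids = jnp.concatenate([ids[:1], ids[:-1]])
    bigram_ids = ids * centroids.shape[0] + prev_ids  # [seq]
    # (3) Hash bigram IDs into a fixed-size embedding table
    #     (multiplicative hash with an arbitrary prime; an assumption).
    hashed = (bigram_ids * 1000003) % table_size
    ngram_emb = ngram_table[hashed]  # [seq, d_ngram]
    # (4) Layer-normalize both streams and concatenate them.
    def layer_norm(x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / jnp.sqrt(var + 1e-6)
    return jnp.concatenate([layer_norm(token_emb), layer_norm(ngram_emb)], axis=-1)

# Toy usage: 8 tokens, 16-dim embeddings, 32 centroids, 8-dim n-gram embeddings.
emb = jax.random.normal(jax.random.PRNGKey(0), (8, 16))
cent = jax.random.normal(jax.random.PRNGKey(1), (32, 16))
table = jax.random.normal(jax.random.PRNGKey(2), (1024, 8))
out = latent_ngram_augment(emb, cent, table, 1024)
print(out.shape)  # (8, 24)

Concatenating the normalized n-gram embeddings with the token embeddings, rather than replacing them, keeps the original contextual signal intact while adding the sparse n-gram features; the downstream Transformer layers can then learn how much to rely on each stream.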