Paper Title
Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases
Paper Authors
Paper Abstract
Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. A sufficient amount of annotated data is available for high-resource languages such as English or Chinese. For less popular languages, multilingual models have to be used, which offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks. We evaluate our method on eight linguistic tasks in Polish, comparing it with the best available multilingual sentence encoders.
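To make the described architecture concrete, below is a minimal sketch of a Transformer encoder combined with a recurrent pooling layer that condenses per-token representations into a single sentence vector. The abstract does not specify the recurrent cell, hidden size, or base checkpoint; the LSTM pooler and the allegro/herbert-base-cased Polish model used here are illustrative assumptions, not details taken from the paper.

```python
# Sketch: Transformer language model with an additional recurrent pooling
# layer producing fixed-size sentence embeddings. Hyperparameters and the
# base checkpoint are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RecurrentPoolingEncoder(nn.Module):
    def __init__(self, model_name: str = "allegro/herbert-base-cased",
                 pooler_hidden_size: int = 768):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        # Recurrent pooling: an LSTM reads the token representations and its
        # final hidden state serves as the sentence embedding. (Handling of
        # padded positions is omitted here for brevity.)
        self.pooler = nn.LSTM(
            input_size=self.transformer.config.hidden_size,
            hidden_size=pooler_hidden_size,
            batch_first=True,
        )

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        token_states = self.transformer(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                     # (batch, seq_len, hidden)
        _, (h_n, _) = self.pooler(token_states)
        return h_n[-1]                          # (batch, pooler_hidden_size)

# Usage: embed two (hypothetical) Polish paraphrases.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
encoder = RecurrentPoolingEncoder()
batch = tokenizer(["Ala ma kota.", "Kot należy do Ali."],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(batch["input_ids"], batch["attention_mask"])
print(embeddings.shape)  # torch.Size([2, 768])
```

In a training setup such as the one the abstract describes, pairs of automatically mined paraphrases would be encoded this way and the encoder fine-tuned so that paraphrases receive similar embeddings (e.g. via a contrastive or similarity loss); the specific objective is not stated in the abstract.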