Paper Title
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
Paper Authors
Paper Abstract
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. By sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well in high-resource conditions, achieving state-of-the-art performance on a German-English word alignment task.
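The abstract describes the architecture concretely enough to sketch. Below is a minimal PyTorch sketch, not the authors' released code, of the setup as stated: word embeddings are combined with subword (here, character-level) embeddings, and a single LSTM encoder-decoder, shared across languages, decodes twice from the same encoded state, once to translate and once to reconstruct the input. All class names, dimensions, and the choice to sum word and character embeddings are illustrative assumptions.

```python
# Minimal sketch of a shared LSTM encoder-decoder that jointly translates
# and reconstructs, with combined word + subword embeddings.
# Names and dimensions are hypothetical, not from the paper.
import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    def __init__(self, vocab_size, char_vocab_size, dim=256):
        super().__init__()
        # Word and subword (character) embeddings are summed so that
        # orthographically similar words across languages can share signal.
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.char_emb = nn.Embedding(char_vocab_size, dim)
        self.char_rnn = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        # One encoder and one decoder shared by all languages: joint
        # training places the embeddings in a common cross-lingual space.
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def embed(self, words, chars):
        # words: (batch, seq); chars: (batch * seq, max_word_len)
        w = self.word_emb(words)
        # Final bidirectional states summarise each word's characters.
        _, (h, _) = self.char_rnn(self.char_emb(chars))
        c = torch.cat([h[0], h[1]], dim=-1).view(*words.shape, -1)
        return w + c  # combined word + subword representation

    def forward(self, src_words, src_chars, tgt_words, tgt_chars):
        # Encode the source sentence once...
        _, state = self.encoder(self.embed(src_words, src_chars))
        # ...then decode twice from the same state with the same decoder:
        # once conditioned on target-side inputs (translation) and once on
        # the source itself (reconstruction).
        trans, _ = self.decoder(self.embed(tgt_words, tgt_chars), state)
        recon, _ = self.decoder(self.embed(src_words, src_chars), state)
        return self.out(trans), self.out(recon)
```

In this sketch a single vocabulary is shared across both languages; the decode-twice pattern is what gives the model both a translation and an autoencoding objective, matching the abstract's claim that the embeddings are trained jointly in one cross-lingual space.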