Paper Title
DICT-MLM: Improved Multilingual Pre-Training using Bilingual Dictionaries
Paper Authors
Paper Abstract
Pre-trained multilingual language models such as mBERT have shown immense gains for several natural language processing (NLP) tasks, especially in the zero-shot cross-lingual setting. Most, if not all, of these pre-trained models rely on the masked-language modeling (MLM) objective as the key language learning objective. The principle behind these approaches is that predicting the masked words with the help of the surrounding text helps learn potent contextualized representations. Despite the strong representation learning capability enabled by MLM, we demonstrate an inherent limitation of MLM for multilingual representation learning. In particular, by requiring the model to predict the language-specific token, the MLM objective disincentivizes learning a language-agnostic representation, which is a key goal of multilingual pre-training. Therefore, to encourage better cross-lingual representation learning, we propose the DICT-MLM method. DICT-MLM works by incentivizing the model to predict not just the original masked word, but potentially any of its cross-lingual synonyms as well. Our empirical analysis on multiple downstream tasks spanning 30+ languages demonstrates the efficacy of the proposed approach and its ability to learn better multilingual representations.
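To make the objective described above concrete, the following is a minimal PyTorch sketch of one way a DICT-MLM-style loss could be realized: at each masked position the model is credited for placing probability on the original token or on any of its bilingual-dictionary translations, here by marginalizing the probability mass over the candidate set. The function name `dict_mlm_loss`, the dictionary format, and the logsumexp (marginalized-target) formulation are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a DICT-MLM-style training target (not the paper's exact loss).
# Assumption: for each masked position we accept the original token OR any of its
# bilingual-dictionary translations as a correct prediction, and marginalize the
# model's probability mass over that candidate set.

import torch
import torch.nn.functional as F


def dict_mlm_loss(logits, masked_token_ids, bilingual_dict):
    """
    logits:            (num_masked, vocab_size) scores at masked positions
    masked_token_ids:  (num_masked,) original token ids that were masked out
    bilingual_dict:    dict[int, list[int]] mapping a token id to the ids of its
                       cross-lingual synonyms (hypothetical lookup table)
    """
    log_probs = F.log_softmax(logits, dim=-1)  # (num_masked, vocab_size)
    losses = []
    for i, tok in enumerate(masked_token_ids.tolist()):
        # Candidate targets: the original word plus any dictionary translations.
        candidates = torch.tensor([tok] + bilingual_dict.get(tok, []), dtype=torch.long)
        # Negative log of the total probability assigned to any acceptable target.
        losses.append(-torch.logsumexp(log_probs[i, candidates], dim=0))
    return torch.stack(losses).mean()


# Toy usage with a hypothetical 10-word shared vocabulary.
if __name__ == "__main__":
    vocab_size = 10
    logits = torch.randn(3, vocab_size, requires_grad=True)
    masked = torch.tensor([2, 5, 7])
    # e.g. token 2 ("dog") has dictionary translations 4 ("perro") and 9 ("chien")
    dictionary = {2: [4, 9], 5: [1]}
    loss = dict_mlm_loss(logits, masked, dictionary)
    loss.backward()
    print(float(loss))
```

Because any token in the candidate set can satisfy the objective, the model is no longer forced to commit to the language-specific surface form, which is the intuition behind learning more language-agnostic representations.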