Paper Title

Exploring Diversity in Back Translation for Low-Resource Machine Translation

Authors

Laurie Burchell, Alexandra Birch, Kenneth Heafield

Abstract

Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations. We argue that the definitions and metrics used to quantify 'diversity' in previous work have been insufficient. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity. We present novel metrics for measuring these different aspects of diversity and carry out empirical analysis into the effect of these types of diversity on final neural machine translation model performance for low-resource English$\leftrightarrow$Turkish and mid-resource English$\leftrightarrow$Icelandic. Our findings show that generating back translation using nucleus sampling results in higher final model performance, and that this method of generation has high levels of both lexical and syntactic diversity. We also find evidence that lexical diversity is more important than syntactic for back translation performance.
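The abstract names nucleus sampling as the generation method that gave the best final model performance. As a rough illustration of the general technique only (not the authors' implementation), the sketch below applies nucleus (top-p) sampling to a single toy next-token distribution. The function names `nucleus_filter` and `nucleus_sample`, the `top_p` value, and the toy vocabulary are all assumptions made for this example.

```python
import random


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalise so the kept mass sums to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return {token: p / total for token, p in nucleus}


def nucleus_sample(probs, top_p=0.9, rng=random):
    """Draw one token from the renormalised nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only.
toy_distribution = {"house": 0.5, "home": 0.3, "building": 0.15, "hut": 0.05}
kept = nucleus_filter(toy_distribution, top_p=0.9)
sampled = nucleus_sample(toy_distribution, top_p=0.9, rng=random.Random(0))
```

Because the low-probability tail is cut off before sampling, the output varies from draw to draw without including the least likely tokens. In a real back-translation setup this step would be applied at every decoding position by the translation toolkit.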
