Paper title
AR: Auto-Repair the Synthetic Data for Neural Machine Translation
Paper authors
Paper abstract
Many studies have shown that, compared with training on limited authentic parallel data alone, incorporating synthetic parallel data generated by back translation (BT) or forward translation (FT, also called self-training) into NMT training can significantly improve translation quality. However, a well-known shortcoming is that synthetic parallel data is noisy, because it is generated by an imperfect NMT system. As a result, the improvements in translation quality brought by synthetic parallel data are greatly diminished. In this paper, we propose a novel Auto-Repair (AR) framework to improve the quality of synthetic data. The proposed AR model learns the transformation from a low-quality (noisy) input sentence to a high-quality sentence, using large-scale monolingual data together with BT and FT techniques. The AR model removes much of the noise in the synthetic parallel data, and the repaired synthetic parallel data then helps NMT models achieve larger improvements. Experimental results show that our approach effectively improves the quality of synthetic parallel data, and that NMT models trained with the repaired synthetic data achieve consistent improvements on both the WMT14 EN→DE and IWSLT14 DE→EN translation tasks.
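The abstract does not spell out how the (noisy, clean) training pairs for the AR model are built. One plausible reading, shown here purely as an illustrative sketch, is a round trip: a clean monolingual sentence is back-translated, the result is forward-translated again, and the degraded output is paired with the original sentence. The function name `build_ar_pairs`, the `word_map` helper, and the toy lexicon "translators" are all hypothetical stand-ins for trained NMT systems and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def word_map(lexicon: Dict[str, str]) -> Translator:
    """Toy word-by-word 'translator' (hypothetical stand-in for an NMT model)."""
    return lambda sentence: " ".join(lexicon.get(w, w) for w in sentence.split())


def build_ar_pairs(
    monolingual: List[str],
    back_translate: Translator,
    forward_translate: Translator,
) -> List[Tuple[str, str]]:
    """Assumed round-trip scheme: return (noisy, clean) pairs for a repair model."""
    pairs = []
    for clean in monolingual:
        pivot = back_translate(clean)     # BT: clean sentence -> other language
        noisy = forward_translate(pivot)  # FT: back into the original language
        pairs.append((noisy, clean))      # repair model input -> desired output
    return pairs


# Deliberately lossy toy lexicons, so the round trip introduces an error.
EN_TO_DE = {"the": "die", "cat": "katze", "sleeps": "schlaeft"}
DE_TO_EN = {"die": "the", "katze": "cat", "schlaeft": "sleep"}

pairs = build_ar_pairs(["the cat sleeps"], word_map(EN_TO_DE), word_map(DE_TO_EN))
print(pairs)  # [('the cat sleep', 'the cat sleeps')]
```

Under this assumption, a sequence-to-sequence model trained on such pairs would map the noisy side to the clean side, and could then be applied to BT or FT output before that output is used for NMT training.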