Paper title
A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing
Authors
Abstract
We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional diverse references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language directions of the WMT19 metrics task (at both the system and sentence level) show that using paraphrased references does generally improve BLEU, and when it does, the more diverse the better. However, we also show that better results could be achieved if those paraphrases were to specifically target the parts of the space most relevant to the MT outputs being evaluated. Moreover, the gains remain slight even when human paraphrases are used, suggesting inherent limitations to BLEU's capacity to correctly exploit multiple references. Surprisingly, we also find that adequacy appears to be less important, as shown by the high results of a strong sampling approach, which even beats human paraphrases when used with sentence-level BLEU.
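To illustrate why extra references can widen coverage, here is a minimal sketch of BLEU's clipped (modified) n-gram precision with several references. It is not the paper's evaluation code: the function names and the three example sentences are invented for illustration, tokenization is plain whitespace splitting, and the brevity penalty and the geometric mean over n-gram orders are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Clipped n-gram precision of a hypothesis against one or more references.

    Each hypothesis n-gram is credited at most as many times as it occurs
    in the single reference that contains it most often.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical sentences, chosen only to make the effect visible.
hypothesis = "the cat is sitting on the mat"
original = "the cat sat on the mat"
paraphrase = "a cat is sitting on the rug"

single = clipped_precision(hypothesis, [original], 2)
multi = clipped_precision(hypothesis, [original, paraphrase], 2)
print(single, multi)
```

In this toy case the paraphrase supplies bigrams such as "is sitting" that the original reference lacks, so the bigram precision rises. Whether such gains carry over to correlation with human judgments is the empirical question the abstract addresses.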