Paper Title
Unsupervised Evaluation of Interactive Dialog with DialoGPT
Authors
Abstract
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
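The abstract states that FED scores dialog quality with DialoGPT alone, with no fine-tuning, supervision, or ground-truth response. One way such a reference-free metric can work is by measuring how likely the model finds fixed positive versus negative follow-up reactions to the dialog context. The sketch below illustrates that scoring scheme in miniature; the `log_likelihood` function, the follow-up strings, and `fed_turn_score` are illustrative assumptions (in practice the likelihood would come from DialoGPT, e.g. via a pretrained language model API), not the paper's exact formulation.

```python
# Sketch of reference-free, follow-up-likelihood scoring in the spirit of FED.
# A response is judged by how likely the model finds positive vs. negative
# reactions to it. The model is stubbed out here; a real implementation would
# query DialoGPT for the log-likelihood of the follow-up given the context.

from typing import Callable, List

def fed_turn_score(
    context: str,
    positive_followups: List[str],
    negative_followups: List[str],
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Higher score means the model finds positive reactions more probable
    (and negative ones less probable) as continuations of the context."""
    pos = sum(log_likelihood(context, f) for f in positive_followups)
    neg = sum(log_likelihood(context, f) for f in negative_followups)
    n = len(positive_followups) + len(negative_followups)
    return (pos - neg) / n

# Toy stand-in for DialoGPT: pretends positive follow-ups ("great") are
# likely after a coherent context and unlikely after a confusing one.
def toy_ll(context: str, followup: str) -> float:
    return -1.0 if ("great" in followup) == ("coherent" in context) else -5.0

good = fed_turn_score("a coherent reply", ["That is great!"],
                      ["That is confusing."], toy_ll)
bad = fed_turn_score("a confusing reply", ["That is great!"],
                     ["That is confusing."], toy_ll)
```

With the toy stub, the coherent context scores higher than the confusing one, mirroring how a likelihood-based metric can rank responses without any ground-truth reference.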