Paper Title
Unsupervised Evaluation of Interactive Dialog with DialoGPT
Authors
Abstract
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
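The abstract states that FED scores dialog quality with DialoGPT alone, with no fine-tuning, supervision, or ground-truth response. One way such a reference-free metric can work is by measuring how likely the model finds fixed positive versus negative follow-up reactions to the dialog context. The sketch below illustrates that scoring scheme in miniature; the `log_likelihood` function, the follow-up strings, and `fed_turn_score` are illustrative assumptions (in practice the likelihood would come from DialoGPT, e.g. via a pretrained language model API), not the paper's exact formulation.

```python
# Sketch of reference-free, follow-up-likelihood scoring in the spirit of FED.
# A response is judged by how likely the model finds positive vs. negative
# reactions to it. The model is stubbed out here; a real implementation would
# query DialoGPT for the log-likelihood of the follow-up given the context.

from typing import Callable, List

def fed_turn_score(
    context: str,
    positive_followups: List[str],
    negative_followups: List[str],
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Higher score means the model finds positive reactions more probable
    (and negative ones less probable) as continuations of the context."""
    pos = sum(log_likelihood(context, f) for f in positive_followups)
    neg = sum(log_likelihood(context, f) for f in negative_followups)
    n = len(positive_followups) + len(negative_followups)
    return (pos - neg) / n

# Toy stand-in for DialoGPT: pretends positive follow-ups ("great") are
# likely after a coherent context and unlikely after a confusing one.
def toy_ll(context: str, followup: str) -> float:
    return -1.0 if ("great" in followup) == ("coherent" in context) else -5.0

good = fed_turn_score("a coherent reply", ["That is great!"],
                      ["That is confusing."], toy_ll)
bad = fed_turn_score("a confusing reply", ["That is great!"],
                     ["That is confusing."], toy_ll)
```

With the toy stub, the coherent context scores higher than the confusing one, mirroring how a likelihood-based metric can rank responses without any ground-truth reference.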