Paper Title
On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation
Paper Authors
Paper Abstract
A good automatic evaluation metric for language generation ideally correlates highly with human judgements of text quality. Yet, there is a dearth of such metrics, which inhibits the rapid and efficient progress of language generators. One exception is the recently proposed Mauve. In theory, Mauve measures an information-theoretic divergence between two probability distributions over strings: one representing the language generator under evaluation; the other representing the true natural language distribution. Mauve's authors argue that its success comes from the qualitative properties of their proposed divergence. Yet in practice, as this divergence is uncomputable, Mauve approximates it by measuring the divergence between multinomial distributions over clusters instead, where cluster assignments are attained by grouping strings based on a pre-trained language model's embeddings. As we show, however, this is not a tight approximation -- in either theory or practice. This begs the question: why does Mauve work so well? In this work, we show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance. In fact, classical divergences paired with its proposed cluster-based approximation may actually serve as better evaluation metrics. We finish the paper with a probing analysis; this analysis leads us to conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
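The cluster-based approximation described above can be illustrated with a minimal sketch: strings from the human corpus and the generator are embedded, the pooled embeddings are clustered, and a classical divergence (here, KL) is computed between the two corpora's multinomial distributions over clusters. This is not the paper's or Mauve's actual implementation; the random Gaussian vectors below are toy stand-ins for pre-trained language-model embeddings, and the k-means routine, cluster count, and smoothing constant are illustrative choices.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: returns a cluster assignment for each point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def cluster_histogram(labels, k, smoothing=1.0):
    """Multinomial distribution over clusters, with additive smoothing."""
    counts = np.bincount(labels, minlength=k) + smoothing
    return counts / counts.sum()

def kl_divergence(p, q):
    """Classical KL divergence between two multinomial distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy stand-ins for LM embeddings of human text and generated text.
rng = np.random.default_rng(1)
human_emb = rng.normal(0.0, 1.0, size=(200, 8))
model_emb = rng.normal(0.5, 1.0, size=(200, 8))  # shifted: imperfect generator

# Cluster the pooled embeddings, then compare per-corpus cluster histograms.
k = 10
labels = kmeans(np.vstack([human_emb, model_emb]), k)
p = cluster_histogram(labels[:200], k)   # human distribution over clusters
q = cluster_histogram(labels[200:], k)   # generator distribution over clusters
print(kl_divergence(p, q))  # >= 0; larger means further from human text
```

The key simplification the abstract points out is visible here: the uncomputable divergence between distributions over strings is replaced by a divergence between two small multinomials over cluster indices, which is trivial to compute once cluster assignments exist.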