Paper Title
The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions
Paper Authors
Paper Abstract
We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability and what are some potential solutions? For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve. We also observe lower-than-expected correlations between the analysis validation set and standard validation set, questioning the effectiveness of the current model-selection routine. Next, to answer the second question, we give both theoretical explanations and empirical evidence regarding the source of the instability, demonstrating that the instability mainly comes from high inter-example correlations within analysis sets. Finally, for the third question, we discuss an initial attempt to mitigate the instability and suggest guidelines for future work such as reporting the decomposed variance for more interpretable results and fair comparison across models. Our code is publicly available at: https://github.com/owenzx/InstabilityAnalysis
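
A note on the claimed source of the instability: that high inter-example correlations inflate the variance of set-level accuracy follows from a standard variance identity (this illustration is ours, not a derivation quoted from the paper). If $a_i$ denotes the run-dependent correctness of example $i$ and $\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$ is the set-level accuracy, then

    \mathrm{Var}(\bar{a}) = \frac{1}{n^2}\left[\sum_{i=1}^{n} \mathrm{Var}(a_i) + \sum_{i \neq j} \mathrm{Cov}(a_i, a_j)\right].

With independent examples only the first term remains and shrinks as $O(1/n)$; with an average positive pairwise covariance, the second term stays on the order of that covariance regardless of $n$, so even a large analysis set can yield unstable accuracy when its examples are strongly correlated.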
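The abstract also recommends reporting decomposed variance. Below is a minimal Python sketch of one plausible decomposition, splitting the total variance of analysis-set accuracy into an across-seed part and an along-training part via the law of total variance; the function name, array layout, and this particular split are illustrative assumptions, not the paper's released code.

    import numpy as np

    def decompose_variance(acc):
        # acc: array of shape (num_seeds, num_checkpoints); acc[s, c] is the
        # analysis-set accuracy of the run with seed s at checkpoint c.
        # (Hypothetical layout, chosen for illustration.)
        run_means = acc.mean(axis=1)          # mean accuracy of each run
        between_seed = run_means.var()        # variance across seeds
        within_run = acc.var(axis=1).mean()   # avg variance along training
        total = acc.var()                     # population variance (ddof=0)
        # Law of total variance; exact here because every run contributes
        # the same number of checkpoints: total == between_seed + within_run.
        return between_seed, within_run, total

    # Toy usage: 5 seeds x 10 checkpoints of synthetic "accuracies".
    acc = np.random.rand(5, 10)
    between, within, total = decompose_variance(acc)
    assert np.isclose(between + within, total)

Reporting the two components separately, rather than a single standard deviation, shows how much of the spread comes from run-level randomness versus checkpoint selection, in the spirit of the more interpretable results and fairer cross-model comparisons the abstract calls for.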