Paper title
The Fragility of Multi-Treebank Parsing Evaluation
Paper authors
Abstract
Treebank selection for parsing evaluation, and the spurious effects that can arise from a biased choice, have not been explored in detail. This paper studies how evaluating on a single subset of treebanks can lead to weak conclusions. First, we take a few contrasting parsers and run them on subsets of treebanks proposed in previous work, whose use was justified (or not) on criteria such as typology or data scarcity. Second, we run a large-scale version of this experiment: we create a large number of random subsets of treebanks and use them to compare many parsers whose scores are available. The results show substantial variability across subsets. They also show that, although establishing guidelines for good treebank selection is hard, it is possible to detect potentially harmful strategies.
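The random-subset experiment described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the parser names, treebank names, and scores below are invented for illustration, and the comparison criterion (mean score over the subset) is an assumption rather than the paper's stated metric.

```python
import random

# Hypothetical per-treebank scores (illustrative values only, not from the paper).
SCORES = {
    "parser_a": {"tb1": 80.0, "tb2": 70.0, "tb3": 90.0, "tb4": 60.0},
    "parser_b": {"tb1": 78.0, "tb2": 75.0, "tb3": 85.0, "tb4": 66.0},
}


def mean_score(parser, subset):
    """Average score of one parser over a subset of treebanks."""
    return sum(SCORES[parser][tb] for tb in subset) / len(subset)


def win_rate(parser_x, parser_y, subset_size, n_samples, seed=0):
    """Fraction of random treebank subsets on which parser_x strictly beats parser_y.

    Subsets are drawn without replacement within a subset, but the same
    subset may be drawn more than once across samples.
    """
    rng = random.Random(seed)
    treebanks = sorted(SCORES[parser_x])
    wins = 0
    for _ in range(n_samples):
        subset = rng.sample(treebanks, subset_size)
        if mean_score(parser_x, subset) > mean_score(parser_y, subset):
            wins += 1
    return wins / n_samples


if __name__ == "__main__":
    for size in (1, 2, 4):
        rate = win_rate("parser_a", "parser_b", size, n_samples=1000)
        print(f"subset size {size}: parser_a wins on {rate:.1%} of subsets")
```

In this toy setup, parser_b has the higher mean over all four treebanks, yet parser_a still comes out ahead on some smaller subsets. That is the kind of subset-dependent conclusion the abstract warns about.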