Paper Title
A critical analysis of metrics used for measuring progress in artificial intelligence
Paper Authors
Paper Abstract
Comparing model performances on benchmark datasets is an integral part of measuring and driving progress in artificial intelligence. A model's performance on a benchmark dataset is commonly assessed based on a single performance metric or a small set of metrics. While this enables quick comparisons, it entails the risk of inadequately reflecting model performance if the metric does not sufficiently cover all performance characteristics. It is unknown to what extent this might impact benchmarking efforts. To address this question, we analysed the current landscape of performance metrics based on data covering 3867 machine learning model performance results from the open repository 'Papers with Code'. Our results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance. While alternative metrics that address problematic properties have been proposed, they are currently rarely used. Furthermore, we describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
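As a concrete illustration of the abstract's central claim, the minimal sketch below (our own example, not taken from the paper) shows how a single commonly reported metric such as accuracy can overstate the performance of a trivial classifier on class-imbalanced data, while alternative metrics such as balanced accuracy and the Matthews correlation coefficient expose the problem. The dataset, the trivial model, and the choice of metrics are illustrative assumptions; the sketch assumes scikit-learn is available.

```python
# Illustrative sketch (not from the paper): a single metric can
# inadequately reflect model performance on imbalanced data.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    matthews_corrcoef,
)

# Hypothetical imbalanced test set: 950 negatives, 50 positives.
y_true = np.array([0] * 950 + [1] * 50)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy looks strong (0.95) despite zero predictive value.
print(f"Accuracy:          {accuracy_score(y_true, y_pred):.2f}")
# Balanced accuracy drops to chance level (0.50).
print(f"Balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2f}")
# The Matthews correlation coefficient is 0.00 for a constant predictor.
print(f"MCC:               {matthews_corrcoef(y_true, y_pred):.2f}")
```

Here accuracy alone would rank the trivial classifier as performing well, whereas the two alternative metrics reveal that it learned nothing, which is the kind of inadequate reflection of performance the paper analyses at scale.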