Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles

Paper Title

Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles

Authors

Sheeba Samuel, Frank Löffler, Birgitta König-Ries

Abstract

Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes more and more important that the results of ML experiments are reproducible. Unfortunately, that often is not the case. Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present our preliminary results on the role of our tool, ProvBook, in capturing and comparing provenance of ML experiments and their reproducibility using Jupyter Notebooks.
