Paper Title

Vamsa: Automated Provenance Tracking in Data Science Scripts

Authors

Mohammad Hossein Namaki, Avrilia Floratou, Fotis Psallidas, Subru Krishnan, Ashvin Agrawal, Yinghui Wu, Yiwen Zhu, Markus Weimer

Abstract

There has recently been a lot of ongoing research in the areas of fairness, bias, and explainability of machine learning (ML) models, driven by the self-evident or regulatory requirements of various ML applications. We make the following observation: all of these approaches require a robust understanding of the relationship between ML models and the data used to train them. In this work, we introduce the ML provenance tracking problem: the fundamental idea is to automatically track which columns in a dataset have been used to derive the features/labels of an ML model. We discuss the challenges in capturing such information in the context of Python, the most common language used by data scientists. We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code. Using 26K real data science scripts, we verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall range from 90.4% to 99.1%, and its latency is on the order of milliseconds for average-size scripts. Drawing from our experience in deploying ML models in production, we also present an example in which Vamsa helps automatically identify models that are affected by data corruption issues.
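To make the problem concrete, the sketch below shows the kind of script Vamsa analyzes and a deliberately simplified, AST-based extraction of the column names it touches. This is not Vamsa's actual algorithm (the paper describes a far more sophisticated, knowledge-base-driven static analysis that also distinguishes features from labels); the toy script, its file and column names, and the `extract_column_provenance` helper are all illustrative assumptions.

```python
import ast

# Toy data science script of the kind Vamsa analyzes (illustrative, not from the paper).
SCRIPT = """
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("patients.csv")
X = df[["age", "bmi", "glucose"]]   # feature columns
y = df["diabetic"]                  # label column
model = LogisticRegression().fit(X, y)
"""

def extract_column_provenance(source: str) -> set:
    """Collect string column names used in DataFrame subscripts by
    walking the script's AST. A minimal sketch of static provenance
    extraction; it does not track variable flow or ML API semantics
    the way Vamsa's analysis does."""
    columns = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Subscript):
            sub = node.slice
            # df["col"] -> a single string constant subscript
            if isinstance(sub, ast.Constant) and isinstance(sub.value, str):
                columns.add(sub.value)
            # df[["a", "b"]] -> a list of string constants
            elif isinstance(sub, ast.List):
                for elt in sub.elts:
                    if isinstance(elt, ast.Constant) and isinstance(elt.value, str):
                        columns.add(elt.value)
    return columns

print(sorted(extract_column_provenance(SCRIPT)))
# → ['age', 'bmi', 'diabetic', 'glucose']
```

Because the analysis runs on the script's source rather than inside its process, the user's code needs no modification, which is the deployment property the abstract emphasizes.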
