Paper Title

Online Missing Value Imputation and Change Point Detection with the Gaussian Copula

Authors

Yuxuan Zhao, Eric Landgrebe, Eliot Shekhtman, Madeleine Udell

Abstract

Missing value imputation is crucial for real-world data science workflows. Imputation is harder in the online setting, because the imputation method itself must be able to evolve over time. For practical applications, imputation algorithms should produce imputations that match the true data distribution, handle data of mixed types (including ordinal, boolean, and continuous variables), and scale to large datasets. In this work we develop a new online imputation algorithm for mixed data using the Gaussian copula. The online Gaussian copula model meets all of these desiderata: its imputations match the data distribution even for mixed data, and it improves over its offline counterpart in accuracy when the streaming data has a changing distribution and in speed (by up to an order of magnitude), especially on large-scale datasets. By fitting the copula model to online data, we also provide a new method to detect change points in the multivariate dependence structure with missing values. Experimental results on synthetic and real-world data validate the performance of the proposed methods.
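
To make the copula idea in the abstract concrete, the following is a minimal offline sketch, not the authors' online algorithm. It assumes continuous columns only, maps each column to latent normal scores through its empirical CDF, estimates the latent correlation from pairwise-complete observations (a crude stand-in for the model fitting the paper describes), and imputes each missing entry from the Gaussian conditional mean. The function names `to_normal_scores` and `impute_gaussian_copula` are hypothetical and chosen for illustration.

```python
import numpy as np
from scipy.stats import norm, rankdata


def to_normal_scores(col):
    """Map the observed entries of one column to latent normal scores."""
    z = np.full(col.shape, np.nan)
    obs = ~np.isnan(col)
    n = obs.sum()
    # Dividing ranks by n + 1 keeps the CDF values strictly inside (0, 1),
    # so norm.ppf never returns +/- infinity.
    z[obs] = norm.ppf(rankdata(col[obs]) / (n + 1))
    return z


def impute_gaussian_copula(X):
    """Impute NaNs in a continuous data matrix with a Gaussian copula sketch.

    Assumptions (not from the paper): every pair of columns shares at least
    two jointly observed rows, no column is constant, and rows with no
    observed entry are left untouched.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Z = np.column_stack([to_normal_scores(X[:, j]) for j in range(p)])

    # Latent correlation from pairwise-complete normal scores.
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
            R[j, k] = R[k, j] = np.corrcoef(Z[both, j], Z[both, k])[0, 1]

    X_imp = X.copy()
    for i in range(n):
        mis = np.isnan(X[i])
        obs = ~mis
        if not mis.any() or not obs.any():
            continue
        # Conditional mean of the missing latent scores given the observed ones.
        z_mis = R[np.ix_(mis, obs)] @ np.linalg.solve(R[np.ix_(obs, obs)], Z[i, obs])
        for j, zj in zip(np.flatnonzero(mis), z_mis):
            observed = X[~np.isnan(X[:, j]), j]
            # Map the latent value back through the column's empirical quantiles.
            X_imp[i, j] = np.quantile(observed, norm.cdf(zj))
    return X_imp
```

Because imputed values are read off each column's empirical quantiles, they always fall within the observed range of that column, which is one simple sense in which imputations can follow the marginal data distribution. Ordinal and boolean columns, streaming updates, and change point detection are outside the scope of this sketch.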
