Paper Title

$β$-Cores: Robust Large-Scale Bayesian Data Summarization in the Presence of Outliers

Paper Authors

Dionysis Manousakas, Cecilia Mascolo

Paper Abstract

Modern machine learning applications should be able to address the intrinsic challenges arising in inference on massive real-world datasets, including scalability and robustness to outliers. Despite the multiple benefits of Bayesian methods (such as uncertainty-aware predictions, incorporation of expert knowledge, and hierarchical modeling), the quality of classic Bayesian inference depends critically on whether observations conform to the assumed data-generating model, which is impossible to guarantee in practice. In this work, we propose a variational inference method that, in a principled way, can simultaneously scale to large datasets and robustify the inferred posterior with respect to the presence of outliers in the observed data. Reformulating Bayes' theorem via the $β$-divergence, we posit a robustified pseudo-Bayesian posterior as the target of inference. Moreover, relying on recent formulations of Riemannian coresets for scalable Bayesian inference, we propose a sparse variational approximation of the robustified posterior and an efficient stochastic black-box algorithm to construct it. Overall, our method allows releasing cleansed data summaries that can be applied broadly in scenarios including structured data corruption. We illustrate the applicability of our approach on diverse simulated and real datasets and various statistical models, including Gaussian mean inference, logistic regression, and neural linear regression, demonstrating its superiority over existing Bayesian summarization methods in the presence of outliers.
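The abstract does not spell out the robustified pseudo-Bayesian posterior it targets. As a hedged sketch, the standard $β$-divergence (density-power-divergence) pseudo-posterior from the robust-Bayes literature has the form below, where $\pi_0$ is the prior, $f(\cdot \mid \theta)$ the likelihood, and $\beta > 0$ a robustness parameter; the notation is ours and may differ from the paper's:

$$
\pi_\beta(\theta \mid x_{1:n}) \;\propto\; \pi_0(\theta)\, \exp\!\left( \sum_{i=1}^{n} \left[ \frac{1}{\beta}\, f(x_i \mid \theta)^{\beta} \;-\; \frac{1}{1+\beta} \int f(y \mid \theta)^{1+\beta}\, dy \right] \right)
$$

Each observation enters through $f(x_i \mid \theta)^{\beta}$ rather than $\log f(x_i \mid \theta)$, so a gross outlier contributes a bounded term; as $\beta \to 0$ the exponent reduces to the log-likelihood (up to $\theta$-independent constants) and standard Bayes is recovered.

To make the downweighting concrete, the following is a minimal Python sketch of the per-observation $β$-loss for the Gaussian mean-inference model mentioned in the abstract. The function name, the numerical settings, and the choice $β = 0.5$ are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import norm

def beta_loss_gaussian(x, mu, sigma, beta):
    """Per-observation beta-cross-entropy loss for an N(mu, sigma^2) model.

    As beta -> 0 this recovers the negative log-likelihood up to constants.
    (Illustrative sketch; not the paper's code.)
    """
    dens = norm.pdf(x, loc=mu, scale=sigma)
    # Closed form of the integral of N(y; mu, sigma^2)^(1 + beta) over y.
    integral = (2.0 * np.pi * sigma**2) ** (-beta / 2.0) / np.sqrt(1.0 + beta)
    return -dens**beta / beta + integral / (1.0 + beta)

mu, sigma, beta = 0.0, 1.0, 0.5
for x in (0.5, 8.0):
    # The gradient of the loss w.r.t. mu is the usual Gaussian score
    # (x - mu) / sigma^2 weighted by f(x | mu, sigma)^beta, which
    # vanishes for gross outliers.
    weight = norm.pdf(x, loc=mu, scale=sigma) ** beta
    grad_mu = -weight * (x - mu) / sigma**2
    print(f"x = {x:4.1f}  weight = {weight:.3e}  d(loss)/d(mu) = {grad_mu:+.3e}")
```

A clean point near the mean keeps a weight close to one, while an 8-sigma outlier is weighted by roughly $10^{-8}$, which illustrates why the pseudo-posterior, and hence a coreset summary built to approximate it, is largely insensitive to corrupted observations.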
