Paper Title
Uncertainty-aware Pseudo-label Selection for Positive-Unlabeled Learning
Paper Authors
Paper Abstract
Positive-unlabeled learning (PUL) aims at learning a binary classifier from only positive and unlabeled training data. Even though real-world applications often involve imbalanced datasets in which the majority of examples belong to one class, most contemporary approaches to PUL do not investigate performance in this setting, severely limiting their applicability in practice. In this work, we propose to tackle the issues of imbalanced datasets and model calibration in the PUL setting through an uncertainty-aware pseudo-labeling procedure (PUUPL): by boosting the signal from the minority class, pseudo-labeling expands the labeled dataset with new samples from the unlabeled set, while explicit uncertainty quantification prevents the emergence of harmful confirmation bias, leading to improved predictive performance. In a series of experiments, PUUPL yields substantial performance gains in highly imbalanced settings while also performing strongly against recent baselines in balanced PU scenarios. We furthermore provide ablations and sensitivity analyses to shed light on the individual components of PUUPL. Finally, a real-world application with an imbalanced dataset confirms the advantage of our approach.
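To make the abstract's core idea concrete, the following is a minimal, hypothetical Python sketch of one round of uncertainty-aware pseudo-label selection in a PU setting. It is not the authors' reference implementation: the ensemble-based uncertainty estimate, the treatment of unlabeled points as provisional negatives, and the thresholds `prob_threshold` and `max_uncertainty` are all illustrative assumptions. The key mechanism it demonstrates is that an unlabeled sample is pseudo-labeled only if the ensemble is both confident (mean probability near 0 or 1) and in agreement (low variance), which is one common way to avoid the confirmation bias the abstract refers to.

```python
# Illustrative sketch only; NOT the PUUPL reference implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_step(X_pos, X_unlabeled, n_models=5,
                      prob_threshold=0.95, max_uncertainty=0.05, seed=0):
    """One round of uncertainty-aware pseudo-labeling (hypothetical sketch).

    Treats all unlabeled points as provisional negatives, trains a small
    bootstrap ensemble, and selects only those unlabeled points whose
    ensemble-mean probability is confident AND whose ensemble variance
    (a simple uncertainty proxy) is low.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([X_pos, X_unlabeled])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unlabeled))])

    # Collect ensemble predictions on the unlabeled set.
    probs = np.empty((n_models, len(X_unlabeled)))
    for m in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        probs[m] = clf.predict_proba(X_unlabeled)[:, 1]

    mean, var = probs.mean(axis=0), probs.var(axis=0)

    # Keep candidates that are confident for either class and low-variance.
    confident = (mean > prob_threshold) | (mean < 1 - prob_threshold)
    selected = confident & (var < max_uncertainty)
    pseudo_labels = (mean > 0.5).astype(int)
    return selected, pseudo_labels
```

In practice such a step would be iterated: the selected samples and their pseudo-labels are moved into the labeled set, the model is retrained, and the procedure repeats until no further confident, low-uncertainty candidates remain.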