Paper title
Robust and Efficient Imbalanced Positive-Unlabeled Learning with Self-supervision
Paper authors
Paper abstract
Learning from positive and unlabeled (PU) data is a setting in which the learner has access only to positive and unlabeled samples and has no information on negative examples. This PU setting is of great importance in tasks such as medical diagnosis, social network analysis, financial market analysis, and knowledge base completion, which also tend to be intrinsically imbalanced, i.e., most examples are actually negatives. Most existing approaches to PU learning, however, consider only artificially balanced datasets, and it is unclear how well they perform in the realistic scenario of imbalanced and long-tailed data distributions. This paper proposes to tackle this challenge via robust and efficient self-supervised pretraining. However, conventional self-supervised learning methods need to be reformulated when applied to highly imbalanced PU distributions. In this paper, we present \textit{ImPULSeS}, a unified representation learning framework for \underline{Im}balanced \underline{P}ositive \underline{U}nlabeled \underline{L}earning leveraging \underline{Se}lf-\underline{S}upervised debiased pre-training. ImPULSeS uses a generic combination of large-scale unsupervised learning with a debiased contrastive loss and an additional reweighted PU loss. We performed experiments across multiple datasets to show that ImPULSeS is able to halve the error rate of the previous state-of-the-art, even compared with previous methods that are given the true prior. Moreover, our method showed increased robustness to prior misspecification and superior performance even when pretraining was performed on an unrelated dataset. We anticipate that such robustness and efficiency will make it much easier for practitioners to obtain excellent results on other PU datasets of interest. The source code is available at \url{https://github.com/JSchweisthal/ImPULSeS}.
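The abstract does not spell out the form of the reweighted PU loss. As a rough illustration of what a PU loss looks like in general, the following is a minimal Python sketch of the widely used non-negative PU risk estimator with a sigmoid surrogate loss. It is not the ImPULSeS loss and includes no imbalance reweighting; the function names and the `prior` argument (an assumed class prior P(y = +1)) are illustrative assumptions, not names taken from the paper or its repository.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss l(z, y) = 1 / (1 + exp(y * z)), with y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk estimate computed from classifier scores.

    pos_scores: scores f(x) on the labeled positive samples
    unl_scores: scores f(x) on the unlabeled samples
    prior:      assumed class prior P(y = +1); a hyperparameter in this sketch
    """
    # Risk of positives when treated as positive / as negative.
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    # Risk of unlabeled samples when treated as negative.
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # The negative-class risk is estimated as the unlabeled risk minus the
    # positive share; it is clipped at zero so the estimate cannot go negative.
    neg_risk = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + neg_risk


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    print(nn_pu_risk([2.0, 1.5, 0.3], [-1.0, 0.2, -2.5, 1.8], prior=0.1))
```

In an imbalanced setting the prior is small, so the positive term contributes little to this unweighted estimate, which is one intuition for why a reweighted variant can help. The misspecified-prior experiments mentioned in the abstract correspond to passing a `prior` that differs from the true class proportion.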