Paper Title

Neural Active Learning on Heteroskedastic Distributions

Paper Authors

Savya Khosla, Chew Kin Whye, Jordan T. Ash, Cyril Zhang, Kenji Kawaguchi, Alex Lamb

Paper Abstract

Models that can actively seek out the best quality training data hold the promise of more accurate, adaptable, and efficient machine learning. Active learning techniques tend to prefer examples that are the most difficult to classify. While this works well on homogeneous datasets, we find that it can lead to catastrophic failures when performed on multiple distributions with different degrees of label noise or heteroskedasticity. These active learning algorithms strongly prefer to draw from the distribution with more noise, even if their examples have no informative structure (such as solid color images with random labels). To this end, we demonstrate the catastrophic failure of these active learning algorithms on heteroskedastic distributions and propose a fine-tuning-based approach to mitigate these failures. Further, we propose a new algorithm that incorporates a model difference scoring function for each data point to filter out the noisy examples and sample clean examples that maximize accuracy, outperforming existing active learning techniques on heteroskedastic datasets. We hope these observations and techniques are immediately helpful to practitioners and can help to challenge common assumptions in the design of active learning algorithms.
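The abstract does not spell out the form of the model difference scoring function, but the filtering idea it describes can be illustrated mechanically. Below is a minimal PyTorch sketch, assuming (hypothetically) that the difference score is the symmetric KL divergence between the predictive distributions of two model snapshots, and that the highest-disagreement points are discarded as likely label noise before a standard uncertainty-based acquisition step. The names `model_difference_scores`, `select_batch`, and `keep_fraction` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def model_difference_scores(model_a, model_b, pool_loader, device="cpu"):
    """Score each unlabeled pool point by the disagreement between two
    model snapshots' predictive distributions (symmetric KL divergence).

    High disagreement is used here as a proxy for label noise: points
    the two models cannot agree on are filtered out before acquisition.
    """
    model_a.eval()
    model_b.eval()
    scores = []
    with torch.no_grad():
        for x, _ in pool_loader:
            x = x.to(device)
            p = F.softmax(model_a(x), dim=1)
            q = F.softmax(model_b(x), dim=1)
            # Symmetric KL between the two predictive distributions,
            # clamped to avoid log(0).
            log_p = p.clamp_min(1e-8).log()
            log_q = q.clamp_min(1e-8).log()
            kl_pq = (p * (log_p - log_q)).sum(dim=1)
            kl_qp = (q * (log_q - log_p)).sum(dim=1)
            scores.append(kl_pq + kl_qp)
    return torch.cat(scores)

def select_batch(scores_uncertainty, scores_difference, batch_size,
                 keep_fraction=0.5):
    """Drop the highest model-difference points (treated as noisy), then
    take the most uncertain examples from the remaining pool."""
    n_keep = int(len(scores_difference) * keep_fraction)
    # Lowest-difference points are treated as likely clean.
    clean_idx = torch.argsort(scores_difference)[:n_keep]
    ranked = clean_idx[
        torch.argsort(scores_uncertainty[clean_idx], descending=True)
    ]
    return ranked[:batch_size]
```

In this sketch the difference score only vetoes candidates before a conventional uncertainty ranking; the paper's actual algorithm may combine the two signals differently.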
