Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks

Paper Title

Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks

Authors

Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Michael W. Dusenberry, Ghassen Jerfel, Dustin Tran, Yarin Gal

Abstract

Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet, existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose the RETINA Benchmark, a set of real-world tasks that accurately reflect such complexities and are designed to assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast and easy benchmarking following reproducibility and software design principles. We provide implementations of all methods included in the benchmark as well as results computed over 100 TPU days, 20 GPU days, 400 hyperparameter configurations, and evaluation on at least 6 random seeds each.
