Forecasting labels under distribution-shift for machine-guided sequence design

Paper title

Forecasting labels under distribution-shift for machine-guided sequence design

Authors

Wheelock, Lauren Berk; Malina, Stephen; Gerold, Jeffrey; Sinai, Sam

Abstract

The ability to design and optimize biological sequences with specific functionalities would unlock enormous value in technology and healthcare. In recent years, machine learning-guided sequence design has advanced this goal significantly, though validating designed sequences in the lab or clinic takes many months and substantial labor. It is therefore valuable to assess, before committing resources to an experiment, the likelihood that a designed set contains sequences of the desired quality (which often lies outside the label distribution in our training data). Forecasting, a prominent concept in many domains where feedback can be delayed (e.g. elections), has not been used or studied in the context of sequence design. Here we propose a method to guide decision-making that forecasts the performance of high-throughput libraries (e.g. containing $10^5$ unique variants) based on estimates provided by models, yielding a posterior for the distribution of labels in the library. We show that our method outperforms baselines that naively use model scores to estimate library performance, which are the only tools available today for this purpose.
