Paper title
Improving Small Molecule Generation using Mutual Information Machine
Paper authors
Abstract
We address the task of controlled generation of small molecules, which entails finding novel molecules with desired properties under certain constraints (e.g., similarity to a reference molecule). Here we introduce MolMIM, a probabilistic auto-encoder for small molecule drug discovery that learns an informative and clustered latent space. MolMIM is trained with Mutual Information Machine (MIM) learning and provides a fixed-length representation of variable-length SMILES strings. Since encoder-decoder models can learn representations with ``holes'' of invalid samples, we propose a novel extension to the training procedure that promotes a dense latent space and allows the model to sample valid molecules from random perturbations of latent codes. We provide a thorough comparison of MolMIM to several variable-size and fixed-size encoder-decoder models, demonstrating MolMIM's superior generation as measured in terms of validity, uniqueness, and novelty. We then apply CMA-ES, a naive black-box and gradient-free search algorithm, over MolMIM's latent space for the task of property-guided molecule optimization. We achieve state-of-the-art results in several constrained single-property optimization tasks as well as in the challenging task of multi-objective optimization, improving on the previous state-of-the-art success rate by more than 5\%. We attribute these strong results to MolMIM's latent representation, which clusters similar molecules in the latent space, rather than to the search algorithm, since CMA-ES is often used as a baseline optimization method. We also demonstrate that MolMIM is favourable in a compute-limited regime, making it an attractive model for such cases.
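As an illustration of the three generation metrics named in the abstract, the following minimal sketch computes validity, uniqueness, and novelty for a batch of sampled SMILES strings. It assumes the commonly used definitions (valid fraction of all samples, unique fraction of valid samples, fraction of unique valid samples absent from the training set); the abstract does not spell out its exact definitions, so these may differ from the paper's. The names `generation_metrics` and `toy_is_valid` are hypothetical, and `toy_is_valid` is only a stand-in for a real SMILES parser.

```python
from typing import Callable, Dict, Iterable, Set


def generation_metrics(
    generated: Iterable[str],
    training_set: Set[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Return validity, uniqueness, and novelty as fractions in [0, 1].

    Assumed definitions (not taken from the abstract itself):
      validity   = valid samples / all samples
      uniqueness = distinct valid samples / valid samples
      novelty    = distinct valid samples not in training set / distinct valid samples
    Strings are compared as-is; a real pipeline would canonicalize SMILES first.
    """
    samples = list(generated)
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    novel = unique - training_set
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Hypothetical validity check: non-empty with balanced parentheses.

    This is NOT a chemistry-aware check; swap in a real SMILES parser in practice.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


# Toy data: two training molecules and five sampled strings (one malformed).
training = {"CCO", "CC(=O)O"}
samples = ["CCO", "CCO", "CCN", "CC(=O", "c1ccccc1"]
metrics = generation_metrics(samples, training, toy_is_valid)
print(metrics)
```

Passing the validity check as a function keeps the sketch free of external dependencies; the metric arithmetic is unchanged if a chemistry toolkit's parser is supplied instead.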