基因组组装，从实践到理论：安全，完整和线性时间

论文标题

基因组组装，从实践到理论：安全，完整和线性时间

Genome assembly, from practice to theory: safe, complete and linear-time

论文作者

Cairo, Massimo, Rizzi, Romeo, Tomescu, Alexandru I., Zirondelli, Elia C.

论文摘要

基因组组装要求从其许多较短的子字符串中重建一个未知字符串。即使它是生物信息学的关键问题之一，但通常缺乏重大的理论进步。它的硬度既源于实际问题（实际数据的大小和错误），也源于问题制定固有地接受多种解决方案的事实。鉴于这些以核心为核心，大多数最先进的组装程序都是基于在组装图中找到非分支路径（UNITIG）的基础。如果将基因组组装溶液定义为图形的封闭弧覆盖行走，则在所有溶液中出现unitigs，因此是安全的部分溶液。最近，所有这些安全的步行都被描述为综合，导致了第一个安全，完整的基因组组装算法。即使将Omnitig发现改善到二次时间，是否可以使用omnitigs获得查找单位的关键线性时间特征。我们描述了一种令人惊讶的$ o（m）$ - 时间算法，以识别具有$ n $节点和$ m $ arcs的图表的所有最大余额，尽管存在$θ（MN）$总最大值omnitig大小的图形家庭。这是基于发现一个属性的一系列步行家族（宏观），其特性是所有非平凡的综合词都是宏观曲霉的子研究物的单局部扩展，并产生两个后果：（1）线性输出时间敏感算法使所有最大含量占所有最大含量。（2）所有最大余额的紧凑型$ o（m）$表示，例如，对于$ o（m）$ - 对它们的各种统计信息进行计算。我们的结果结束了一个长期以来的理论问题，灵感来自实用基因组组装者，源自1995年使用Unitigs。我们设想我们的结果是从理论到实用和完整基因组组装程序的反向转移的核心，就像其他关键的生物信息学问题一样。

Genome assembly asks to reconstruct an unknown string from many shorter substrings of it. Even though it is one of the key problems in Bioinformatics, it is generally lacking major theoretical advances. Its hardness stems both from practical issues (size and errors of real data), and from the fact that problem formulations inherently admit multiple solutions. Given these, at their core, most state-of-the-art assemblers are based on finding non-branching paths (unitigs) in an assembly graph. If one defines a genome assembly solution as a closed arc-covering walk of the graph, then unitigs appear in all solutions, being thus safe partial solutions. All all such safe walks were recently characterized as omnitigs, leading to the first safe and complete genome assembly algorithm. Even if omnitig finding was improved to quadratic time, it remained open whether the crucial linear-time feature of finding unitigs can be attained with omnitigs. We describe a surprising $O(m)$-time algorithm to identify all maximal omnitigs of a graph with $n$ nodes and $m$ arcs, notwithstanding the existence of families of graphs with $Θ(mn)$ total maximal omnitig size. This is based on the discovery of a family of walks (macrotigs) with the property that all the non-trivial omnitigs are univocal extensions of subwalks of a macrotig, with two consequences: (1) A linear-time output-sensitive algorithm enumerating all maximal omnitigs. (2) A compact $O(m)$ representation of all maximal omnitigs, which allows, e.g., for $O(m)$-time computation of various statistics on them. Our results close a long-standing theoretical question inspired by practical genome assemblers, originating with the use of unitigs in 1995. We envision our results to be at the core of a reverse transfer from theory to practical and complete genome assembly programs, as has been the case for other key Bioinformatics problems.

下载PDF全文

下载文献需遵守相关版权规定

论文标题