论文标题
De Bruijn图上的近似模式匹配的复杂性
The Complexity of Approximate Pattern Matching on De Bruijn Graphs
论文作者
论文摘要
将序列与标记图中的步行保持一致是对计算生物学的基本重要性的问题。为了在任意图中找到$ | e | $边缘的散步,它与长度$ m $的模式完全匹配,基于强级指数时间假设(SETH)的下限意味着算法的速度明显快于$ o(| e | m)$ time的时间不太可能[equi et equi et al。,icalp 2019]。但是,对于许多特殊图,例如de bruijn图,可以在线性时间中解决该问题[Bowe等,Wabi 2012]。对于近似匹配,图片更为复杂。当仅允许将编辑(替换,插入和删除)到图案中,或者图形为无环时,问题再次可在$ o(| e | m)$时间中解决。当允许编辑任意循环图时,即使在二进制字母上,问题也将成为NP完整的[Jain等人,Recomb 2019]。即使仅限于替换,这些结果也会成立。 De Bruijn图上近似模式匹配的复杂性保持开放。我们研究了这个问题,并表明使De Bruijn图的属性适合于有效的精确模式匹配也不会扩展到近似匹配,即使仅限于仅具有字母大小四的替代情况。我们证明,当允许替换为图表时,确定在de bruijn图中的匹配步行的存在是NP的。此外,我们证明,在仅允许替换为模式的情况下,对于de bruijn图,对于de bruijn图的算法显着快。这与模式到文本的匹配形成鲜明对比,在线性时间可以解决确切的匹配时间,例如在de bruijn图上,但是在替换下的近似匹配可以在子级别的$ o(n \ sqrt {m})$ o(n \ sqrt {m})$ time中求解,而$ n $ n $ n $是文本的长度[abrahamson,abrahamson,siam siam siam siam siam J.计算1987]。
Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to Computational Biology. For finding a walk in an arbitrary graph with $|E|$ edges that exactly matches a pattern of length $m$, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies an algorithm significantly faster than $O(|E|m)$ time is unlikely [Equi et al., ICALP 2019]. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time [Bowe et al., WABI 2012]. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is again solvable in $O(|E|m)$ time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets [Jain et al., RECOMB 2019]. These results hold even when edits are restricted to only substitutions. The complexity of approximate pattern matching on de Bruijn graphs remained open. We investigate this problem and show that the properties that make de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions only case with alphabet size four. We prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. In addition, we demonstrate that an algorithm significantly faster than $O(|E|m)$ is unlikely for de Bruijn graphs in the case where only substitutions are allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, like on de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic $O(n\sqrt{m})$ time, where $n$ is the text's length [Abrahamson, SIAM J. Computing 1987].