一种用于查找最大共同子序列的快速随机算法

论文标题

一种用于查找最大共同子序列的快速随机算法

A Fast Randomized Algorithm for Finding the Maximal Common Subsequences

论文作者

Cao, Jin, Zhong, Dewei

论文摘要

在生物信息学，计算语言学和信息检索方面，找到$ l $多元字符串的常见子序列有许多应用。一个众所周知的结果指出，为$ l $ strings找到最长的常见子序列（LCS）是NP-HARD，例如，计算复杂性在$ L $中为指数。在本文中，我们开发了一种随机算法，称为{\ em tandom-mcs}，用于查找多个字符串的最大常见子序列（$ MCS $）的随机实例。如果将任何字符插入子序列不再产生共同的子序列，则常见的子序列是{\ em maxal}。 MCS的特殊情况是长度最长的LCS。我们显示我们的算法的复杂性是$ L $的线性，因此适用于大型$ L $。此外，我们研究了MC的单个实例的发生概率，并通过理论和实验研究证明了{\ em Random-MCS}的多个运行中最长的子序列通常会产生$ LCS $的解决方案。

Finding the common subsequences of $L$ multiple strings has many applications in the area of bioinformatics, computational linguistics, and information retrieval. A well-known result states that finding a Longest Common Subsequence (LCS) for $L$ strings is NP-hard, e.g., the computational complexity is exponential in $L$. In this paper, we develop a randomized algorithm, referred to as {\em Random-MCS}, for finding a random instance of Maximal Common Subsequence ($MCS$) of multiple strings. A common subsequence is {\em maximal} if inserting any character into the subsequence no longer yields a common subsequence. A special case of MCS is LCS where the length is the longest. We show the complexity of our algorithm is linear in $L$, and therefore is suitable for large $L$. Furthermore, we study the occurrence probability for a single instance of MCS and demonstrate via both theoretical and experimental studies that the longest subsequence from multiple runs of {\em Random-MCS} often yields a solution to $LCS$.

下载PDF全文

下载文献需遵守相关版权规定

论文标题