论文标题
锤子和Levenshtein距离之间的算法桥
An Algorithmic Bridge Between Hamming and Levenshtein Distances
论文作者
论文摘要
字符串之间的编辑距离经典地将单位成本分配给每个字符插入,删除和替换,而锤击距离仅允许替换。在许多现实生活中,插入和删除(缩写的Indels)经常出现,但与替代相比明显少得多。为了建模,我们认为替代品比Indels便宜,而参数$ a \ ge 1 $的价格为$ 1/a $。这种基本变体表示$ ed_a $,带有锤击距离($ a \ to \ infty $)的Bridges经典编辑距离($ a = 1 $),导致有趣的算法挑战:计算$ ed_a $的时间复杂性是否在锤击距离(线性时间)和编辑距离(线性时间)和编辑距离(Quadratic Time)之间的时间复杂性?那近似$ ed_a $呢? 我们首先为$ ed_a $提出了一种简单的确定性精确算法,并进一步证明它在正交矢量猜想的情况下几乎是最佳的。我们的主要结果是一个随机算法计算$(1+ε)$ - $ ed_a(x,y)$的近似,给定字符串$ x,总长度$ n $的$ x,y $和bound $ k \ ge ed_a(x,y)$。为简单起见,让我们专注于$ k \ ge 1 $和一个常数$ε> 0 $;然后,我们的算法采用$ \ tilde {o}(n/a + ak^3)$时间。除非$ a = \ tilde {o}(1)$,并且对于足够小的$ k $,则此运行时间为$ n $。我们还考虑了一个非常自然的版本,该版本要求找到$(k_i,k_s)$ - 对齐 - 与最多$ k_i $ indels和$ k_s $替换的对齐方式。在这种情况下,我们给出了一种精确的算法,更重要的是,$ \ tilde {o}(nk_i/k_s + k_s \ cdot k_i^3)$ - time $(1,1 +ε)$ - bicriteria近似算法。后一个解决方案基于我们针对$ a =θ(k_s / k_i)$开发的技术。这些边界与单位成本的编辑距离形成鲜明对比,在那里,最先进的算法远非达成$(1+ε)$ - 近似于sublinear时间的近似值,即使是$ k $的有利选择也是如此。
The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost $1/a$ for a parameter $a\ge 1$. This basic variant, denoted $ED_a$, bridges classical edit distance ($a=1$) with Hamming distance ($a\to\infty$), leading to interesting algorithmic challenges: Does the time complexity of computing $ED_a$ interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating $ED_a$? We first present a simple deterministic exact algorithm for $ED_a$ and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a $(1+ε)$-approximation of $ED_a(X,Y)$, given strings $X,Y$ of total length $n$ and a bound $k\ge ED_a(X,Y)$. For simplicity, let us focus on $k\ge 1$ and a constant $ε> 0$; then, our algorithm takes $\tilde{O}(n/a + ak^3)$ time. Unless $a=\tilde{O}(1)$ and for small enough $k$, this running time is sublinear in $n$. We also consider a very natural version that asks to find a $(k_I, k_S)$-alignment -- an alignment with at most $k_I$ indels and $k_S$ substitutions. In this setting, we give an exact algorithm and, more importantly, an $\tilde{O}(nk_I/k_S + k_S\cdot k_I^3)$-time $(1,1+ε)$-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for $ED_a$ for $a=Θ(k_S / k_I)$. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving $(1+ε)$-approximation in sublinear time, even for a favorable choice of $k$.