论文标题
顶级$ k $连续发生的字符串索引
String Indexing for Top-$k$ Close Consecutive Occurrences
论文作者
论文摘要
经典的字符串索引问题是将字符串$ s $用于紧凑的数据结构中,该数据结构支持有效的后续图案匹配查询,即给定模式字符串$ p $,报告$ s $内的所有出现$ p $。在本文中,我们研究了字符串索引的基本自然扩展,称为弦索引,用于顶部$ k $接近连续发生问题(SITCCO)。在这里,连续发生的是一对$(i,j)$,$ i <j $,因此$ p $在位置$ i $和$ s $ in $ s $的位置$ i $和$ j $都出现,并且在$ i $ $ $ j $之间没有$ p $,并且它们的距离定义为$ j-i $。鉴于图案$ p $和参数$ k $,目标是在$ s $ s $ of Minimal距离中连续报告$ p $的最高$ k $。面临的挑战是紧凑代表$ s $,同时支持及时的查询,接近$ p $和$ k $。我们为这个问题提供了三个时间空间的权衡。让$ n $是$ s $,$ m $的$ p $的长度,$ p $,$ε\ in(0,1] $。我们的第一个结果成就$ o(n \ log n)$太空和$ o(m+k)$的最佳查询时间。我们的第二和第三结果实现了线性空间和查询时间$ o(m+k^^^^^{1+^^{1+^^{1+β} $ o(MOR) \ log^{1+ε} n)$。
The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give three time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $ε\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$. Our second and third results achieve linear space and query times either $O(m+k^{1+ε})$ or $O(m + k \log^{1+ε} n)$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.