论文标题

可索引创始人块图的线性时间构建

Linear Time Construction of Indexable Founder Block Graphs

论文作者

Mäkinen, Veli, Cazaux, Bastien, Equi, Massimo, Norri, Tuukka, Tomescu, Alexandru I.

论文摘要

我们基于最佳分割概念引入了紧凑的pangenome表示形式,该概念旨在从多序列比对(MSA)重建创始人序列。这样的创始人序列具有这样的特征,即MSA的每一行都是创始人的重组。以前已经设计了几种线性时间动态编程算法,以优化诱导创始人块的分割,然后将其串联成一组创始人序列。所有可能的串联订单都可以表示为创始人块图。我们观察到这样的图形的关键属性:如果节点标签(创建者段)在图路径的路径中不重复,则可以为有效的字符串匹配而索引此类图。我们称此类图段段重复启动块图。 我们给出了一个线性时间算法,以构建给定MSA的段重复创建器块图。该算法结合了创始人分割算法(Cazaux等人2019)的技术和全功能双向毛毛毛轮索引(Belazzougui and Cunial,CPM 2019)。我们得出一个简洁的索引结构,以支持图路径中任意长度的查询。 据报道,在SAR-COV-2菌株的MSA上进行了实验。 $ 410 \ times 29811 $的MSA在一分钟内将3900个节点和4440个边缘的Repoy-Repoy Founder Block图压实。节点标签的最大长度和总长度分别为12和34968。图上的索引仅占MSA大小的$ 3 \%$。

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SAR-CoV-2 strains are reported. An MSA of size $410\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only $3\%$ of the size of the MSA.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源