论文标题
整数草图上的动态相似性搜索
Dynamic Similarity Search on Integer Sketches
论文作者
论文摘要
具有相似性的散列是用于快速相似性搜索的核心技术,并且它随机将公制空间中的数据点映射到锤子空间中的离散符号(即草图)的字符串。尽管传统的哈希技术产生了二进制草图,但最近的草图产生了整数草图,以保留各种相似性度量。但是,大多数相似性搜索方法都是为二进制草图而设计的,而整数草图效率低下。此外,尽管现代现实世界数据集随时间更新,但大多数方法对于动态数据集而言是不适用或效率低下的。我们提出了动态滤波器Trie(Dyft),这是二进制和整数草图的动态相似性搜索方法。使用大型现实世界数据集进行的广泛的实验分析表明,在可扩展性,时间性能和记忆效率方面,Dyft的性能优越。例如,在一个拥有2.16亿个数据点的庞大数据集中,Dyft执行的相似性搜索速度比最先进的方法快6,000倍,同时将记忆降低到三分之一。
Similarity-preserving hashing is a core technique for fast similarity searches, and it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches for preserving various similarity measures. However, most similarity search methods are designed for binary sketches and inefficient for integer sketches. Moreover, most methods are either inapplicable or inefficient for dynamic datasets, although modern real-world datasets are updated over time. We propose dynamic filter trie (DyFT), a dynamic similarity search method for both binary and integer sketches. An extensive experimental analysis using large real-world datasets shows that DyFT performs superiorly with respect to scalability, time performance, and memory efficiency. For example, on a huge dataset of 216 million data points, DyFT performs a similarity search 6,000 times faster than a state-of-the-art method while reducing to one-thirteenth in memory.