论文标题

使用位置序列模式来估计SQL像查询的选择性

Using Positional Sequence Patterns to Estimate the Selectivity of SQL LIKE Queries

论文作者

Aytimur, Mehmet, Cakmak, Ali

论文摘要

随着通常包含拼写错误和其他错误的基于文本数据的数量的急剧增加,使用灵活的搜索模式查询此类数据变得越来越普遍。关系数据库支持类似运算符,以允许使用特定的通配符谓词进行搜索(例如,像“ sub%”,它以“ sub”开头匹配所有字符串)。由于文本数据的范围很大,因此以最佳方式执行此类查询对于数据库性能至关重要。在为类似查询的最有效的执行计划构建最有效的执行计划时,查询优化器需要对基于模式的查询谓词进行选择性估算。最近,提出了SPH算法,该算法采用基于序列模式的直方图结构来估计类似查询的选择性。 SPH方法的缺点是它通常高估了查询的选择性。为了减轻高估问题,在本文中,我们提出了一种新型的序列模式类型,称为位置序列模式。所提出的模式区分了序列项对,这些序列对在所有模式中彼此相邻出现的模式对与可能在它们之间有其他项目的序列分化。此外,我们根据直方图构建过程中的模式信息内容采用冗余模式消除。最后,我们在选择性估计期间提出了一个基于分区的匹配方案。 DBLP的真实数据集上的实验结果表明,所提出的方法的表现优于ART的错误率提高了20%。

With the dramatic increase in the amount of the text-based data which commonly contains misspellings and other errors, querying such data with flexible search patterns becomes more and more commonplace. Relational databases support the LIKE operator to allow searching with a particular wildcard predicate (e.g., LIKE 'Sub%', which matches all strings starting with 'Sub'). Due to the large size of text data, executing such queries in the most optimal way is quite critical for database performance. While building the most efficient execution plan for a LIKE query, the query optimizer requires the selectivity estimate for the flexible pattern-based query predicate. Recently, SPH algorithm is proposed which employs a sequence pattern-based histogram structure to estimate the selectivity of LIKE queries. A drawback of the SPH approach is that it often overestimates the selectivity of queries. In order to alleviate the overestimation problem, in this paper, we propose a novel sequence pattern type, called positional sequence patterns. The proposed patterns differentiate between sequence item pairs that appear next to each other in all pattern occurrences from those that may have other items between them. Besides, we employ redundant pattern elimination based on pattern information content during histogram construction. Finally, we propose a partitioning-based matching scheme during the selectivity estimation. The experimental results on a real dataset from DBLP show that the proposed approach outperforms the state of the art by around 20% improvement in error rates.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源