Paper Title

A Critical Analysis of Classifier Selection in Learned Bloom Filters

Authors

Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, Marco Frasca

Abstract

Learned Bloom Filters, i.e., models induced from data via machine learning techniques and solving the approximate set membership problem, have recently been introduced with the aim of enhancing the performance of standard Bloom Filters, with special focus on space occupancy. Unlike in the classical case, the "complexity" of the data used to build the filter might heavily impact its performance. Therefore, here we propose what is, to the best of our knowledge, the first in-depth analysis for assessing the performance of a given Learned Bloom Filter, in conjunction with a given classifier, on a dataset of a given classification complexity. Specifically, we propose a novel methodology, supported by software, for designing, analyzing and implementing Learned Bloom Filters as a function of specific constraints on their multi-criteria nature (that is, constraints involving space efficiency, false positive rate, and reject time). Our experiments show that the proposed methodology and the supporting software are valid and useful: we find that only two classifiers have desirable properties in relation to problems with different data complexity, and, interestingly, neither of them has been considered so far in the literature. We also show experimentally that the Sandwiched variant of Learned Bloom Filters is the most robust to data complexity and classifier performance variability, and that it is also the variant that usually has the smallest reject times. The software can be readily used to test new Learned Bloom Filter proposals, which can be compared with the best ones identified here.
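To make the idea of a Learned Bloom Filter concrete, the following is a minimal Python sketch of the basic construction: a classifier score with a threshold, followed by a small backup Bloom filter that stores the keys the classifier misses, so that no key is ever rejected. It is an illustration only and not the software described in the abstract. The class names, the `score` callable, the threshold, and the toy keys are all hypothetical, and the toy scoring rule merely stands in for a trained classifier.

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter over strings, using k salted SHA-256 hashes."""

    def __init__(self, capacity, fpr):
        capacity = max(1, capacity)
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class LearnedBloomFilter:
    """Classifier plus backup filter (hypothetical sketch).

    `score` is any callable mapping a key to a number; keys scoring at or
    above `threshold` are accepted directly. Keys that the classifier would
    miss are stored in a backup Bloom filter, so there are no false negatives.
    """

    def __init__(self, keys, score, threshold, backup_fpr=0.01):
        self.score = score
        self.threshold = threshold
        missed = [k for k in keys if score(k) < threshold]
        self.backup = BloomFilter(len(missed), backup_fpr)
        for k in missed:
            self.backup.add(k)

    def __contains__(self, key):
        if self.score(key) >= self.threshold:
            return True  # accepted by the classifier (may be a false positive)
        return key in self.backup  # otherwise defer to the backup filter


# Toy stand-in for a trained classifier: scores keys ending in ".evil" highly.
def toy_score(key):
    return 1.0 if key.endswith(".evil") else 0.0


keys = ["a.evil", "b.evil", "c.com"]
lbf = LearnedBloomFilter(keys, toy_score, threshold=0.5)
```

In this sketch every inserted key is reported as present, while a non-key such as `"z.evil"` is also accepted because the toy classifier scores it above the threshold. This is the kind of classifier-induced false positive whose rate, together with space and reject time, the abstract treats as a design constraint.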
