Are We There Yet? A Decision Framework for Replacing Term Based Retrieval with Dense Retrieval Systems

Paper Title

Are We There Yet? A Decision Framework for Replacing Term Based Retrieval with Dense Retrieval Systems

Authors

Sebastian Hofstätter, Nick Craswell, Bhaskar Mitra, Hamed Zamani, Allan Hanbury

Abstract

Recently, several dense retrieval (DR) models have demonstrated performance competitive with term-based retrieval, which is ubiquitous in search systems. In contrast to term-based matching, DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search. Deploying a new system, such as DR, inevitably involves tradeoffs in aspects of its performance. Established retrieval systems running at scale are usually well understood in terms of effectiveness and costs, such as query latency, indexing throughput, or storage requirements. In this work, we propose a framework with a set of criteria that go beyond simple effectiveness measures to thoroughly compare two retrieval systems with the explicit goal of assessing the readiness of one system to replace the other. This includes careful tradeoff considerations between effectiveness and various cost factors. Furthermore, we describe guardrail criteria, since even a system that is better on average may have systematic failures on a minority of queries. The guardrails check for failures on certain query characteristics and for novel failure types that are only possible in dense retrieval systems. We demonstrate our decision framework on a Web ranking scenario. In that scenario, state-of-the-art DR models have surprisingly strong results, not only in average performance but also in passing an extensive set of guardrail tests, showing robustness across different query characteristics, lexical matching, generalization, and number of regressions. It is impossible to predict whether DR will become ubiquitous in the future, but one way this could happen is through repeated applications of decision processes such as the one presented here.
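
The guardrail idea in the abstract (a system that is better on average may still fail systematically on a minority of queries) can be illustrated with a minimal Python sketch. This is not the authors' framework: the function name `readiness_report`, the per-slice and regression-rate checks, and both threshold defaults are assumptions chosen only for illustration, and cost factors such as latency or storage are left out entirely.

```python
from statistics import mean


def readiness_report(baseline, candidate, slices,
                     max_slice_drop=0.0, max_regression_rate=0.25):
    """Compare per-query effectiveness scores of two retrieval systems.

    baseline, candidate: dict mapping query id -> score (e.g. nDCG@10)
    slices: dict mapping slice name -> collection of query ids
    Both thresholds are hypothetical defaults, not values from the paper.
    """
    queries = sorted(baseline.keys() & candidate.keys())
    if not queries:
        raise ValueError("the two systems share no evaluated queries")

    # Average effectiveness difference over all shared queries.
    avg_gain = mean(candidate[q] - baseline[q] for q in queries)

    # Guardrail 1: no query slice may drop by more than max_slice_drop.
    failing_slices = []
    for name, ids in slices.items():
        shared = [q for q in ids if q in baseline and q in candidate]
        if not shared:
            continue
        delta = mean(candidate[q] - baseline[q] for q in shared)
        if delta < -max_slice_drop:
            failing_slices.append(name)

    # Guardrail 2: the share of queries that got worse must stay bounded.
    regressions = sum(candidate[q] < baseline[q] for q in queries)
    regression_rate = regressions / len(queries)

    ready = (avg_gain > 0
             and not failing_slices
             and regression_rate <= max_regression_rate)
    return {
        "avg_gain": avg_gain,
        "failing_slices": sorted(failing_slices),
        "regression_rate": regression_rate,
        "ready": ready,
    }
```

In this sketch a candidate system with a positive average gain is still reported as not ready when any query slice regresses beyond the allowed drop, which mirrors the distinction the abstract draws between average effectiveness and guardrail checks.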
