Title
Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval
Authors
Abstract
Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Nowadays, embedding-based retrieval (EBR) has become a promising solution, where deep-learning-based document representation and ANN search techniques are combined to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where lightweight sparse embeddings are indexed and kept in memory for coarse-grained candidate search, while heavyweight dense embeddings are hosted on disk for fine-grained post-verification. To achieve the best retrieval accuracy, a Progressive Optimization framework is designed. The sparse embeddings are learned ahead of time for high-quality candidate search. Conditioned on the candidate distribution induced by the sparse embeddings, the dense embeddings are continuously learned to optimize the discrimination of the ground truth from the shortlisted candidates. In addition, two techniques, contrastive quantization and locality-centric sampling, are introduced for the learning of the sparse and dense embeddings, both of which substantially contribute to their performance. Thanks to the above features, our method effectively handles massive-scale EBR with strong advantages in accuracy: up to +4.3% recall gain on a million-scale corpus, and up to +17.5% recall gain on a billion-scale corpus. Moreover, our method has been applied to a major sponsored search platform, yielding substantial gains in revenue (+1.95%), recall (+1.01%), and CTR (+0.49%). Our code is available at https://github.com/microsoft/BiDR.
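The coarse-search-then-post-verification flow described in the abstract can be sketched in a minimal, self-contained form. This is an illustrative assumption of the two-stage layout only, not the paper's implementation: the dimensions, random toy data, and the `retrieve` helper below are all hypothetical, and the "on-disk" dense store is simulated with an in-memory list.

```python
import heapq
import random

random.seed(0)
NUM_DOCS, SPARSE_DIM, DENSE_DIM = 1000, 8, 64  # toy sizes, not the paper's

def rand_vec(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Lightweight embeddings kept in memory for coarse candidate search,
# and heavyweight dense embeddings (conceptually hosted on disk).
sparse_index = [rand_vec(SPARSE_DIM) for _ in range(NUM_DOCS)]
dense_store = [rand_vec(DENSE_DIM) for _ in range(NUM_DOCS)]

def retrieve(q_sparse, q_dense, shortlist_size=100, top_k=10):
    # Stage 1: coarse-grained candidate search over the in-memory index.
    shortlist = heapq.nlargest(
        shortlist_size, range(NUM_DOCS),
        key=lambda i: dot(sparse_index[i], q_sparse))
    # Stage 2: fine-grained post-verification; only the shortlist's dense
    # embeddings need to be fetched (a disk read in the real system).
    return heapq.nlargest(
        top_k, shortlist, key=lambda i: dot(dense_store[i], q_dense))

q_sparse = rand_vec(SPARSE_DIM)
q_dense = rand_vec(DENSE_DIM)
results = retrieve(q_sparse, q_dense)
```

The point of the split is that only the small sparse index must fit in memory; the expensive dense embeddings are touched for just `shortlist_size` documents per query rather than the whole corpus.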