Title

Few-shot Mining of Naturally Occurring Inputs and Outputs

Authors

Mandar Joshi, Terra Blevins, Mike Lewis, Daniel S. Weld, Luke Zettlemoyer

Abstract

Creating labeled natural language training data is expensive and requires significant human effort. We mine input-output examples from large corpora using a supervised mining function trained on a small seed set of only 100 examples. The mining consists of two stages -- (1) a bi-encoder-based, recall-oriented dense search that pairs inputs with potential outputs, and (2) a cross-encoder-based filter that re-ranks the output of the bi-encoder stage for better precision. Unlike model-generated data augmentation, our method mines naturally occurring, high-quality input-output pairs that mimic the style of the seed set for multiple tasks. On SQuAD-style reading comprehension, augmenting the seed set with the mined data yields an improvement of 13 F1 over a BART-large baseline fine-tuned only on the seed set. Likewise, we see an improvement of 1.46 ROUGE-L on XSum abstractive summarization.
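
The two-stage retrieve-then-filter pipeline described in the abstract is straightforward to prototype. Below is a minimal sketch of bi-encoder dense search followed by cross-encoder re-ranking, using off-the-shelf sentence-transformers checkpoints as stand-ins for the paper's seed-trained models; the model names, `top_k`, and the score threshold are illustrative assumptions, not the paper's actual setup.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Off-the-shelf checkpoints as stand-ins for the paper's seed-trained
# bi-encoder and cross-encoder (model choices are assumptions).
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def mine_pairs(inputs, candidate_outputs, top_k=50, threshold=0.5):
    """Stage 1: recall-oriented dense search; stage 2: precision-oriented filter."""
    # Embed the candidate output pool once with the bi-encoder.
    output_emb = bi_encoder.encode(candidate_outputs, convert_to_tensor=True)
    mined = []
    for text in inputs:
        # Stage 1: retrieve the top_k potential outputs for this input.
        query_emb = bi_encoder.encode(text, convert_to_tensor=True)
        hits = util.semantic_search(query_emb, output_emb, top_k=top_k)[0]
        # Stage 2: jointly score each (input, output) pair with the
        # cross-encoder and keep only high-scoring matches.
        pairs = [(text, candidate_outputs[h["corpus_id"]]) for h in hits]
        scores = cross_encoder.predict(pairs)
        mined.extend(p for p, s in zip(pairs, scores) if s >= threshold)
    return mined
```

The split mirrors the recall/precision trade-off in the abstract: candidate outputs are embedded once, so the cheap bi-encoder search scales to large corpora, while the expensive cross-encoder only scores the short retrieved list.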
