Paper Title

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision

Paper Authors

Chen, Mayee F., Fu, Daniel Y., Adila, Dyah, Zhang, Michael, Sala, Frederic, Fatahalian, Kayvon, Ré, Christopher

Paper Abstract

Foundation models offer an exciting new paradigm for constructing models with out-of-the-box embeddings and a few labeled examples. However, it is not clear how to best apply foundation models without labeled data. A potential approach is to fuse foundation models with weak supervision frameworks, which use weak label sources -- pre-trained models, heuristics, crowd-workers -- to construct pseudolabels. The challenge is building a combination that best exploits the signal available in both foundation models and weak sources. We propose Liger, a combination that uses foundation model embeddings to improve two crucial elements of existing weak supervision techniques. First, we produce finer estimates of weak source quality by partitioning the embedding space and learning per-part source accuracies. Second, we improve source coverage by extending source votes in embedding space. Despite the black-box nature of foundation models, we prove results characterizing how our approach improves performance and show that lift scales with the smoothness of label distributions in embedding space. On six benchmark NLP and video tasks, Liger outperforms vanilla weak supervision by 14.1 points, weakly-supervised kNN and adapters by 11.8 points, and kNN and adapters supervised by traditional hand labels by 7.2 points.
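The abstract describes two mechanisms: (1) partitioning the embedding space and learning per-part source accuracies, and (2) extending each weak source's votes to nearby abstaining points in embedding space. The toy sketch below illustrates both ideas on synthetic 1-D embeddings with three noisy binary sources. It is a minimal illustration, not the paper's method: the partition, the radius, and the small labeled split used to estimate per-part accuracies are all stand-ins (Liger estimates accuracies without labels via a latent label model).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1-D embeddings, binary labels in {-1, +1}, 3 weak sources.
# Source votes are in {-1, 0, +1}; 0 means the source abstains.
n = 200
emb = rng.uniform(-1, 1, size=(n, 1))
y = np.where(emb[:, 0] > 0, 1, -1)

def noisy_source(flip_p, abstain_p):
    """A weak source: flips the true label with prob flip_p, abstains with prob abstain_p."""
    v = y.copy()
    flip = rng.random(n) < flip_p
    v[flip] = -v[flip]
    v[rng.random(n) < abstain_p] = 0
    return v

votes = np.stack([noisy_source(0.1, 0.3),
                  noisy_source(0.2, 0.3),
                  noisy_source(0.3, 0.3)], axis=1)

# Idea 1: partition the embedding space and estimate per-part source
# accuracies. Here we cheat with a small labeled split to keep the sketch
# short; Liger recovers these accuracies without labels.
parts = (emb[:, 0] > 0).astype(int)           # two halves of the space
dev = rng.choice(n, 40, replace=False)        # hypothetical small dev split

def part_weights(part):
    """Log-odds weight per source, estimated within one partition cell."""
    w = np.ones(votes.shape[1])
    for j in range(votes.shape[1]):
        idx = dev[(parts[dev] == part) & (votes[dev, j] != 0)]
        if len(idx):
            acc = np.clip(np.mean(votes[idx, j] == y[idx]), 1e-3, 1 - 1e-3)
            w[j] = np.log(acc / (1 - acc))
    return w

# Idea 2: extend votes -- copy a source's nearest vote to abstaining
# points that lie within a radius in embedding space.
def extend(votes, emb, radius=0.15):
    ext = votes.copy()
    for j in range(votes.shape[1]):
        has = np.flatnonzero(votes[:, j] != 0)
        for i in np.flatnonzero(votes[:, j] == 0):
            d = np.abs(emb[has, 0] - emb[i, 0])
            k = d.argmin()
            if d[k] <= radius:
                ext[i, j] = votes[has[k], j]
    return ext

ext = extend(votes, emb)

# Combine: per-part weighted vote over the extended sources.
pred = np.array([np.sign(ext[i] @ part_weights(parts[i])) for i in range(n)])
pred[pred == 0] = 1
print("pseudolabel accuracy:", np.mean(pred == y))
```

Vote extension raises coverage (fewer abstains), and the per-part weights let an accurate-in-one-region source dominate there, which is the source of the lift the abstract attributes to smooth label distributions in embedding space.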
