Paper Title
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
Paper Authors
Paper Abstract
This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled, and augmented data, MixText significantly outperforms current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at https://github.com/GT-SALT/MixText.
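To make the core idea concrete, below is a minimal sketch of mixup-style interpolation in hidden space, the operation underlying TMix. Plain Python lists stand in for a transformer layer's hidden vectors, and the function name `tmix_interpolate` and the default `alpha=0.75` are illustrative assumptions, not the paper's exact implementation (which mixes at a chosen encoder layer and also interpolates the label vectors with the same ratio).

```python
import random

def tmix_interpolate(hidden_a, hidden_b, alpha=0.75):
    """Mixup-style interpolation of two hidden-state vectors.

    hidden_a / hidden_b: lists of floats standing in for the hidden
    representations of two sentences at some encoder layer. The mixing
    ratio lam is drawn from Beta(alpha, alpha), the usual mixup choice;
    alpha=0.75 here is an illustrative default, not the paper's setting.
    """
    lam = random.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # bias the mix toward the first sample
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(hidden_a, hidden_b)]
    return mixed, lam

# Example: mix the (toy) hidden vectors of two sentences.
h1 = [0.2, -1.0, 0.5]
h2 = [1.0, 0.0, -0.5]
mixed, lam = tmix_interpolate(h1, h2)
```

In the full method, the mixed hidden state is passed through the remaining encoder layers, and the training target is the same convex combination of the two samples' label distributions, which is how unlabeled data with guessed labels can be mixed in alongside labeled data.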