Paper Title
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
Paper Authors
Paper Abstract
This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled, and augmented data, MixText significantly outperforms current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at https://github.com/GT-SALT/MixText.
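To make the core idea concrete, below is a minimal sketch of mixup-style interpolation in hidden space, the operation underlying TMix. Plain Python lists stand in for a transformer layer's hidden vectors, and the function name `tmix_interpolate` and the default `alpha=0.75` are illustrative assumptions, not the paper's exact implementation (which mixes at a chosen encoder layer and also interpolates the label vectors with the same ratio).

```python
import random

def tmix_interpolate(hidden_a, hidden_b, alpha=0.75):
    """Mixup-style interpolation of two hidden-state vectors.

    hidden_a / hidden_b: lists of floats standing in for the hidden
    representations of two sentences at some encoder layer. The mixing
    ratio lam is drawn from Beta(alpha, alpha), the usual mixup choice;
    alpha=0.75 here is an illustrative default, not the paper's setting.
    """
    lam = random.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # bias the mix toward the first sample
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(hidden_a, hidden_b)]
    return mixed, lam

# Example: mix the (toy) hidden vectors of two sentences.
h1 = [0.2, -1.0, 0.5]
h2 = [1.0, 0.0, -0.5]
mixed, lam = tmix_interpolate(h1, h2)
```

In the full method, the mixed hidden state is passed through the remaining encoder layers, and the training target is the same convex combination of the two samples' label distributions, which is how unlabeled data with guessed labels can be mixed in alongside labeled data.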