Paper Title

Taking Notes on the Fly Helps BERT Pre-training

Authors

Qiyu Wu, Chen Xing, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu

Abstract

How to make unsupervised language pre-training more efficient and less resource-intensive is an important research direction in NLP. In this paper, we focus on improving the efficiency of language pre-training methods by providing better data utilization. It is well known that in language corpora, words follow a heavy-tailed distribution. A large proportion of words appear only a few times, and the embeddings of rare words are usually poorly optimized. We argue that such embeddings carry inadequate semantic signals, which could make data utilization inefficient and slow down pre-training of the entire model. To mitigate this problem, we propose Taking Notes on the Fly (TNF), which takes notes for rare words on the fly during pre-training to help the model understand them when they occur next time. Specifically, TNF maintains a note dictionary and saves a rare word's contextual information in it as notes when the rare word occurs in a sentence. When the same rare word occurs again during training, the note information saved beforehand can be employed to enhance the semantics of the current sentence. By doing so, TNF provides better data utilization, since cross-sentence information is employed to cover the inadequate semantics caused by rare words in the sentences. We implement TNF on both BERT and ELECTRA to check its efficiency and effectiveness. Experimental results show that TNF's training time is $60\%$ less than that of its backbone pre-training models when reaching the same performance. When trained for the same number of iterations, TNF outperforms its backbone methods on most downstream tasks and on the average GLUE score. Source code is attached in the supplementary material.
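The note-dictionary mechanism described in the abstract can be sketched roughly as follows. This is a minimal illustration based only on the abstract, not the authors' implementation: the class name, the EMA update rate `gamma`, the mixing weight `lam`, and mean-pooling over the surrounding context are all illustrative assumptions.

```python
# Minimal sketch of a note dictionary for rare words, assuming an EMA update
# of each note and a simple linear blend with the word's input embedding.
import torch


class NoteDictionary:
    def __init__(self, num_rare_words, hidden_size, gamma=0.1, lam=0.5):
        # One note vector per rare word; gamma and lam are assumed hyper-parameters.
        self.notes = torch.zeros(num_rare_words, hidden_size)
        self.gamma = gamma  # how quickly a note absorbs new context
        self.lam = lam      # how much of the note is mixed into the embedding

    def update(self, rare_word_id, context_vectors):
        # Summarize the rare word's current context (here: mean-pooled vectors of
        # its surrounding tokens) and fold it into the stored note via an
        # exponential moving average, so notes accumulate cross-sentence information.
        summary = context_vectors.mean(dim=0)
        self.notes[rare_word_id] = (
            (1 - self.gamma) * self.notes[rare_word_id] + self.gamma * summary
        )

    def enhance(self, rare_word_id, word_embedding):
        # When the rare word occurs again, blend its saved note into the input
        # embedding so the current sentence carries extra semantic signal.
        return (1 - self.lam) * word_embedding + self.lam * self.notes[rare_word_id]
```

In a pre-training loop, the enhanced vector would presumably replace the rare word's input embedding before the Transformer encoder, while the note is refreshed from the encoder's contextual outputs after each occurrence; the exact placement and update schedule are choices made in the paper, not shown here.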
