Paper Title
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
Paper Authors
Paper Abstract
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.
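The two factors the abstract separates, weight initialization and training data order, can each be controlled by its own random seed. Below is a minimal sketch of that separation, assuming PyTorch and the HuggingFace `transformers` library; this is not the authors' released code, and the model name, label count, and batch size are illustrative placeholders.

```python
# Minimal sketch: tie weight initialization and data order to independent seeds.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

def build_trial(train_dataset, init_seed: int, data_seed: int):
    # Weight-initialization seed: controls the randomly initialized
    # classification head added on top of the pretrained BERT encoder.
    torch.manual_seed(init_seed)
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Data-order seed: controls only the shuffling of training examples,
    # via a dedicated generator so it stays independent of init_seed.
    data_generator = torch.Generator().manual_seed(data_seed)
    train_loader = DataLoader(
        train_dataset, batch_size=32, shuffle=True, generator=data_generator
    )
    return model, train_loader
```

Running fine-tuning over a grid of (init_seed, data_seed) pairs, as the paper's experiments do, allows the variance contributed by each factor to be measured separately.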