Paper Title

BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT

Paper Authors

Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, Hung-Yi Lee

Paper Abstract

Although Bidirectional Encoder Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks, it remains a black box. A variety of previous works have tried to lift the veil of BERT and understand each layer's functionality. In this paper, we found that surprisingly the output layer of BERT can reconstruct the input sentence by directly taking each layer of BERT as input, even though the output layer has never seen the input other than the final hidden layer. This fact remains true across a wide variety of BERT-based models, even when some layers are duplicated. Based on this observation, we propose a quite simple method to boost the performance of BERT. By duplicating some layers in the BERT-based models to make it deeper (no extra training required in this step), they obtain better performance in the downstream tasks after fine-tuning.
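
To make the abstract's two points concrete, below is a minimal sketch assuming the Hugging Face transformers BertForMaskedLM API: it feeds an intermediate hidden layer to BERT's output (masked-language-modeling) head to see how well the input sentence is reconstructed, and it duplicates an encoder layer to make the model deeper before fine-tuning. The layer indices and the example sentence are illustrative assumptions, not the paper's exact setup.

```python
import copy

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Phenomenon: feed an intermediate hidden layer to the output (MLM) head,
# which during pre-training only ever saw the final hidden layer.
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model.bert(**inputs, output_hidden_states=True).hidden_states
    # hidden_states[0] is the embedding output; [1..12] are the encoder layers.
    intermediate = hidden_states[6]        # an arbitrary middle layer (illustrative choice)
    logits = model.cls(intermediate)       # BERT's output layer applied to that layer
tokens = tokenizer.convert_ids_to_tokens(logits.argmax(dim=-1)[0].tolist())
print(tokens)  # inspect how much of the input sentence is recovered

# Method: duplicate an encoder layer to deepen the model (no training in this step),
# then fine-tune the deepened model on the downstream task as usual.
layers = model.bert.encoder.layer              # nn.ModuleList of BertLayer modules
layers.insert(7, copy.deepcopy(layers[6]))     # copy layer 6 and insert it after itself
model.config.num_hidden_layers = len(layers)   # keep the config consistent
```

The point the sketch tries to preserve from the abstract is that the duplication step itself involves no training; all adaptation happens in the subsequent fine-tuning.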
