Paper Title

A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading Comprehension

Paper Authors

Jie Cai, Zhengzhou Zhu, Ping Nie, Qian Liu

Paper Abstract

Pre-trained models have brought significant improvements to many NLP tasks and have been extensively analyzed, but little is known about the effect of fine-tuning on specific tasks. Intuitively, people may agree that a pre-trained model has already learned semantic representations of words (e.g. synonyms are close to each other) and that fine-tuning further improves the capabilities that require more complicated reasoning (e.g. coreference resolution, entity boundary detection, etc.). However, verifying these arguments analytically and quantitatively is a challenging task, and few works have focused on this topic. In this paper, inspired by the observation that most probing tasks involve identifying matched pairs of phrases (e.g. coreference requires matching an entity and a pronoun), we propose a pairwise probe to understand BERT fine-tuning on the machine reading comprehension (MRC) task. Specifically, we identify five phenomena in MRC. Using pairwise probing tasks, we compare the performance of each layer's hidden representation of pre-trained and fine-tuned BERT. The proposed pairwise probe alleviates the distraction caused by inaccurate model training and enables a robust, quantitative comparison. Our experimental analysis leads to highly confident conclusions: (1) Fine-tuning has little effect on fundamental, low-level information and general semantic tasks. (2) For the specific abilities required by downstream tasks, fine-tuned BERT outperforms pre-trained BERT, and such gaps become obvious after the fifth layer.
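The layer-by-layer comparison described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' code: it extracts every layer's hidden states from a pre-trained and a fine-tuned BERT with the HuggingFace transformers library and compares the representations of a matched phrase pair (an entity and the pronoun referring to it) via cosine similarity. The fine-tuned checkpoint path, the example sentence, the token spans, and the cosine-similarity scoring are illustrative assumptions rather than the paper's probing setup.

```python
# A minimal sketch, not the authors' implementation: probe how each BERT layer
# represents a matched phrase pair (here an entity and its pronoun) and compare
# a pre-trained checkpoint with a fine-tuned one. The fine-tuned model path,
# example sentence, and token spans are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

def layer_pair_similarities(model_name, text, span_a, span_b):
    """Cosine similarity between two token spans at every transformer layer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # hidden_states: embedding output followed by one tensor per layer
        hidden_states = model(**enc).hidden_states

    def span_vector(layer, start, end):
        # Mean-pool the token vectors of a phrase span (token indices, end exclusive).
        return layer[0, start:end].mean(dim=0)

    sims = []
    for layer in hidden_states[1:]:  # skip the embedding layer
        a = span_vector(layer, *span_a)
        b = span_vector(layer, *span_b)
        sims.append(torch.cosine_similarity(a, b, dim=0).item())
    return sims

# "john" is token 1 and "he" is token 6 after [CLS] for this sentence.
text = "John lost his keys because he was careless."
entity_span, pronoun_span = (1, 2), (6, 7)
for name in ["bert-base-uncased", "path/to/bert-finetuned-on-mrc"]:  # second path is hypothetical
    print(name, layer_pair_similarities(name, text, entity_span, pronoun_span))
```

Plotting the per-layer similarities for both checkpoints gives the kind of layer-wise comparison the paper reports: if the fine-tuned model separates matched pairs better, its similarity curve should diverge from the pre-trained one in the later layers.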
