Paper Title

Understanding the Difficulty of Training Transformers

Paper Authors

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Jiawei Han

Paper Abstract

Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts in carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially -- for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize the early stage's training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic.
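The core idea described in the abstract -- controlling how strongly each layer depends on its residual branch -- can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering of an Admin-style residual block (not the released implementation); the class name `AdminStyleResidual` and the parameter `omega_init` are illustrative assumptions, standing in for the per-layer shortcut scale that Admin initializes from profiled output variance.

```python
# Minimal sketch (an assumption, not the released Admin code) of rescaling a
# residual connection so the branch's early-stage contribution stays bounded.
import torch
import torch.nn as nn


class AdminStyleResidual(nn.Module):
    """Residual block computing LayerNorm(omega * x + f(x)).

    `omega` is a per-feature scale on the shortcut. Initializing it so the
    shortcut dominates at the start of training limits how much a small
    parameter update in `sublayer` can perturb the block output, which is
    the stabilization effect the paper attributes to Admin.
    """

    def __init__(self, sublayer: nn.Module, d_model: int, omega_init: float = 1.0):
        super().__init__()
        self.sublayer = sublayer  # e.g., a self-attention or feed-forward sub-layer
        self.omega = nn.Parameter(torch.full((d_model,), omega_init))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shortcut is rescaled by omega before merging with the branch
        # output, moderating the layer's dependency on its residual branch.
        return self.norm(self.omega * x + self.sublayer(x))


if __name__ == "__main__":
    block = AdminStyleResidual(nn.Linear(16, 16), d_model=16, omega_init=2.0)
    out = block(torch.randn(4, 16))
    print(out.shape)  # torch.Size([4, 16])
```

In the paper's formulation, the scale is set during a brief profiling pass over a few batches rather than hand-picked as in this sketch; after the early stage, training can rely more heavily on the residual branches and realize the model's full capacity.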
