Paper Title
Towards Understanding Label Smoothing
Paper Authors
Paper Abstract
Label smoothing regularization (LSR) has achieved great success in training deep neural networks with stochastic algorithms such as stochastic gradient descent and its variants. However, a theoretical understanding of its power from the perspective of optimization is still rare. This study opens the door to a deeper understanding of LSR by initiating such an analysis. In this paper, we analyze the convergence behavior of stochastic gradient descent with label smoothing regularization for solving non-convex problems and show that an appropriate LSR can help speed up convergence by reducing the variance. More interestingly, we propose a simple yet effective strategy, the Two-Stage LAbel smoothing algorithm (TSLA), which uses LSR in the early training epochs and drops it in the later training epochs. We observe from the improved convergence result of TSLA that it benefits from LSR in the first stage and essentially converges faster in the second stage. To the best of our knowledge, this is the first work to understand the power of LSR by establishing the convergence complexity of stochastic methods with LSR in non-convex optimization. We empirically demonstrate the effectiveness of the proposed method in comparison with baselines for training ResNet models on benchmark data sets.
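To illustrate the two-stage strategy described in the abstract, below is a minimal training-loop sketch, not the authors' implementation. It assumes PyTorch (whose nn.CrossEntropyLoss accepts a label_smoothing argument since version 1.10); the function name train_tsla, the switch_epoch parameter, and the hyperparameter values are illustrative assumptions.

```python
# Minimal sketch of the two-stage idea behind TSLA (not the authors' code).
# Stage 1: train with label-smoothed cross-entropy; Stage 2: drop smoothing.
# `model`, `train_loader`, `total_epochs`, and `switch_epoch` are hypothetical
# names chosen for illustration.
import torch
import torch.nn as nn


def train_tsla(model, train_loader, total_epochs=90, switch_epoch=60,
               smoothing=0.1, lr=0.1, device="cpu"):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # label_smoothing is supported by nn.CrossEntropyLoss in PyTorch >= 1.10.
    smoothed_loss = nn.CrossEntropyLoss(label_smoothing=smoothing)
    plain_loss = nn.CrossEntropyLoss()

    for epoch in range(total_epochs):
        # Stage 1 uses LSR; Stage 2 (epoch >= switch_epoch) drops it.
        criterion = smoothed_loss if epoch < switch_epoch else plain_loss
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
```

The only design choice here is when to switch: the first stage benefits from the variance reduction of LSR, while the second stage trains on the hard labels alone, matching the abstract's description of dropping LSR in the later epochs.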