Paper Title
Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification
Paper Authors
Paper Abstract
Modern deep neural networks (DNNs) often require high memory consumption and large computational loads. To deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms has been explored, including factorization methods. Factorization methods approximate the weight matrix of a DNN layer with the product of two or more low-rank matrices. However, it is hard to measure the ranks of DNN layers during the training process. Previous works mainly induce low rank through implicit approximations or via a costly singular value decomposition (SVD) process at every training step. The former approach usually incurs a high accuracy loss, while the latter is inefficient. In this work, we propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD at every step. SVD training first decomposes each layer into the form of its full-rank SVD, then performs training directly on the decomposed weights. We add orthogonality regularization to the singular vectors, which ensures a valid SVD form and avoids gradient vanishing/exploding. Low rank is encouraged by applying sparsity-inducing regularizers on the singular values of each layer. Singular value pruning is applied at the end to explicitly reach a low-rank model. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a higher reduction in computational load at the same accuracy, compared with not only previous factorization methods but also state-of-the-art filter pruning methods.
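Below is a minimal, hypothetical PyTorch-style sketch of the training scheme described in the abstract: a linear layer stored in full-rank SVD form, an orthogonality penalty on the singular-vector matrices, a sparsity-inducing (L1) penalty on the singular values, and a final pruning step. The class and function names (`SVDLinear`, `orthogonality_loss`, `sparsity_loss`, `prune_singular_values`) and the pruning threshold are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of SVD training for one linear layer (not the authors' code).
import torch
import torch.nn as nn


class SVDLinear(nn.Module):
    """Linear layer stored in full-rank SVD form: W = U diag(s) V^T."""

    def __init__(self, in_features, out_features):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w)
        u, s, vt = torch.linalg.svd(w, full_matrices=False)
        self.U = nn.Parameter(u)        # (out_features, r), r = min(in, out)
        self.s = nn.Parameter(s)        # (r,) singular values, trained directly
        self.V = nn.Parameter(vt.t())   # (in_features, r)

    def forward(self, x):
        # x @ W^T = x @ V diag(s) U^T, so the full weight is never re-assembled.
        return ((x @ self.V) * self.s) @ self.U.t()


def orthogonality_loss(layer):
    """Penalize deviation of U and V from column orthogonality (keeps a valid SVD form)."""
    loss = 0.0
    for m in (layer.U, layer.V):
        r = m.shape[1]
        gram = m.t() @ m
        loss = loss + ((gram - torch.eye(r, device=m.device)) ** 2).sum()
    return loss


def sparsity_loss(layer):
    """Sparsity-inducing (L1) regularizer on the singular values to encourage low rank."""
    return layer.s.abs().sum()


@torch.no_grad()
def prune_singular_values(layer, threshold=1e-2):
    """After training, drop small singular values and their vectors to reach an explicit low-rank layer."""
    keep = layer.s.abs() > threshold
    layer.U = nn.Parameter(layer.U[:, keep])
    layer.s = nn.Parameter(layer.s[keep])
    layer.V = nn.Parameter(layer.V[:, keep])
```

In a training loop, the total objective would be the task loss plus weighted `orthogonality_loss` and `sparsity_loss` terms summed over all decomposed layers, with `prune_singular_values` applied once at the end; the regularization weights and threshold here are placeholders rather than values reported in the paper.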