Paper title
On the Representation Collapse of Sparse Mixture of Experts
Authors
Abstract
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
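The idea of scoring tokens against experts "on a low-dimensional hypersphere" can be illustrated with a minimal sketch. The abstract does not give the formulation, so everything below is an assumption made for illustration: a linear projection of the token's hidden state into a small routing space, L2 normalization of both the projected token and the expert embeddings, cosine similarity as the score, a fixed temperature, and top-1 selection. All function and parameter names are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so that it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def project(hidden, weight):
    """Linear projection; `weight` has shape (low_dim, hidden_dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def routing_scores(hidden, weight, expert_embeddings, temperature=0.1):
    """Cosine similarity between the projected token and each expert embedding,
    divided by a temperature (assumed fixed here for simplicity)."""
    token = l2_normalize(project(hidden, weight))
    scores = []
    for expert in expert_embeddings:
        expert_unit = l2_normalize(expert)
        cosine = sum(t * e for t, e in zip(token, expert_unit))
        scores.append(cosine / temperature)
    return scores


def route_top1(hidden, weight, expert_embeddings, temperature=0.1):
    """Return the index of the best-matched expert together with all scores."""
    scores = routing_scores(hidden, weight, expert_embeddings, temperature)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Toy setup: hidden size 4, routing space of dimension 2, three experts.
    hidden = [1.0, 0.0, 0.0, 0.0]
    weight = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]]
    experts = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
    best, scores = route_top1(hidden, weight, experts)
    print(best, scores)
```

Because both sides are normalized, each score depends only on the angle between the token and the expert in the routing space, and its magnitude is bounded by `1 / temperature`. The vector lengths of the expert embeddings (2, 3, and 1 in the toy setup) therefore have no effect on which expert is chosen.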