Speech Enhancement with Zero-Shot Model Selection

Paper Title

Speech Enhancement with Zero-Shot Model Selection

Authors

Zezario, Ryandhimas E., Fuh, Chiou-Shann, Wang, Hsin-Min, Tsao, Yu

Abstract

Recent research on speech enhancement (SE) has seen the emergence of deep-learning-based methods. It is still a challenging task to determine the effective ways to increase the generalizability of SE under diverse test conditions. In this study, we combine zero-shot learning and ensemble learning to propose a zero-shot model selection (ZMOS) approach to increase the generalization of SE performance. The proposed approach is realized in the offline and online phases. The offline phase clusters the entire set of training data into multiple subsets and trains a specialized SE model (termed component SE model) with each subset. The online phase selects the most suitable component SE model to perform the enhancement. Furthermore, two selection strategies were developed: selection based on the quality score (QS) and selection based on the quality embedding (QE). Both QS and QE were obtained using a Quality-Net, a non-intrusive quality assessment network. Experimental results confirmed that the proposed ZMOS approach can achieve better performance in both seen and unseen noise types compared to the baseline systems and other model selection systems, which indicates the effectiveness of the proposed approach in providing robust SE performance.
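
The offline/online split described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: `quality_embedding` is a hypothetical stand-in for the Quality-Net embedding (two crude signal statistics), plain k-means stands in for the clustering step, the component SE models are identity placeholders, and nearest-centroid matching is one plausible reading of QE-based selection.

```python
import numpy as np

def quality_embedding(wave):
    """Hypothetical stand-in for a Quality-Net embedding (assumption, not the paper's network)."""
    wave = np.asarray(wave, dtype=float)
    log_energy = np.log(np.mean(wave ** 2) + 1e-8)
    roughness = np.mean(np.abs(np.diff(wave)))
    return np.array([log_energy, roughness])

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each row of `points`."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means, used here as an assumed clustering method; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            if np.any(labels == j):  # leave a centroid in place if its cluster is empty
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, assign(points, centroids)

def select_component(wave, centroids):
    """Online phase, QE-style (assumed): pick the cluster whose centroid is nearest."""
    emb = quality_embedding(wave)
    return int(assign(emb[None, :], centroids)[0])

# Offline phase: cluster the training utterances by embedding.
# Synthetic data: ten quiet and ten loud noise-like signals.
rng = np.random.default_rng(0)
train = [rng.normal(scale=s, size=1600) for s in [0.01] * 10 + [1.0] * 10]
train_embs = np.stack([quality_embedding(w) for w in train])
centroids, labels = kmeans(train_embs, k=2)

# One component SE model per cluster; identity functions are placeholders for trained models.
component_models = [lambda w: w for _ in range(len(centroids))]

# Online phase: route an unseen utterance to the selected component model.
test_wave = rng.normal(scale=0.01, size=1600)
chosen = select_component(test_wave, centroids)
enhanced = component_models[chosen](test_wave)
```

In this toy setup the unseen quiet utterance is routed to the model associated with the quiet training cluster. A QS-based variant would compare scalar quality scores instead of embeddings; the abstract does not specify the details, so it is not sketched here.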
