TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Paper Title

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Authors

Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract

We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named as MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge.
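
The abstract reports results as an improvement in scale-invariant signal-to-distortion ratio (SI-SDR). As a rough illustration of what that number measures, the following is a minimal NumPy sketch of the commonly used SI-SDR definition and of the improvement relative to the unprocessed mixture. It is not the authors' evaluation code; the function names `si_sdr` and `si_sdr_improvement`, the `eps` stabilizer, and the mean removal step are assumptions made for this example.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (hypothetical helper)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a constant offset does not affect the score (assumed convention).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything in the estimate that is not explained by the scaled reference.
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Under this definition, rescaling the estimate by any nonzero constant leaves the score unchanged, which is why the metric is called scale-invariant. An estimate that attenuates the interfering speaker by a factor of 10 in amplitude would show an improvement of roughly 20 dB over the mixture.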
