LaFurca: Iterative Refined Speech Separation Based on Context-Aware Dual-Path Parallel Bi-LSTM

Paper title

LaFurca: Iterative Refined Speech Separation Based on Context-Aware Dual-Path Parallel Bi-LSTM

Authors

Ziqiang Shi, Rujie Liu, Jiqing Han

Abstract

Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual}. In this paper, we propose several improvements to dual-path BiLSTM based networks for end-to-end monaural speech separation. First, a dual-path network with intra-parallel BiLSTM and inter-parallel BiLSTM components is introduced to reduce performance sub-variances among different branches. Second, we propose a global context-aware inter-intra cross-parallel BiLSTM to further perceive global contextual information. Finally, a spiral multi-stage dual-path BiLSTM is proposed to iteratively refine the separation results of the previous stages. All these networks take the mixed utterance of two speakers and map it to two separate utterances, each containing only one speaker's voice. For the objective, we propose to train the network by directly optimizing the utterance-level scale-invariant signal-to-distortion ratio (SI-SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix corpus result in 20.55 dB SDR improvement, 20.35 dB SI-SDR improvement, a PESQ of 3.69, and an ESTOI of 94.86%, which shows that the proposed networks can improve performance on the speaker separation task. We have open-sourced our re-implementation of DPRNN-TasNet at https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation. LaFurca is built on this implementation of DPRNN-TasNet, so we believe the results in this paper can be reproduced with ease.
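The training objective named in the abstract, utterance-level SI-SDR optimized in a PIT style, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `si_sdr` and `pit_si_sdr` are hypothetical, the SI-SDR formula is the commonly used definition (zero-mean signals, estimate projected onto the target), and an actual training loop would compute the same quantity in a differentiable framework and minimize its negative.

```python
import itertools

import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for one 1-D estimate/target pair."""
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the projection is the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection counts as distortion.
    distortion = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (
        np.dot(distortion, distortion) + eps
    )
    return 10.0 * np.log10(ratio)


def pit_si_sdr(estimates, targets):
    """Best mean SI-SDR over all speaker permutations.

    estimates, targets: arrays of shape (num_speakers, num_samples).
    Returns (best_score_db, best_perm), where best_perm[i] is the index of
    the estimate assigned to target i.
    """
    num_speakers = targets.shape[0]
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        score = np.mean(
            [si_sdr(estimates[p], targets[i]) for i, p in enumerate(perm)]
        )
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

For the two-speaker case described in the abstract, the search covers only two permutations; a training loss under these assumptions would be the negative of the returned score, averaged over the utterances in a batch.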
