Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks

Paper title

Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks

Authors

Ni, Zhaoheng; Grezes, Felix; Trinh, Viet Anh; Mandel, Michael I.

Abstract

Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations, but have difficulty incorporating a detailed speech/noise model. In contrast, LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings. This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method. By doing so, it attains both the spatial separation performance and generality of multi-channel spatial clustering and the signal modeling performance of multiple parallel single-channel LSTM speech enhancers. Our experiments show that when our system is applied to the CHiME-3 dataset of noisy tablet recordings, it increases speech quality as measured by the Perceptual Evaluation of Speech Quality (PESQ) algorithm and reduces the word error rate of the baseline CHiME-3 speech recognizer, as compared to the default BeamformIt beamformer.
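
The abstract does not give the beamformer equations, so the following is only a minimal sketch of how a time-frequency speech mask can drive an MVDR beamformer in general. It assumes a generic mask-weighted covariance estimate and the reference-microphone MVDR solution w = (Phi_n^-1 Phi_s) e_ref / trace(Phi_n^-1 Phi_s), which may differ from the formulation used in the paper. The function name `mask_based_mvdr`, its parameters, and the diagonal-loading constant are hypothetical; in the paper's setting the mask would come from MESSL after LSTM cleaning, whereas here it is simply an input array.

```python
import numpy as np


def mask_based_mvdr(stft, speech_mask, ref_mic=0, eps=1e-6):
    """Generic mask-driven MVDR sketch (not the paper's exact method).

    stft:        complex array, shape (channels, frames, freqs)
    speech_mask: real array in [0, 1], shape (frames, freqs)
    Returns the beamformed STFT, shape (frames, freqs).
    """
    n_ch, n_frames, n_freq = stft.shape
    noise_mask = 1.0 - speech_mask
    out = np.empty((n_frames, n_freq), dtype=complex)

    for f in range(n_freq):
        x = stft[:, :, f]                # (channels, frames)
        ms = speech_mask[:, f]           # (frames,)
        mn = noise_mask[:, f]

        # Mask-weighted spatial covariances: sum_t m_t x_t x_t^H / sum_t m_t
        phi_s = (ms * x) @ x.conj().T / (ms.sum() + eps)
        phi_n = (mn * x) @ x.conj().T / (mn.sum() + eps)

        # Assumed regularization: small diagonal loading for invertibility
        load = eps * (np.trace(phi_n).real / n_ch + eps)
        phi_n = phi_n + load * np.eye(n_ch)

        # Solve Phi_n^-1 Phi_s without forming an explicit inverse
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref_mic] / (np.trace(num) + eps)

        # Apply the filter: y_t = w^H x_t
        out[:, f] = w.conj() @ x

    return out
```

Under these assumptions, a cleaner mask gives cleaner estimates of the speech and noise covariances, which is the role the abstract assigns to the LSTM speech models.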
