Paper Title

Audio-visual Multi-channel Recognition of Overlapped Speech

Authors

Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng

Abstract

Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on \textit{TF masking}, \textit{filter\&sum} and \textit{mask-based MVDR} beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolating it with a scale-invariant signal-to-noise ratio (SI-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81\% (26.83\% relative) and 22.22\% (56.87\% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
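As an illustrative sketch only (not the paper's implementation), the two signal-level ingredients named in the abstract — the SI-SNR error cost and mask-based MVDR beamforming — can be written in NumPy roughly as follows. The `mask_based_mvdr` function uses the common Souden-style formulation $\mathbf{w} = \mathbf{R}_n^{-1}\mathbf{R}_s\mathbf{u} / \mathrm{tr}(\mathbf{R}_n^{-1}\mathbf{R}_s)$ with mask-weighted spatial covariances; all function and argument names here are hypothetical.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    # Remove DC offsets; SI-SNR is defined on zero-mean signals
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to isolate the target part
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_ch=0, eps=1e-8):
    """Mask-based MVDR beamformer.

    Y           : (C, F, T) complex multi-channel STFT
    speech_mask : (F, T) TF mask in [0, 1] for the target speaker
    noise_mask  : (F, T) TF mask for interference/noise
    Returns the (F, T) beamformed STFT of the target.
    """
    C, F, T = Y.shape
    X_hat = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[:, f, :]                                           # (C, T)
        # Mask-weighted spatial covariance matrices per frequency bin
        R_s = (speech_mask[f] * Yf) @ Yf.conj().T / (speech_mask[f].sum() + eps)
        R_n = (noise_mask[f] * Yf) @ Yf.conj().T / (noise_mask[f].sum() + eps)
        # Souden MVDR: w = R_n^{-1} R_s u / tr(R_n^{-1} R_s)
        num = np.linalg.solve(R_n + eps * np.eye(C), R_s)
        w = num[:, ref_ch] / (np.trace(num) + eps)                # (C,)
        X_hat[f] = w.conj() @ Yf
    return X_hat
```

In joint fine-tuning of the kind described above, a negated SI-SNR term would be interpolated with the CTC loss (e.g. `loss = ctc - lam * si_snr(...)`); the sketch omits the recognition back-end entirely.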
