Paper Title

AV Taris: Online Audio-Visual Speech Recognition

Paper Authors

George Sterpu, Naomi Harte

Abstract

In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions. In more demanding situations involving distant microphones, overlapped speech, background noise, or natural dialogue structures, the ASR error rate is at least an order of magnitude higher. The visual modality of speech carries the potential to partially overcome these challenges and contribute to the sub-tasks of speaker diarisation, voice activity detection, and the recovery of the place of articulation, and can compensate for up to 15dB of noise on average. This article develops AV Taris, a fully differentiable neural network model capable of decoding audio-visual speech in real time. We achieve this by connecting two recently proposed models for audio-visual speech integration and online speech recognition, namely AV Align and Taris. We evaluate AV Taris under the same conditions as AV Align and Taris on one of the largest publicly available audio-visual speech datasets, LRS2. Our results show that AV Taris is superior to the audio-only variant of Taris, demonstrating the utility of the visual modality to speech recognition within the real time decoding framework defined by Taris. Compared to an equivalent Transformer-based AV Align model that takes advantage of full sentences without meeting the real-time requirement, we report an absolute degradation of approximately 3% with AV Taris. As opposed to the more popular alternative for online speech recognition, namely the RNN Transducer, Taris offers a greatly simplified fully differentiable training pipeline. As a consequence, AV Taris has the potential to popularise the adoption of Audio-Visual Speech Recognition (AVSR) technology and overcome the inherent limitations of the audio modality in less optimal listening conditions.
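The "real time decoding framework defined by Taris" in the abstract refers to Taris's online segmentation mechanism, which estimates a running word count from the input frames and emits output incrementally rather than waiting for the full sentence. The toy sketch below illustrates that counting idea only: it is not the authors' implementation, and the function name `taris_style_segments` and the example boundary probabilities are hypothetical. Each frame contributes a soft word-boundary probability; whenever the cumulative count crosses the next integer, the frames accumulated so far are closed off as a decodable segment.

```python
def taris_style_segments(boundary_probs):
    """Toy illustration of Taris-style online segmentation (not the
    authors' code): accumulate per-frame soft word-boundary probabilities
    and close a segment each time the running count crosses an integer."""
    segments = []
    running_count = 0.0
    start = 0          # first frame of the current open segment
    next_word = 1      # integer threshold that triggers the next segment
    for t, p in enumerate(boundary_probs):
        running_count += p
        if running_count >= next_word:
            segments.append((start, t + 1))  # half-open frame range
            start = t + 1
            next_word += 1
    return segments

# Hypothetical per-frame boundary probabilities for a short utterance:
print(taris_style_segments([0.2, 0.3, 0.6, 0.1, 0.4, 0.7, 0.3]))
# → [(0, 3), (3, 6)]
```

Because segments are closed as soon as the count crosses a threshold, decoding can proceed while later frames are still arriving, which is what distinguishes this style of model from offline full-sentence attention models such as the Transformer-based AV Align baseline mentioned above.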
