Paper title
Deformation Flow Based Two-Stream Network for Lip Reading
Paper authors
Paper abstract
Lip reading is the task of recognizing speech content by analyzing movements in the lip region while a person is speaking. Motivated by the continuity between adjacent frames during speech, and by the consistency of motion patterns across speakers pronouncing the same phoneme, we model lip movements during speech as a sequence of apparent deformations in the lip region. Specifically, we introduce a Deformation Flow Network (DFN) to learn the deformation flow between adjacent frames, which directly captures the motion information within the lip region. The learned deformation flow is then combined with the original grayscale frames in a two-stream network to perform lip reading. Unlike previous two-stream networks, we train the two branches jointly with a bidirectional knowledge distillation loss, so that the two streams learn from each other during training. Owing to the complementary cues provided by the two branches, the two-stream network shows a substantial improvement over either branch alone. We present a thorough experimental evaluation on two large-scale lip reading benchmarks, with detailed analysis. The results are consistent with our motivation and show that our method achieves state-of-the-art or comparable performance on these two challenging datasets.
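
The abstract does not give the form of the bidirectional knowledge distillation loss, so the following is only a minimal sketch of one common formulation: each branch is trained with cross-entropy on the label plus a KL term that pulls its softened prediction toward the other branch's. The function names, the temperature, and the weighting `kd_weight` are all illustrative assumptions, not the paper's actual definitions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(logits, label):
    """Cross-entropy of one sample against an integer class label."""
    return -math.log(softmax(logits)[label] + 1e-12)


def bidirectional_kd_loss(gray_logits, flow_logits, label,
                          temperature=2.0, kd_weight=1.0):
    """Hypothetical joint loss for a grayscale branch and a flow branch.

    Assumed form: supervised loss on both branches plus a KL term in
    each direction between their temperature-softened predictions.
    In a real training loop the branch acting as teacher in each KL
    term would typically be treated as a constant (no gradient).
    """
    p_gray = softmax(gray_logits, temperature)
    p_flow = softmax(flow_logits, temperature)

    # Grayscale branch learns from the flow branch, and vice versa.
    gray_from_flow = kl_divergence(p_flow, p_gray)
    flow_from_gray = kl_divergence(p_gray, p_flow)

    supervised = cross_entropy(gray_logits, label) + cross_entropy(flow_logits, label)
    return supervised + kd_weight * (gray_from_flow + flow_from_gray)


if __name__ == "__main__":
    # Toy logits over three word classes; values are made up.
    print(bidirectional_kd_loss([2.0, 0.5, -1.0], [0.1, 1.5, 0.3], label=0))
```

When the two branches already agree, both KL terms vanish and the loss reduces to the sum of the two supervised terms, so the distillation term only acts where the branches disagree.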