Title
Towards Automatic Face-to-Face Translation
Authors
Abstract
In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact on multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language processing. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, which generates realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it significantly improves the overall user experience for consuming and interacting with multimodal content across languages. Code, models, and a demo video are publicly available.
Demo video: https://www.youtube.com/watch?v=aHG6Oei8jF0
Code and models: https://github.com/Rudrabha/LipGAN
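To make the pipeline structure concrete, the following is a minimal Python sketch of the stages the abstract describes: speech recognition, machine translation, speech synthesis, and LipGAN-based lip synchronization. Every function name and signature below is a hypothetical placeholder for illustration, not the authors' actual API; the real implementation lives in the linked repository.

# Hypothetical sketch of the face-to-face translation pipeline.
# None of these functions exist in the LipGAN repository under these
# names; each stands in for one stage of the described system.

def recognize_speech(audio_path: str, source_lang: str) -> str:
    """ASR stage: transcribe the source-language speech to text."""
    ...

def translate_text(text: str, source_lang: str, target_lang: str) -> str:
    """MT stage: translate the transcript into the target language."""
    ...

def synthesize_speech(text: str, target_lang: str) -> str:
    """TTS stage: synthesize target-language speech; returns an audio path."""
    ...

def lipgan_sync(video_path: str, audio_path: str) -> str:
    """Visual stage: re-render the speaker's face so the lip movements
    match the translated audio; returns the output video path."""
    ...

def face_to_face_translate(video_path: str, audio_path: str,
                           source_lang: str, target_lang: str) -> str:
    # Chain the four stages: ASR -> MT -> TTS -> lip synchronization.
    transcript = recognize_speech(audio_path, source_lang)
    translation = translate_text(transcript, source_lang, target_lang)
    translated_audio = synthesize_speech(translation, target_lang)
    return lipgan_sync(video_path, translated_audio)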