Paper Title
Textual Echo Cancellation
Paper Authors
Paper Abstract
In this paper, we propose Textual Echo Cancellation (TEC), a framework for cancelling the text-to-speech (TTS) playback echo from overlapping speech recordings. Such a system can substantially improve speech recognition performance and user experience for intelligent devices such as smart speakers, because the user can talk to the device while it is still playing the TTS response to the previous query. We implement this system using a novel sequence-to-sequence model with multi-source attention that takes both the microphone mixture signal and the source text of the TTS playback as inputs and predicts the enhanced audio. Experiments show that the textual information of the TTS playback is critical to enhancement performance. Moreover, the text sequence is much smaller than the raw acoustic signal of the TTS playback, and it can be transmitted to the device or ASR server immediately, even before the playback is synthesized. Our proposed approach therefore reduces Internet communication and latency compared with alternative approaches such as acoustic echo cancellation (AEC).
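To give a rough sense of the size gap the abstract refers to, the following sketch compares the payload of a TTS source text with that of the corresponding raw audio. Everything in it is an illustrative assumption rather than a figure from the paper: the sentence, the 4-second playback duration, and the 16 kHz, 16-bit, mono PCM format are hypothetical, as are the helper names `pcm_bytes` and `text_bytes`.

```python
def pcm_bytes(duration_s: float,
              sample_rate_hz: int = 16_000,
              bytes_per_sample: int = 2,
              channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio (assumed format, not from the paper)."""
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


def text_bytes(text: str) -> int:
    """Size in bytes of the text when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical TTS response and an assumed playback duration for it.
tts_text = "The weather in Seattle today is cloudy with a high of 15 degrees."
assumed_duration_s = 4.0

audio_size = pcm_bytes(assumed_duration_s)  # 4 s * 16000 Hz * 2 bytes = 128000 bytes
text_size = text_bytes(tts_text)
ratio = audio_size / text_size

print(f"raw audio: {audio_size} bytes")
print(f"source text: {text_size} bytes")
print(f"audio is about {ratio:.0f}x larger than the text")
```

Under these assumptions the raw audio is on the order of a thousand times larger than the text. A compressed audio codec would narrow the gap, so the ratio should be read as an illustration of scale, not a measurement.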