Title

Streaming parallel transducer beam search with fast-slow cascaded encoders

Authors

Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L. Seltzer

Abstract

Streaming ASR with strict latency constraints is required in many speech recognition applications. To achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models due to the lack of future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders. This work improves upon this cascaded-encoders framework by leveraging two streaming non-causal encoders with variable input context sizes that can produce outputs at different audio intervals (e.g., fast and slow). We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from the fast-slow encoders, where the slow encoder corrects mistakes generated by the fast encoder. The proposed algorithm achieves up to 20% WER reduction with a slight increase in token emission delays on the public LibriSpeech dataset and on in-house datasets. We also explore techniques to reduce computation by distributing processing between the fast and slow encoders. Lastly, we explore sharing parameters in the fast encoder to reduce the memory footprint. This enables low-latency processing on edge devices with low computation cost and a low memory footprint.
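The fast-slow decoding idea in the abstract can be illustrated with a toy sketch: a time-synchronous beam search extends hypothesis alignments frame by frame using the fast encoder's outputs, and whenever the slow encoder finishes a chunk, its refined log-probabilities replace the fast ones for those frames and the beam is re-ranked (the "slow corrects fast" step). This is a minimal illustration under assumed interfaces, not the paper's actual algorithm: the function name, the per-frame alignment representation, and the use of token 0 as the transducer blank are all simplifying assumptions.

```python
BLANK = 0  # token 0 stands in for the transducer blank in this toy

def fast_slow_beam_search(fast_frames, slow_chunks, chunk_size, beam_size=4):
    """Toy time-synchronous search over fast and slow encoder streams.

    fast_frames: per-frame log-prob vectors from the fast encoder.
    slow_chunks: slow_chunks[i] holds refined log-prob vectors for
                 frames [i * chunk_size, (i + 1) * chunk_size).
    """
    frames = []          # best available log-probs seen so far, per frame
    beams = [((), 0.0)]  # hypotheses as (frame alignment, score)
    for t, fast_lp in enumerate(fast_frames):
        frames.append(list(fast_lp))
        # extend every hypothesis by one frame using the fast output
        cands = [(align + (tok,), score + fast_lp[tok])
                 for align, score in beams
                 for tok in range(len(fast_lp))]
        cands.sort(key=lambda c: c[1], reverse=True)
        beams = cands[:beam_size]
        # when the slow encoder finishes a chunk, overwrite the fast
        # log-probs for those frames and re-rank the beam: the slow
        # output corrects mistakes scored highly by the fast encoder
        chunk_idx = (t + 1) // chunk_size - 1
        if (t + 1) % chunk_size == 0 and 0 <= chunk_idx < len(slow_chunks):
            for i, slow_lp in enumerate(slow_chunks[chunk_idx]):
                frames[chunk_idx * chunk_size + i] = list(slow_lp)
            beams = sorted(
                ((align, sum(frames[i][tok] for i, tok in enumerate(align)))
                 for align, _ in beams),
                key=lambda c: c[1], reverse=True)
    best_align, best_score = beams[0]
    # collapse the alignment to a label sequence by dropping blanks
    return [tok for tok in best_align if tok != BLANK], best_score
```

Re-ranking the surviving beam under the slow scores is a deliberate simplification; the paper's parallel search decodes from both encoders jointly, but the sketch shows why a delayed, larger-context encoder can fix early fast-encoder errors without restarting the search.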
