论文标题
WENET 2.0:更有生产力的端到端语音识别工具包
WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit
论文作者
论文摘要
最近,我们提供了一种面向生产的端到端语音识别工具包Wenet,该工具包引入了统一的两通道(U2)框架和内置运行时,以解决单个模型中的流和非流传输模式。为了进一步提高ASR性能并促进各种生产要求,在本文中,我们提出了Wenet 2.0,并提供四个重要的更新。 (1)我们提出了一个带有双向注意解码器的统一的两次通行框架U2 ++,其中包括未来的上下文信息,通过向右的注意力解码器提高共享编码器和在夺回阶段的性能的代表性能力。 (2)我们将基于N-Gram的语言模型和基于WFST的解码器引入WENET 2.0,从而促进了在生产方案中使用丰富的文本数据。 (3)我们设计了一个统一的上下文偏见框架,该框架利用特定用户的上下文(例如联系人列表)为生产提供快速适应能力,并提高了使用LM和没有LM场景的ASR准确性。 (4)我们设计了一个统一的IO,以支持大规模数据进行有效的模型培训。总而言之,与原始WENET相比,全新的WENET 2.0可实现高达10 \%的相对识别性能改善,并提供几个重要的面向生产的功能。
Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10\% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.