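Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data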

Paper title

Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data

Authors

Beguš, Gašper; Zhou, Alan

Abstract

Human speakers encode information into raw speech, which is then decoded by listeners. This complex relationship between encoding (production) and decoding (perception) is often modeled separately. Here, we test how encoding and decoding of lexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine the production and perception principles of speech. We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: a network that must learn unique representations for lexical items with no direct access to training data. We train several models (ciwGAN and fiwGAN; arXiv:2006.02951) and test how the networks classify acoustic lexical items in unobserved test data. Strong evidence emerges in favor of lexical learning and of a causal relationship between latent codes and meaningful sublexical units. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data without accessing real training data directly. We propose a technique to explore lexical (holistic) and sublexical (featural) learned representations in the classifier network. The results bear implications for unsupervised speech technology, as well as for unsupervised semantic modeling as language models increasingly bypass text and operate from raw acoustics.
