Unsupervised Speech Decomposition via Triple Information Bottleneck

Paper Title

Unsupervised Speech Decomposition via Triple Information Bottleneck

Authors

Kaizhi Qian, Yang Zhang, Shiyu Chang, David Cox, Mark Hasegawa-Johnson

Abstract

Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is useful in many speech analysis and generation applications. Recently, state-of-the-art voice conversion systems have led to speech representations that can disentangle speaker-dependent and independent information. However, these systems can only disentangle timbre, while information about pitch, rhythm and content is still mixed together. Further disentangling the remaining speech components is an under-determined problem in the absence of explicit annotations for each component, which are difficult and expensive to obtain. In this paper, we propose SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks. SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch and rhythm without text labels. Our code is publicly available at https://github.com/auspicious3000/SpeechSplit.
