Paper Title
Attention and Encoder-Decoder based models for transforming articulatory movements at different speaking rates
Paper Authors
Paper Abstract
While speaking at different rates, articulators (such as the tongue and lips) tend to move differently and the enunciations also have different durations. In the past, affine transformation and DNN-based approaches have been used to transform articulatory movements from neutral to fast (N2F) and neutral to slow (N2S) speaking rates [1]. In this work, we improve over the existing transformation techniques by modeling rate-specific durations and their transformation using AstNet, an encoder-decoder framework with attention. In the current work, we propose an encoder-decoder architecture using LSTMs which generates smoother predicted articulatory trajectories. To model duration variations across speaking rates, we deploy an attention network, which eliminates the need to align trajectories at different rates using DTW. We perform a phoneme-specific duration analysis to examine how well duration is transformed using the proposed AstNet. As the range of articulatory motion is correlated with speaking rate, we also analyze the amplitude of the transformed articulatory movements at different rates compared to their original counterparts, to examine how well the proposed AstNet predicts the extent of articulatory movements in N2F and N2S. We observe that AstNet models both the duration and extent of articulatory movements better than the existing transformation techniques, resulting in more accurate transformed articulatory trajectories.
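To make the described architecture concrete, the following is a minimal PyTorch sketch (not the authors' code) of an LSTM encoder-decoder with dot-product attention that maps a neutral-rate articulatory trajectory to a trajectory of a different length, so no DTW alignment of source and target frames is required. Feature dimension, hidden size, attention form, and all names here are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn


class AstNetSketch(nn.Module):
    # Hypothetical sizes: 12 articulatory features per frame, 128 hidden units.
    def __init__(self, feat_dim=12, hidden=128):
        super().__init__()
        # Bidirectional LSTM encoder over the neutral-rate trajectory.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Decoder cell consumes the previous output frame plus the attention context.
        self.decoder_cell = nn.LSTMCell(feat_dim + 2 * hidden, hidden)
        # Projects the decoder state into the encoder space for dot-product attention.
        self.attn_proj = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden + 2 * hidden, feat_dim)

    def forward(self, src, tgt_len):
        # src: (batch, T_src, feat_dim) neutral-rate trajectory.
        enc_out, _ = self.encoder(src)                      # (batch, T_src, 2*hidden)
        batch = src.size(0)
        h = src.new_zeros(batch, self.decoder_cell.hidden_size)
        c = src.new_zeros(batch, self.decoder_cell.hidden_size)
        prev = src.new_zeros(batch, src.size(2))            # previous predicted frame
        outputs = []
        for _ in range(tgt_len):
            # Attention over encoder frames replaces explicit DTW alignment:
            query = self.attn_proj(h).unsqueeze(2)          # (batch, 2*hidden, 1)
            scores = torch.bmm(enc_out, query).squeeze(2)   # (batch, T_src)
            weights = torch.softmax(scores, dim=1)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)
            h, c = self.decoder_cell(torch.cat([prev, context], dim=1), (h, c))
            frame = self.out(torch.cat([h, context], dim=1))
            outputs.append(frame)
            prev = frame
        return torch.stack(outputs, dim=1)                  # (batch, tgt_len, feat_dim)


# Example: transform a 100-frame neutral trajectory into an 80-frame fast-rate one.
model = AstNetSketch()
neutral = torch.randn(1, 100, 12)
fast_pred = model(neutral, tgt_len=80)
print(fast_pred.shape)  # torch.Size([1, 80, 12])

Because the decoder runs for the target number of frames and attends softly over all source frames, duration differences between speaking rates are absorbed by the attention weights rather than by a pre-computed frame alignment, which is the role the abstract attributes to the attention network.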