Paper title
Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech
Authors
Abstract
The language acquisition literature shows that children do not build their lexicon by segmenting the spoken input into phonemes and then building up words from them. Instead, they adopt a top-down approach: they first segment word-like units and then break them down into smaller units. This suggests that the ideal way of learning a language is to start from full semantic units. In this paper, we investigate whether this is also the case for a neural model of Visually Grounded Speech trained on a speech-image retrieval task. We evaluate how well such a network is able to learn a reliable speech-to-image mapping when provided with phone, syllable, or word boundary information. We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient. We also explore at which level of the network's architecture such information should be introduced so as to maximise its performance. Finally, we show that using multiple boundary types at once in a hierarchical structure, in which low-level segments are used to recompose high-level segments, is beneficial and yields better results than using low-level or high-level segments in isolation.
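The abstract does not spell out how boundary information enters the network, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes a recurrent layer's hidden states are kept only at segment-final positions and passed to the next layer, so that syllable-level vectors are recomposed into word-level vectors. The toy tanh RNN, the random weights, the dimensions, and the boundary annotations (`syllable_ends`, `word_ends`) are all hypothetical.

```python
import numpy as np


def simple_rnn(inputs, w_in, w_rec):
    """Run a minimal tanh RNN and return the hidden state at every step."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x in inputs:
        hidden = np.tanh(w_in @ x + w_rec @ hidden)
        states.append(hidden)
    return np.stack(states)


def keep_at_boundaries(states, boundary_flags):
    """Keep only the states at positions flagged as the end of a segment."""
    mask = np.asarray(boundary_flags, dtype=bool)
    return states[mask]


rng = np.random.default_rng(0)
n_frames, feat_dim, hid_dim = 12, 5, 4

# Hypothetical acoustic frames (random stand-ins for real speech features).
frames = rng.normal(size=(n_frames, feat_dim))

# Hypothetical annotations: 1 marks the last frame of a syllable.
syllable_ends = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # 5 syllables
# Word ends are expressed over the syllable sequence, not over frames.
word_ends = [0, 1, 0, 0, 1]  # 2 words

# Randomly initialised weights for the two toy recurrent layers.
w1_in = rng.normal(size=(hid_dim, feat_dim))
w1_rec = rng.normal(size=(hid_dim, hid_dim))
w2_in = rng.normal(size=(hid_dim, hid_dim))
w2_rec = rng.normal(size=(hid_dim, hid_dim))

# Lower layer: read frames, keep one vector per syllable.
frame_states = simple_rnn(frames, w1_in, w1_rec)
syllable_vecs = keep_at_boundaries(frame_states, syllable_ends)

# Upper layer: read syllable vectors, keep one vector per word.
syllable_states = simple_rnn(syllable_vecs, w2_in, w2_rec)
word_vecs = keep_at_boundaries(syllable_states, word_ends)

print(syllable_vecs.shape, word_vecs.shape)
```

In this sketch the sequence gets shorter at each level (12 frames, then 5 syllable vectors, then 2 word vectors), which is one way low-level segments could feed the construction of high-level ones. Using only `word_ends` defined over frames with a single layer would correspond to using high-level segments in isolation.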