Paper Title
Audio Captioning using Gated Recurrent Units
Paper Authors
Paper Abstract
Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. In this study, a novel deep network architecture with audio embeddings is presented to predict audio captions. With the aim of extracting audio features, the VGGish audio embedding model is used in addition to log Mel energies to explore the usability of audio embeddings in the audio captioning task. The proposed architecture encodes the audio and text input modalities separately and combines them before the decoding stage. Audio encoding is conducted through a Bi-directional Gated Recurrent Unit (BiGRU), while a GRU is used for the text encoding phase. Following this, we evaluate our model on the newly published audio captioning dataset, namely Clotho, to compare the experimental results with the literature. Our experimental results show that the proposed BiGRU-based deep model outperforms the state-of-the-art results.
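The encoder-decoder idea described in the abstract can be pictured with a small code sketch. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the class name, layer sizes, and feature dimensions are illustrative assumptions. It encodes an audio feature sequence (e.g., VGGish embeddings or log Mel energies) with a BiGRU, encodes the caption text generated so far with a GRU, concatenates the two modality encodings, and predicts next-word logits in a decoding step.

```python
# Minimal sketch (not the authors' implementation) of the architecture described
# in the abstract: BiGRU audio encoder + GRU text encoder, combined before decoding.
# All dimensions and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn


class AudioCaptioningSketch(nn.Module):
    def __init__(self, audio_dim=128, vocab_size=5000, embed_dim=256, hidden=256):
        super().__init__()
        # Audio encoder: bi-directional GRU over the audio feature sequence.
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        # Text encoder: uni-directional GRU over word embeddings of the caption prefix.
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.text_enc = nn.GRU(embed_dim, hidden, batch_first=True)
        # Decoder: combine the two modality encodings and predict the next word.
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, audio_feats, caption_prefix):
        # audio_feats: (batch, time, audio_dim), e.g., per-frame VGGish embeddings
        # caption_prefix: (batch, words) integer word indices generated so far
        _, a_hidden = self.audio_enc(audio_feats)                    # (2, batch, hidden)
        audio_vec = torch.cat([a_hidden[0], a_hidden[1]], dim=-1)    # (batch, 2*hidden)
        _, t_hidden = self.text_enc(self.word_emb(caption_prefix))   # (1, batch, hidden)
        text_vec = t_hidden[0]                                       # (batch, hidden)
        fused = torch.cat([audio_vec, text_vec], dim=-1)             # combine modalities
        return self.decoder(fused)                                   # next-word logits


# Usage example with random tensors:
model = AudioCaptioningSketch()
audio = torch.randn(4, 10, 128)           # 4 clips, 10 frames of 128-dim audio features
prefix = torch.randint(0, 5000, (4, 7))   # 7 words generated so far per clip
logits = model(audio, prefix)             # (4, 5000)
```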