Paper Title
Leveraging Pre-trained BERT for Audio Captioning
Paper Authors
Paper Abstract
Audio captioning aims at using natural language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and a language decoder is then used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring knowledge from pre-trained audio models such as Pre-trained Audio Neural Networks (PANNs) has recently emerged as a useful method to mitigate this issue. However, less attention has been paid to exploiting pre-trained language models for the decoder than to pre-trained audio models for the encoder. BERT is a pre-trained language model that has been extensively used in Natural Language Processing (NLP) tasks. Nevertheless, the potential of BERT as a language decoder for audio captioning has not been investigated. In this study, we demonstrate the efficacy of pre-trained BERT models for audio captioning. Specifically, we apply PANNs as the encoder and initialize the decoder from public pre-trained BERT models. We conduct an empirical study on the use of these BERT models for the decoder in our audio captioning model. Our models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.
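A minimal sketch of the encoder-decoder setup the abstract describes, assuming PyTorch and the HuggingFace `transformers` library. This is not the authors' code: the small convolutional encoder is a placeholder standing in for PANNs, and the class name, feature dimensions, and training call are hypothetical. The decoder, however, is initialized from a public pre-trained BERT checkpoint in the way the abstract outlines.

```python
import torch
import torch.nn as nn
from transformers import BertLMHeadModel, BertTokenizer

class AudioCaptioner(nn.Module):
    """Hypothetical encoder-decoder captioner: audio encoder + BERT decoder."""

    def __init__(self, n_mels: int = 64, hidden: int = 768):
        super().__init__()
        # Placeholder audio encoder; in the paper's setup a PANNs model
        # (e.g. CNN14) would extract the acoustic features instead.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Decoder initialized from a public pre-trained BERT checkpoint.
        # BERT is encoder-only, so the cross-attention layers added here
        # are new and randomly initialized.
        self.decoder = BertLMHeadModel.from_pretrained(
            "bert-base-uncased", is_decoder=True, add_cross_attention=True
        )

    def forward(self, mel: torch.Tensor, input_ids: torch.Tensor,
                labels: torch.Tensor):
        # mel: (batch, n_mels, time) log-mel spectrogram features.
        audio_feats = self.encoder(mel).transpose(1, 2)  # (batch, time, hidden)
        out = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=audio_feats,  # decoder cross-attends to audio
            labels=labels,                      # shifted internally for LM loss
        )
        return out.loss, out.logits

# Toy usage with random audio features and a dummy caption.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = AudioCaptioner()
caption = tokenizer("a dog barks twice", return_tensors="pt")
loss, logits = model(torch.randn(1, 64, 100),
                     caption.input_ids, caption.input_ids)
```

Initializing the decoder this way reuses BERT's pre-trained embeddings and self-attention weights, while the newly added cross-attention layers, which attend over the audio encoder's output, must still be learned during captioning training.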