QuerYD: A video dataset with high-quality text and audio narrations

Paper title

QuerYD: A video dataset with high-quality text and audio narrations

Authors

Andreea-Maria Oncescu, João F. Henriques, Yang Liu, Andrew Zisserman, Samuel Albanie

Abstract

We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video. A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content. The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos. This ever-growing collection of videos contains highly detailed, temporally aligned audio and text annotations. The content descriptions are more relevant than dialogue, and more detailed than previous description attempts, which can be observed to contain many superficial or uninformative descriptions. To demonstrate the utility of the QuerYD dataset, we show that it can be used to train and benchmark strong models for retrieval and event localisation. Data, code and models are made publicly available, and we hope that QuerYD inspires further research on video understanding with written and spoken natural language.
