Paper Title
Multimodal Pretraining for Dense Video Captioning
Paper Authors
Paper Abstract
Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.