Title
Thinking Hallucination for Video Captioning
Authors
Abstract
With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite these performance gains, video captioning models are prone to hallucination: the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object hallucination and action hallucination. Rather than endeavoring to learn better representations of a video, in this work we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influence of the source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) auxiliary heads trained in a multi-label setting on top of the extracted visual features and (b) context gates, which dynamically select features during fusion. The standard evaluation metrics for video captioning measure similarity to ground-truth captions and do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (Caption Object and Action Hallucination Assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and Microsoft Research Video Description Corpus (MSVD) datasets, with a particularly large margin in CIDEr score.
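The context-gate fusion mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the gate parameterization (linear maps `W_s`, `W_t` and bias `b`) is an assumption in the spirit of standard context gates, where a sigmoid gate decides, element-wise, how much of the source (visual) context versus the target (language) context enters the fused representation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(source_ctx, target_ctx, W_s, W_t, b):
    """Fuse source (visual) and target (language) context vectors.

    The gate z is computed from both contexts; gate values near 1
    favor the source context, values near 0 favor the target context.
    """
    z = sigmoid(source_ctx @ W_s + target_ctx @ W_t + b)
    return z * source_ctx + (1.0 - z) * target_ctx

# Toy example with random contexts and hypothetical gate parameters.
rng = np.random.default_rng(0)
d = 8
s = rng.standard_normal(d)            # visual (source) context
t = rng.standard_normal(d)            # language (target) context
W_s = 0.1 * rng.standard_normal((d, d))
W_t = 0.1 * rng.standard_normal((d, d))
b = np.zeros(d)

fused = context_gate(s, t, W_s, W_t, b)
```

Because the gate is a sigmoid, each fused component is a convex combination of the corresponding source and target components, so the fused vector always lies element-wise between the two contexts; in the paper's setting the gate parameters would be learned jointly with the captioning model.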