Paper Title
Contrastive Graph Multimodal Model for Text Classification in Videos
Paper Authors
Paper Abstract
The extraction of text information in videos serves as a critical step towards semantic understanding of videos. It usually involves two steps: (1) text recognition and (2) text classification. To localize texts in videos, we can resort to a large number of text recognition methods based on OCR technology. However, to our knowledge, no existing work focuses on the second step, video text classification, which limits guidance for downstream tasks such as video indexing and browsing. In this paper, we are the first to address this new task of video text classification by fusing multimodal information to deal with the challenging scenario where different types of video texts may be confused due to varied colors, unknown fonts, and complex layouts. In addition, we tailor a specific module called CorrelationNet to reinforce feature representations by explicitly extracting layout information. Furthermore, contrastive learning is utilized to explore the inherent connections between samples using plentiful unlabeled videos. Finally, we construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications. Extensive experiments on TI-News demonstrate the effectiveness of our method.
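The abstract mentions contrastive learning over unlabeled videos but does not specify the loss or the encoder. Below is a minimal sketch of the generic InfoNCE-style contrastive objective such methods typically use; the function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two views of the same N video-text samples.
    A generic InfoNCE sketch; the paper's actual loss is not specified."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau  # (N, N) cosine-similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)
    # Row i's positive is its paired view i; the other N-1 rows serve as negatives.
    return F.cross_entropy(logits, targets)

# Example with random stand-ins for embeddings of two augmented views.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z1, z2)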