Paper Title


i-Code: An Integrative and Composable Multimodal Learning Framework

Paper Authors

Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

Paper Abstract


Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research using only video for pretraining, the i-Code framework can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
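To make the cross-modality contrastive objective mentioned above concrete, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between two batches of modality embeddings (e.g. vision and language). This is a minimal illustration of the general technique, not the paper's exact formulation; the function name, temperature value, and toy data are all assumptions for the example.

```python
import numpy as np

def cross_modal_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss between two modalities.

    Paired rows (same index) are positives; every other row in the
    batch serves as a negative. A hypothetical simplification of the
    cross-modality contrastive objective described in the abstract.
    """
    # L2-normalize so dot products become cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix

    # Log-softmax over each row; the positive pair sits on the diagonal
    log_probs_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_a2b = -np.mean(np.diag(log_probs_ab))

    # Symmetrize: also contrast modality b against modality a
    logits_t = logits.T
    log_probs_ba = logits_t - np.log(np.exp(logits_t).sum(axis=1, keepdims=True))
    loss_b2a = -np.mean(np.diag(log_probs_ba))

    return (loss_a2b + loss_b2a) / 2

# Toy data: "text" embeddings nearly aligned with "vision" embeddings
rng = np.random.default_rng(0)
vision = rng.normal(size=(4, 16))
text = vision + 0.01 * rng.normal(size=(4, 16))

loss_aligned = cross_modal_contrastive_loss(vision, text)
loss_random = cross_modal_contrastive_loss(vision, rng.normal(size=(4, 16)))
print(loss_aligned, loss_random)
```

Well-aligned pairs yield a much lower loss than random pairings, which is what drives the different modalities toward a single shared representation space during pretraining.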
