Paper Title

TIME: Text and Image Mutual-Translation Adversarial Networks

Authors

Bingchen Liu, Kunpeng Song, Yizhe Zhu, Gerard de Melo, Ahmed Elgammal

Abstract

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image-captioning discriminator D under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and use pre-trained language models to enforce image--text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between the image features and word embeddings, and design an annealing conditional hinge loss that dynamically balances the adversarial learning. In our experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets (Inception Score of 4.91 and Fréchet Inception Distance of 14.3 on CUB), and shows promising results on image captioning and downstream vision-language tasks on MS-COCO.
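To make the "annealing conditional hinge loss" idea concrete, below is a minimal, hypothetical sketch. In conditional-GAN hinge losses, the discriminator typically penalizes real images paired with mismatched text in addition to fake images; an annealing coefficient can rebalance that conditional term over training. The function names, the linear schedule, and the specific weighting are illustrative assumptions, not the paper's exact formulation:

```python
def hinge_d_loss(real_score, fake_score, mismatch_score, alpha):
    """Conditional hinge loss for the discriminator (illustrative sketch).

    real_score:     D(real image, matching text)  -- should be pushed above +1
    fake_score:     D(generated image, text)      -- should be pushed below -1
    mismatch_score: D(real image, mismatched text) -- conditional term
    alpha:          annealing weight on the conditional term (assumed schedule)
    """
    loss_real = max(0.0, 1.0 - real_score)
    loss_fake = max(0.0, 1.0 + fake_score)
    loss_mismatch = max(0.0, 1.0 + mismatch_score)
    return loss_real + loss_fake + alpha * loss_mismatch


def anneal(step, total_steps, a_start=1.0, a_end=0.0):
    """Linear annealing schedule for the conditional weight (assumption)."""
    t = min(step / float(total_steps), 1.0)
    return a_start + t * (a_end - a_start)
```

For example, early in training (`alpha = anneal(0, 10000) = 1.0`) the mismatched-text term contributes fully, and its influence fades as training progresses, which is one plausible way to "dynamically balance" the adversarial objectives.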
