Paper Title

Recurrent Relational Memory Network for Unsupervised Image Captioning

Authors

Dan Guo, Yang Wang, Peipei Song, Meng Wang

Abstract

Unsupervised image captioning with no annotations is an emerging challenge in computer vision, where existing approaches usually adopt GAN (Generative Adversarial Network) models. In this paper, we propose a novel memory-based network rather than a GAN, named the Recurrent Relational Memory Network ($R^2M$). Unlike complicated and sensitive adversarial learning, which performs poorly on long sentence generation, $R^2M$ implements a concepts-to-sentence memory translator through a two-stage memory mechanism: fusion and recurrent memories, which correlate the relational reasoning between common visual concepts and the generated words over long periods. $R^2M$ encodes visual context through unsupervised training on images, while enabling the memory to learn from an irrelevant text corpus in a supervised fashion. Our solution has fewer learnable parameters and higher computational efficiency than GAN-based methods, which suffer heavily from parameter sensitivity. We experimentally validate the superiority of $R^2M$ over the state of the art on all benchmark datasets.
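The abstract only outlines the architecture at a high level. For concreteness, below is a minimal, hypothetical PyTorch sketch of the two-stage memory idea it describes: a fusion memory that initializes memory slots from detected visual concepts, and a recurrent relational memory that updates those slots with each generated word via self-attention. All module names, dimensions, and the attention-based update rule here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage memory mechanism described in the
# abstract (fusion memory + recurrent relational memory). Names and the
# specific update rule are assumptions for illustration only.
import torch
import torch.nn as nn

class FusionMemory(nn.Module):
    """Fuses visual-concept embeddings into an initial memory state."""
    def __init__(self, embed_dim: int, mem_slots: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, mem_slots, embed_dim))

    def forward(self, concept_embeds: torch.Tensor) -> torch.Tensor:
        # concept_embeds: (batch, num_concepts, embed_dim)
        q = self.query.expand(concept_embeds.size(0), -1, -1)
        fused, _ = self.attn(q, concept_embeds, concept_embeds)
        return fused  # (batch, mem_slots, embed_dim) initial memory

class RecurrentRelationalMemory(nn.Module):
    """Updates memory at each decoding step by attending over the memory
    slots plus the current word embedding (relational reasoning)."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, memory: torch.Tensor, word_embed: torch.Tensor) -> torch.Tensor:
        # memory: (batch, mem_slots, embed_dim); word_embed: (batch, embed_dim)
        kv = torch.cat([memory, word_embed.unsqueeze(1)], dim=1)
        updated, _ = self.attn(memory, kv, kv)
        return memory + self.mlp(updated)  # residual memory update

class R2MDecoder(nn.Module):
    """Concepts-to-sentence decoder: fusion memory initializes the state,
    recurrent memory carries relational context across generated words."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, mem_slots: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fusion = FusionMemory(embed_dim, mem_slots)
        self.recurrent = RecurrentRelationalMemory(embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, concept_embeds: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        memory = self.fusion(concept_embeds)
        logits = []
        for t in range(captions.size(1)):
            memory = self.recurrent(memory, self.embed(captions[:, t]))
            logits.append(self.out(memory.mean(dim=1)))  # pool slots -> word logits
        return torch.stack(logits, dim=1)  # (batch, seq_len, vocab_size)
```

In this reading, such a decoder could be trained with ordinary cross-entropy on an unpaired text corpus (the supervised memory learning the abstract mentions), while the concept embeddings come from unsupervised training on images; because the whole model is a single feed-forward/recurrent network, it avoids the adversarial min-max training and parameter sensitivity attributed to GAN-based baselines.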
