Paper Title
Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach
Paper Authors
Paper Abstract
Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without ground truth for the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually either requires the user to describe all the characteristics of the desired image or relies on richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". In contrast to state-of-the-art approaches, our model requires neither a human-annotated dataset nor a textual description of all the attributes of the desired image, but only of those that have to be modified. Our proposed model disentangles the image content from the visual attributes and learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text may be inherently ambiguous (blond hair may refer to different shades of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performance on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.
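To make the pipeline described above concrete, here is a minimal, illustrative PyTorch sketch of the disentangle-modify-generate flow: a content encoder and an attribute encoder split the input image, a text-conditioned modifier edits the attribute vector (with a noise vector z so that one ambiguous command can yield several stochastic translations), and a generator decodes the new image. All module names, layer sizes, and the text embedding here are hypothetical placeholders for illustration, not the authors' actual architecture.

import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps an image to a spatial content code (layout, pose)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class AttributeEncoder(nn.Module):
    """Maps an image to a global attribute vector (e.g. color, texture)."""
    def __init__(self, attr_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, attr_dim),
        )
    def forward(self, x):
        return self.net(x)

class TextModifier(nn.Module):
    """Edits the attribute vector conditioned on the command embedding and
    a noise vector z; different z samples give different plausible edits
    for the same ambiguous command."""
    def __init__(self, attr_dim=8, text_dim=16, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + text_dim + z_dim, 64), nn.ReLU(),
            nn.Linear(64, attr_dim),
        )
    def forward(self, attr, txt, z):
        return self.net(torch.cat([attr, txt, z], dim=1))

class Generator(nn.Module):
    """Decodes a new image from the content code and the edited attributes."""
    def __init__(self, attr_dim=8):
        super().__init__()
        self.fc = nn.Linear(attr_dim, 64)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, content, attr):
        # Inject the attribute vector channel-wise; a simple stand-in for
        # the AdaIN-style conditioning commonly used in such models.
        b = self.fc(attr).view(-1, 64, 1, 1)
        return self.net(content + b)

# Forward pass: one image, one command, three stochastic translations.
img = torch.randn(1, 3, 64, 64)   # input image
txt = torch.randn(1, 16)          # placeholder embedding of "change the hair color to black"
enc_c, enc_a = ContentEncoder(), AttributeEncoder()
mod, gen = TextModifier(), Generator()

content, attr = enc_c(img), enc_a(img)
outputs = [gen(content, mod(attr, txt, torch.randn(1, 8))) for _ in range(3)]
print([o.shape for o in outputs])  # three 1x3x64x64 candidate edits

Sampling several z vectors for the same command, as in the last lines, corresponds to the multiple stochastic versions of a translation mentioned in the abstract.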