Paper Title

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

Paper Authors

Mengqi Huang, Zhendong Mao, Penghui Wang, Quan Wang, Yongdong Zhang

Paper Abstract

Text-to-image generation aims at generating realistic images that are semantically consistent with the given text. Previous works mainly adopt a multi-stage architecture that stacks generator-discriminator pairs to engage in multiple rounds of adversarial training, where the text semantics used to provide generation guidance remain static across all stages. This work argues that the text features at each stage should be adaptively re-composed, conditioned on the status of the historical stages (i.e., their text and image features), to provide diversified and accurate semantic guidance during the coarse-to-fine generation process. We thereby propose a novel Dynamic Semantic Evolution GAN (DSE-GAN) that re-composes each stage's text features under a novel single adversarial multi-stage architecture. Specifically, we design (1) a Dynamic Semantic Evolution (DSE) module, which first aggregates historical image features to summarize the generative feedback, then dynamically selects the words to be re-composed at each stage and re-composes them by dynamically enhancing or suppressing the semantics of different granularity subspaces; and (2) a Single Adversarial Multi-stage Architecture (SAMA), which extends the previous structure by eliminating the complicated requirement of multiple adversarial training rounds, thereby allowing more stages of text-image interaction and ultimately facilitating the DSE module. We conduct comprehensive experiments and show that DSE-GAN achieves 7.48% and 37.8% relative FID improvement on two widely used benchmarks, CUB-200 and MSCOCO, respectively.
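To make the abstract's description of the DSE module more concrete, below is a minimal PyTorch sketch of the idea: historical image features are pooled into a generative-feedback vector, a per-word gate decides which words are re-composed at the current stage, and per-subspace scales enhance or suppress the semantics of different granularity subspaces of each word embedding. All names, dimensions, and the specific gating/scaling choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Dynamic Semantic Evolution (DSE) idea from the abstract.
# All module names, dimensions, and gating choices are assumptions for illustration.
import torch
import torch.nn as nn


class DSESketch(nn.Module):
    """Re-composes per-word text features conditioned on generative feedback
    summarized from the previous stage's image features."""

    def __init__(self, word_dim=256, img_dim=256, num_subspaces=4):
        super().__init__()
        assert word_dim % num_subspaces == 0
        self.num_subspaces = num_subspaces
        # Summarize historical image features into one feedback vector.
        self.feedback = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(img_dim, word_dim)
        )
        # Per-word gate: how strongly each word is re-composed at this stage.
        self.word_gate = nn.Linear(2 * word_dim, 1)
        # Per-subspace scale: enhance (>1) or suppress (<1) each granularity
        # subspace of a word's embedding.
        self.subspace_scale = nn.Linear(2 * word_dim, num_subspaces)

    def forward(self, words, prev_img):
        # words:    (B, L, word_dim)   word-level text features
        # prev_img: (B, img_dim, H, W) image features from the previous stage
        B, L, D = words.shape
        fb = self.feedback(prev_img)                            # (B, word_dim)
        fb = fb.unsqueeze(1).expand(B, L, D)                    # broadcast per word
        ctx = torch.cat([words, fb], dim=-1)                    # (B, L, 2*word_dim)

        gate = torch.sigmoid(self.word_gate(ctx))               # (B, L, 1)
        scale = torch.sigmoid(self.subspace_scale(ctx)) * 2.0   # (B, L, S), in (0, 2)

        # Scale each granularity subspace of the word feature separately.
        sub = words.view(B, L, self.num_subspaces, D // self.num_subspaces)
        recomposed = (sub * scale.unsqueeze(-1)).view(B, L, D)

        # Low-gate words keep their original (static) features; high-gate words
        # receive the evolved, re-composed features.
        return gate * recomposed + (1.0 - gate) * words


# Quick shape check with random tensors (assumed shapes).
if __name__ == "__main__":
    dse = DSESketch()
    out = dse(torch.randn(2, 18, 256), torch.randn(2, 256, 16, 16))
    print(out.shape)  # torch.Size([2, 18, 256])
```

In the full model, a re-composition of this kind would be applied once per generation stage within the single adversarial multi-stage architecture, so that each stage receives evolved rather than static word features as guidance.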
