Paper Title


DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis

Authors

Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, Changsheng Xu

Abstract


Synthesizing high-quality realistic images from text descriptions is a challenging task. Existing text-to-image Generative Adversarial Networks generally employ a stacked architecture as the backbone, yet three flaws remain. First, the stacked architecture introduces entanglements between generators at different image scales. Second, existing studies prefer to apply and fix extra networks in adversarial learning to enforce text-image semantic consistency, which limits the supervision capability of these networks. Third, the cross-modal attention-based text-image fusion widely adopted by previous works is limited to a few specific image scales because of its computational cost. To address these issues, we propose a simpler but more effective Deep Fusion Generative Adversarial Network (DF-GAN). Specifically, we propose: (i) a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators; (ii) a novel Target-Aware Discriminator, composed of a Matching-Aware Gradient Penalty and a One-Way Output, which enhances text-image semantic consistency without introducing extra networks; (iii) a novel deep text-image fusion block, which deepens the fusion process to fully fuse text and visual features. Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient at synthesizing realistic, text-matching images, and it achieves better performance on widely used datasets.
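The deep text-image fusion block described in point (iii) conditions visual features on the sentence embedding through learned channel-wise affine transformations (a scale and a shift predicted from the text). The following is a minimal, framework-free sketch of that affine modulation step only; the function name, toy linear weights, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
def affine_fusion(visual_channels, text_embedding, w_gamma, w_beta):
    """Toy channel-wise affine conditioning: each visual channel value c
    is modulated as gamma_c * c + beta_c, where gamma and beta are
    (here, linear) functions of the text embedding.

    visual_channels: list of per-channel feature values
    text_embedding:  list of floats (sentence embedding)
    w_gamma, w_beta: one weight row per channel (toy stand-ins for the
                     small MLPs that predict scale and shift)
    """
    # Predict a scale (gamma) and shift (beta) per channel from the text.
    gamma = [sum(w * t for w, t in zip(row, text_embedding)) for row in w_gamma]
    beta = [sum(w * t for w, t in zip(row, text_embedding)) for row in w_beta]
    # Modulate each visual channel with its text-derived scale and shift.
    return [g * c + b for g, c, b in zip(gamma, visual_channels, beta)]


# Example: two channels, a 2-dim text embedding, hand-picked toy weights.
out = affine_fusion(
    visual_channels=[1.0, 2.0],
    text_embedding=[1.0, 0.5],
    w_gamma=[[1.0, 0.0], [0.0, 2.0]],  # gamma = [1.0, 1.0]
    w_beta=[[0.0, 1.0], [1.0, 0.0]],   # beta  = [0.5, 1.0]
)
# out == [1.5, 3.0]
```

Stacking several such affine layers (with nonlinearities between them) at every generator scale is what "deepens the fusion process" relative to applying attention only at a few resolutions.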
