Paper Title
Generative Compositional Augmentations for Scene Graph Prediction
Paper Authors
Paper Abstract
Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language. We consider a challenging problem of compositional generalization that emerges in this task due to a long-tail data distribution. Current scene graph generation models are trained on a tiny fraction of the distribution corresponding to the most frequent compositions, e.g. <cup, on, table>. However, test images might contain zero- and few-shot compositions of objects and relationships, e.g. <cup, on, surfboard>. Despite each of the object categories and the predicate (e.g. 'on') being frequent in the training data, the models often fail to properly understand such unseen or rare compositions. To improve generalization, it is natural to attempt increasing the diversity of the training distribution. However, in the graph domain this is non-trivial. To that end, we propose a method to synthesize rare yet plausible scene graphs by perturbing real ones. We then propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs and learn from them in a joint fashion. When evaluated on the Visual Genome dataset, our approach yields marginal but consistent improvements in zero- and few-shot metrics. We analyze the limitations of our approach, indicating promising directions for future research.
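The augmentation idea described in the abstract (creating unseen but plausible compositions by perturbing real triplets, while every object category and predicate remains individually frequent) can be illustrated with a minimal sketch. The snippet below is only an assumption-laden illustration of a naive random object swap; the function and parameter names (perturb_scene_graphs, num_samples) are hypothetical, and the paper's actual method additionally restricts perturbations to plausible ones and pairs them with GAN-generated visual features.

```python
import random
from collections import Counter

def perturb_scene_graphs(triplets, num_samples=5, seed=0):
    """Synthesize rare, syntactically valid triplets by perturbing real ones.

    `triplets` is a list of (subject, predicate, object) tuples taken from
    training scene graphs, e.g. ("cup", "on", "table"). Objects are swapped
    in from other triplets, so each category stays frequent on its own while
    the resulting composition may be zero-shot with respect to training.
    """
    rng = random.Random(seed)
    seen = Counter(triplets)                       # frequency of each real composition
    objects = sorted({o for _, _, o in triplets})  # pool of object categories
    perturbed = []
    for subj, pred, obj in triplets:
        for _ in range(num_samples):
            new_obj = rng.choice(objects)
            candidate = (subj, pred, new_obj)
            # keep only compositions that never occur in the training set
            if new_obj != obj and seen[candidate] == 0:
                perturbed.append(candidate)
    return perturbed

# Example: the frequent composition <cup, on, table> can yield
# rare candidates such as <cup, on, surfboard>.
train = [("cup", "on", "table"), ("person", "on", "surfboard"), ("cup", "on", "shelf")]
print(perturb_scene_graphs(train))
```

In the full model, such perturbed graphs have no corresponding image, which is why the abstract's conditional GAN is used to generate visual features for them and train jointly on real and synthesized compositions.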