SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding

Paper title

SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding

Authors

Mengxue Qu, Yu Wu, Wu Liu, Qiqi Gong, Xiaodan Liang, Olga Russakovsky, Yao Zhao, Yunchao Wei

Abstract

In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. In particular, SiRi conveys a significant principle for visual grounding research: a better-initialized vision-language encoder helps the model converge to a better local minimum, improving performance accordingly. Specifically, we continually update the parameters of the encoder as training goes on, while periodically re-initializing the rest of the parameters, which compels the model to be better optimized on top of an enhanced encoder. SiRi significantly outperforms previous approaches on three popular benchmarks. Our method achieves 83.04% Top1 accuracy on RefCOCO+ testA, outperforming the state-of-the-art approaches (training from scratch) by more than 10.21%. Additionally, we show that SiRi performs surprisingly well even with limited training data. We also extend it to transformer-based visual grounding models and other vision-language tasks to verify its validity.
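The training schedule described in the abstract (keep updating the encoder, periodically re-initialize everything else) can be illustrated with a toy sketch. This is not the authors' implementation: the two-scalar "model", the squared-error loss, and the names `init_rest`, `train_one_round`, and `selective_retraining` are all hypothetical, chosen only to show which parameters are carried over between rounds and which are reset.

```python
import random


def init_rest(rng):
    """Fresh random init for the non-encoder parameters (here a single scalar)."""
    return rng.uniform(-0.5, 0.5)


def train_one_round(enc_w, head_w, data, lr, epochs):
    """Plain SGD on the toy model y = head_w * enc_w * x with squared error."""
    for _ in range(epochs):
        for x, target in data:
            err = head_w * enc_w * x - target
            grad_enc = 2 * err * head_w * x
            grad_head = 2 * err * enc_w * x
            enc_w -= lr * grad_enc
            head_w -= lr * grad_head
    return enc_w, head_w


def selective_retraining(data, rounds=3, epochs_per_round=50, lr=0.01, seed=0):
    """Toy selective retraining: the encoder weight persists across rounds,
    while the remaining weight is re-initialized at the start of each new round."""
    rng = random.Random(seed)
    enc_w = rng.uniform(0.5, 1.0)  # stand-in for the vision-language encoder
    head_w = init_rest(rng)        # stand-in for all remaining parameters
    log = []
    for r in range(rounds):
        if r > 0:
            head_w = init_rest(rng)  # reset the rest; enc_w is carried over
        enc_start = enc_w
        enc_w, head_w = train_one_round(enc_w, head_w, data, lr, epochs_per_round)
        log.append({"round": r, "enc_start": enc_start,
                    "enc_end": enc_w, "head_end": head_w})
    return enc_w, head_w, log


if __name__ == "__main__":
    _, _, log = selective_retraining([(1.0, 2.0), (2.0, 4.0)])
    for entry in log:
        print(entry)
```

In a real transformer-based grounding model, the same pattern would apply to groups of parameters rather than two scalars; the number of rounds and epochs per round here are arbitrary illustration values, not settings from the paper.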
