Paper Title
CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow
Paper Authors
Paper Abstract
Despite impressive performance on high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene, which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision-transformer-based cross-view completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volumes, iterative estimation, image warping, or multi-scale reasoning, thus paving the way towards universal vision models.
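To make the cross-view completion objective concrete, the following is a minimal NumPy sketch of how a pre-training batch could be assembled: most patches of the first view are masked and must be reconstructed, while the second view of the same scene is kept fully visible as cross-attention context. This is an illustrative assumption about the data pipeline, not the authors' implementation; the function names, the patch size, and the 90% mask ratio are hypothetical placeholders.

```python
import numpy as np

def patchify(img, p):
    # Split an (H, W, C) image into non-overlapping p x p patches,
    # flattened to a (num_patches, p*p*C) array.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)

def cross_view_completion_targets(view1, view2, patch=16, mask_ratio=0.9, seed=0):
    """Build one cross-view completion training example: mask a large
    fraction of view1's patches; the model must predict them while
    attending to the full, unmasked view2 of the same scene."""
    rng = np.random.default_rng(seed)
    p1 = patchify(view1, patch)            # view 1: partially masked
    p2 = patchify(view2, patch)            # view 2: fully visible context
    n = p1.shape[0]
    n_mask = int(mask_ratio * n)
    idx = rng.permutation(n)
    masked_idx, visible_idx = idx[:n_mask], idx[n_mask:]
    return {
        "visible": p1[visible_idx],        # encoder input from view 1
        "context": p2,                     # cross-attention context (view 2)
        "targets": p1[masked_idx],         # pixels the decoder must predict
        "masked_idx": masked_idx,          # positions of the masked patches
    }
```

Training would then minimize a pixel reconstruction loss (e.g. mean squared error) between the decoder's predictions and `targets`, exactly as in masked image modeling, except that the second view gives the model geometric cues to exploit.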