Paper title
Cosine meets Softmax: A tough-to-beat baseline for visual grounding
Paper authors
Paper abstract
In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms state-of-the-art methods while retaining minimal design choices. Our framework minimizes a cross-entropy loss over the cosine distances between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks to obtain the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By showing promise in simpler alternatives, our investigation suggests reconsidering approaches that employ sophisticated attention mechanisms, multi-stage reasoning, or complex metric-learning loss functions.
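The scoring scheme the abstract describes can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, array shapes, and random inputs are hypothetical, and in the paper the ROI features and text embedding come from pre-trained networks (with a learned transformation on the text side).

```python
import numpy as np

def cosine_softmax_scores(roi_feats, text_emb):
    """Score each image ROI against a text embedding.

    roi_feats: (N, D) array of ROI features (from a pretrained image network).
    text_emb:  (D,) text embedding (after a learned transformation layer).
    Returns a softmax distribution over the N ROIs based on cosine similarity.
    """
    roi_norm = roi_feats / np.linalg.norm(roi_feats, axis=1, keepdims=True)
    txt_norm = text_emb / np.linalg.norm(text_emb)
    sims = roi_norm @ txt_norm              # cosine similarity per ROI
    exp = np.exp(sims - sims.max())         # numerically stable softmax
    return exp / exp.sum()

# Training would then minimize cross-entropy against the ground-truth ROI:
#   loss = -log(scores[correct_roi_index])
rng = np.random.default_rng(0)
scores = cosine_softmax_scores(rng.normal(size=(8, 256)),
                               rng.normal(size=256))
```

At inference, the predicted bounding box is simply the ROI with the highest score, which is what makes the baseline so lightweight compared to multi-stage reasoning pipelines.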