Paper title
Guiding Visual Attention in Deep Convolutional Neural Networks Based on Human Eye Movements
Authors
Abstract
Deep Convolutional Neural Networks (DCNNs) were originally inspired by principles of biological vision and have evolved into the best current computational models of object recognition. Consequently, they show strong architectural and functional parallels with the ventral visual pathway in comparisons with neuroimaging and neural time series data. As recent advances in deep learning seem to decrease this similarity, computational neuroscience is challenged to reverse-engineer biological plausibility to obtain useful models. While previous studies have shown that biologically inspired architectures can amplify the human-likeness of the models, in this study we investigate a purely data-driven approach. We use human eye tracking data to directly modify training examples and thereby guide the models' visual attention during object recognition in natural images either towards or away from the focus of human fixations. We compare and validate the different manipulation types (i.e., standard, human-like, and non-human-like attention) through GradCAM saliency maps against human participant eye tracking data. Our results demonstrate that the proposed guided focus manipulation works as intended in the negative direction, and non-human-like models focus on significantly dissimilar image parts compared to humans. The observed effects were highly category-specific, enhanced by animacy and face presence, developed only after feedforward processing was completed, and indicated a strong influence on face detection. With this approach, however, no significantly increased human-likeness was found. Possible applications of overt visual attention in DCNNs and further implications for theories of face detection are discussed.
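To make the two core steps of the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not say how training examples are modified or which similarity measure is used, so the dimming-based manipulation, the Pearson correlation, and the names `guide_focus`, `map_similarity`, and `dim` are all hypothetical choices made for illustration. The saliency map is taken as a plain array; in the study it would come from GradCAM.

```python
import numpy as np


def guide_focus(image, fixation_map, toward_human=True, dim=0.2):
    """Dim image regions according to a human fixation density map.

    Hypothetical manipulation (an assumption, not the paper's method).
    image:        H x W x C float array with values in [0, 1]
    fixation_map: H x W non-negative array of fixation density
    toward_human: True keeps fixated regions bright (human-like),
                  False keeps non-fixated regions bright (non-human-like)
    dim:          brightness factor applied to fully suppressed pixels
    """
    fmap = fixation_map.astype(float)
    span = fmap.max() - fmap.min()
    if span == 0:
        # A flat map carries no spatial information; leave the image as is.
        return image.copy()
    weight = (fmap - fmap.min()) / span      # rescale to [0, 1]
    if not toward_human:
        weight = 1.0 - weight                # invert: emphasize unfixated areas
    weight = dim + (1.0 - dim) * weight      # rescale to [dim, 1]
    return image * weight[..., None]         # broadcast over colour channels


def map_similarity(saliency, fixation_map):
    """Pearson correlation between two maps of identical shape.

    Assumed similarity measure; both maps must be non-constant.
    """
    a = np.asarray(saliency, dtype=float).ravel()
    b = np.asarray(fixation_map, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])


# Toy example: a white 4 x 4 image with a single fixated pixel.
image = np.ones((4, 4, 3))
fixations = np.zeros((4, 4))
fixations[1, 1] = 1.0

human_like = guide_focus(image, fixations, toward_human=True)
non_human_like = guide_focus(image, fixations, toward_human=False)
```

In this toy setup the human-like variant keeps only the fixated pixel at full brightness, while the non-human-like variant dims exactly that pixel. A model's saliency map can then be scored against the fixation map with `map_similarity`, where values near 1 indicate overlap with human fixations and values near -1 indicate attention to complementary regions.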