Paper Title
ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection
Paper Authors
Paper Abstract
Face Presentation Attack Detection (PAD) is an important measure to prevent spoofing attacks on face biometric systems. Many works based on Convolutional Neural Networks (CNNs) for face PAD formulate the problem as an image-level binary classification task without considering the context. Alternatively, Vision Transformers (ViT), which use self-attention to attend to the context of an image, have become mainstream in face PAD. Inspired by ViT, we propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range spatio-temporal attention, which can not only focus on local details with short attention within a frame but also capture long-range dependencies across frames. Instead of using coarse image patches at a single scale as in ViT, we propose the Multi-scale Multi-Head Self-Attention (MsMHSA) architecture, which accommodates multi-scale patch partitions of the Q, K, V feature maps across the heads of the transformer in a coarse-to-fine manner, enabling the model to learn a fine-grained representation for pixel-level discrimination in face PAD. Because pure transformers lack the inductive biases of convolutions, we also introduce convolutions into the proposed ViTransPAD to integrate the desirable properties of CNNs, using convolutional patch embedding and convolutional projection. Extensive experiments show the effectiveness of the proposed ViTransPAD, which achieves a favorable accuracy-computation balance and can serve as a new backbone for face PAD.
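To illustrate the multi-scale attention idea described in the abstract, here is a minimal NumPy sketch, not the paper's implementation: each attention head attends over keys/values at a different granularity, from fine to coarse. Average pooling of the token sequence stands in for the paper's multi-scale patch partitioning, and random matrices stand in for the learned Q, K, V projections; all function names and stride choices here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pool_tokens(x, stride):
    """Average-pool a (N, D) token sequence along N with the given stride,
    coarsening the patch grid (stand-in for multi-scale patch partitioning)."""
    n, d = x.shape
    n_coarse = n // stride
    return x[:n_coarse * stride].reshape(n_coarse, stride, d).mean(axis=1)

def ms_mhsa(x, strides, rng):
    """Multi-scale multi-head self-attention sketch: each head computes
    attention with K and V pooled at a different scale (coarse-to-fine),
    while queries stay at full resolution."""
    n, d = x.shape
    dh = d // len(strides)  # per-head dimension
    outs = []
    for s in strides:
        # random projections stand in for learned Wq, Wk, Wv
        wq, wk, wv = (rng.standard_normal((d, dh)) / np.sqrt(d) for _ in range(3))
        q = x @ wq                      # (n, dh) queries at full resolution
        kv_in = pool_tokens(x, s)       # keys/values coarsened per head
        k, v = kv_in @ wk, kv_in @ wv
        attn = softmax(q @ k.T / np.sqrt(dh))  # (n, n // s) attention map
        outs.append(attn @ v)
    return np.concatenate(outs, axis=1)  # heads concatenated back to (n, d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 12))  # 16 patch tokens, embedding dim 12
y = ms_mhsa(tokens, strides=[1, 2, 4], rng=rng)
print(y.shape)  # (16, 12)
```

The fine-scale head (stride 1) preserves local detail, while coarser heads see a pooled context, mirroring the short-range vs. long-range attention trade-off the abstract describes.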