Paper Title
Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers
Paper Authors
Paper Abstract
Multi-modal learning has intensified in recent years, especially for applications in facial analysis and action unit (AU) detection, yet two main challenges remain: 1) learning relevant feature representations and 2) efficiently fusing multiple modalities. Recently, a number of works have shown the effectiveness of attention mechanisms for AU detection; however, most of them bind regions of interest (ROIs) to features but rarely apply attention across the features of individual AUs. On the other hand, the transformer, which uses a more efficient self-attention mechanism, has been widely adopted in natural language processing and computer vision tasks but remains under-explored for AU detection. In this paper, we propose a novel end-to-end Multi-Head Fused Transformer (MFT) method for AU detection, which learns AU-encoded feature representations from different modalities with a transformer encoder and fuses the modalities with a separate fusion transformer module. A multi-head fusion attention mechanism is designed in the fusion transformer module for effective fusion of multiple modalities. Our approach is evaluated on two public multi-modal AU databases, BP4D and BP4D+, and the results are superior to state-of-the-art algorithms and baseline models. We further analyze AU detection performance across the different modalities.
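The abstract's core idea of multi-head fusion attention can be illustrated as cross-modal attention in which queries come from one modality and keys/values from another. Below is a minimal NumPy sketch of that idea only, not the authors' implementation: the function name, the random projection weights (stand-ins for learned parameters), the token count, and the feature dimension are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_fusion_attention(feat_a, feat_b, num_heads, rng):
    """Cross-modal multi-head attention sketch (hypothetical, not the paper's code):
    queries are drawn from modality A; keys and values from modality B."""
    seq_len, d_model = feat_a.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Random projections stand in for learned weight matrices.
    w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    # Project, then split the model dimension into (num_heads, d_head).
    q = (feat_a @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (feat_b @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (feat_b @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    fused = softmax(scores) @ v                           # (heads, seq, d_head)
    # Concatenate heads back into a single d_model-dimensional representation.
    return fused.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
visual = rng.standard_normal((12, 64))   # e.g. 12 AU tokens from a visual encoder
thermal = rng.standard_normal((12, 64))  # the same tokens from a second modality
out = multi_head_fusion_attention(visual, thermal, num_heads=8, rng=rng)
print(out.shape)  # (12, 64)
```

The fused output keeps one token per AU, so per-AU classifiers can be attached directly; in the paper's full design, per-modality transformer encoders would precede this fusion step.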