Paper Title

TEA: Temporal Excitation and Aggregation for Action Recognition

Paper Authors

Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, Limin Wang

Paper Abstract

Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
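The abstract describes the mechanics of the two modules: the ME module excites motion-sensitive channels using feature-level temporal differences, and the MTA module splits a local temporal convolution into a hierarchy of sub-convolutions to enlarge the temporal receptive field. Below is a minimal PyTorch sketch of both ideas under stated assumptions; the reduction ratio, number of splits, kernel sizes, and exact layer arrangement are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of the ME and MTA ideas from the abstract (not the official code).
import torch
import torch.nn as nn


class MotionExcitation(nn.Module):
    """ME sketch: feature-level temporal differences excite motion-sensitive channels."""

    def __init__(self, channels, reduction=16):          # reduction ratio is an assumption
        super().__init__()
        red = channels // reduction
        self.squeeze = nn.Conv2d(channels, red, kernel_size=1)    # channel reduction
        self.transform = nn.Conv2d(red, red, kernel_size=3, padding=1, groups=red)
        self.expand = nn.Conv2d(red, channels, kernel_size=1)     # channel restoration
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        xr = self.squeeze(x.reshape(n * t, c, h, w)).reshape(n, t, -1, h, w)
        # temporal difference between the transformed frame t+1 and frame t
        nxt = self.transform(xr[:, 1:].reshape(-1, xr.shape[2], h, w))
        nxt = nxt.reshape(n, t - 1, -1, h, w)
        diff = nxt - xr[:, :-1]
        diff = torch.cat([diff, diff.new_zeros(n, 1, diff.shape[2], h, w)], dim=1)  # pad last frame
        attn = torch.sigmoid(self.expand(self.pool(diff.reshape(n * t, -1, h, w))))
        attn = attn.reshape(n, t, c, 1, 1)
        return x + x * attn   # residual excitation so non-motion channels are preserved


class MultipleTemporalAggregation(nn.Module):
    """MTA sketch: a local temporal conv deformed into a hierarchy of sub-convolutions."""

    def __init__(self, channels, splits=4):              # number of splits is an assumption
        super().__init__()
        assert channels % splits == 0
        self.splits = splits
        cs = channels // splits
        # one (temporal 1D + spatial 2D) sub-convolution pair per split after the first
        self.t_convs = nn.ModuleList(
            [nn.Conv1d(cs, cs, kernel_size=3, padding=1, groups=cs) for _ in range(splits - 1)]
        )
        self.s_convs = nn.ModuleList(
            [nn.Conv2d(cs, cs, kernel_size=3, padding=1) for _ in range(splits - 1)]
        )

    def forward(self, x):
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        chunks = list(torch.chunk(x, self.splits, dim=2))
        out = [chunks[0]]                 # first split passes through unchanged
        prev = chunks[0]
        for i in range(1, self.splits):
            y = chunks[i] + prev          # hierarchical residual connection between splits
            cs = y.shape[2]
            # temporal convolution over T at every spatial location
            yt = y.permute(0, 3, 4, 2, 1).reshape(n * h * w, cs, t)
            yt = self.t_convs[i - 1](yt).reshape(n, h, w, cs, t).permute(0, 4, 3, 1, 2)
            # spatial convolution applied per frame
            ys = self.s_convs[i - 1](yt.reshape(n * t, cs, h, w)).reshape(n, t, cs, h, w)
            out.append(ys)
            prev = ys
        return torch.cat(out, dim=2)


if __name__ == "__main__":
    x = torch.randn(2, 8, 64, 14, 14)                 # (batch, frames, channels, H, W)
    print(MotionExcitation(64)(x).shape)              # torch.Size([2, 8, 64, 14, 14])
    print(MultipleTemporalAggregation(64)(x).shape)   # torch.Size([2, 8, 64, 14, 14])
```

In this sketch the later splits see features that have already passed through earlier sub-convolutions, so each extra split effectively widens the temporal receptive field without adding channels, which mirrors the "multiple temporal aggregations with neighborhoods" described in the abstract.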
