Paper Title
Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors
Paper Authors
Paper Abstract
In this paper, we build on a concept of self-supervision by taking RGB frames as input to learn to predict both action concepts and auxiliary descriptors, e.g., object descriptors. So-called hallucination streams are trained to predict auxiliary cues, simultaneously fed into classification layers, and then hallucinated at the testing stage to aid the network. We design and hallucinate two descriptors, one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. The first descriptor encodes the detector- and ImageNet-wise class prediction scores, confidence scores, and spatial locations of bounding boxes and frame indexes to capture the spatio-temporal distribution of features per video. The second descriptor encodes spatio-angular gradient distributions of saliency maps and intensity patterns. Inspired by the characteristic function of a probability distribution, we capture four statistical moments on the above intermediate descriptors. As the numbers of coefficients in the mean, covariance, coskewness, and cokurtosis grow linearly, quadratically, cubically, and quartically w.r.t. the dimension of feature vectors, we describe the covariance matrix by its leading n' eigenvectors (a so-called subspace) and capture skewness/kurtosis rather than the costly coskewness/cokurtosis. We obtain state-of-the-art results on five popular datasets such as Charades and EPIC-Kitchens.
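The abstract's moment/subspace idea, i.e., keeping the mean, the leading n' covariance eigenvectors, and per-dimension skewness/kurtosis in place of the cubic/quartic coskewness/cokurtosis tensors, can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's exact formulation: the eigenvalue scaling, the small-epsilon normalization, and the default n' are all assumptions made here for the sketch.

```python
import numpy as np

def moment_subspace_descriptor(X, n_prime=3):
    """Illustrative moment/subspace descriptor for features X of shape (N, d).

    Concatenates: mean (d coefficients), leading n' covariance eigenvectors
    scaled by sqrt of their eigenvalues (n'*d coefficients), per-dimension
    skewness (d) and excess kurtosis (d). Sizes grow linearly in d, unlike
    full coskewness (cubic) and cokurtosis (quartic) tensors.
    """
    X = np.asarray(X, dtype=np.float64)
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / max(len(X) - 1, 1)

    # "Subspace" part: leading n' eigenvectors of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_prime]       # take the largest n'
    subspace = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

    # Marginal third/fourth moments instead of coskewness/cokurtosis.
    sigma = Xc.std(axis=0) + 1e-12                    # avoid division by zero
    Z = Xc / sigma
    skew = (Z ** 3).mean(axis=0)
    kurt = (Z ** 4).mean(axis=0) - 3.0                # excess kurtosis

    return np.concatenate([mu, subspace.ravel(), skew, kurt])

# Example: 100 random 8-D feature vectors -> descriptor of length
# 8 (mean) + 8*3 (subspace) + 8 (skew) + 8 (kurt) = 48.
desc = moment_subspace_descriptor(np.random.randn(100, 8), n_prime=3)
print(desc.shape)
```

For d-dimensional features this descriptor has d(n' + 3) coefficients, versus O(d^3) for coskewness and O(d^4) for cokurtosis, which is the size trade-off the abstract motivates.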