Paper Title
DTG-Net: Differentiated Teachers Guided Self-Supervised Video Action Recognition
Authors
Abstract
State-of-the-art video action recognition models with complex network architectures have achieved significant improvements, but these models depend heavily on large-scale, well-labeled datasets. To reduce this dependency, we propose a self-supervised teacher-student architecture, the Differentiated Teachers Guided self-supervised Network (DTG-Net). In DTG-Net, besides reducing the dependency on labeled data through self-supervised learning (SSL), pre-trained models of action-related tasks serve as teachers, providing prior knowledge that alleviates the need for a large number of unlabeled videos in SSL. Specifically, leveraging years of effort on action-related tasks, e.g., image classification and image-based action recognition, DTG-Net learns self-supervised video representations under the guidance of various teachers, i.e., well-trained models of those tasks. DTG-Net is optimized by contrastive self-supervised learning. Two image sequences are randomly sampled from the same video to form a positive pair, or from different videos to form a negative pair, and are then sent to the teacher and student networks for feature embedding. Contrastive feature consistency is then defined between the feature embeddings of each pair: consistent for positive pairs and inconsistent for negative pairs. To reflect the different guidance offered by different teacher tasks, we also explore weighting the guidance of each teacher task differently. Finally, DTG-Net is evaluated in two ways: (i) the self-supervised DTG-Net is used to pre-train supervised action recognition models with only unlabeled videos; (ii) the supervised DTG-Net is jointly trained with supervised action networks in an end-to-end way. DTG-Net outperforms most pre-training methods and is also highly competitive with supervised action recognition methods.
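The abstract does not state the exact loss, so the following is only a minimal sketch of how a weighted, teacher-guided contrastive consistency term might look. It assumes cosine similarity as the consistency measure, a hinge on negative pairs, and teacher and student embeddings that have already been projected to the same dimension. The function names, the `margin` parameter, and the teacher names are hypothetical and do not come from the paper.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_consistency_loss(student_emb: Vector, teacher_emb: Vector,
                          is_positive: bool, margin: float = 0.0) -> float:
    """Assumed loss form: pull positive pairs together, push negatives apart."""
    sim = cosine_similarity(student_emb, teacher_emb)
    if is_positive:
        # Positive pair (clips from the same video): penalize low similarity.
        return 1.0 - sim
    # Negative pair (clips from different videos): penalize similarity above margin.
    return max(0.0, sim - margin)


def weighted_teacher_loss(student_emb: Vector,
                          teacher_embs: Dict[str, Vector],
                          teacher_weights: Dict[str, float],
                          is_positive: bool) -> float:
    """Weighted average of the per-teacher consistency losses."""
    total_weight = sum(teacher_weights[name] for name in teacher_embs)
    if total_weight <= 0.0:
        raise ValueError("teacher weights must sum to a positive value")
    loss = 0.0
    for name, teacher_emb in teacher_embs.items():
        loss += teacher_weights[name] * pair_consistency_loss(
            student_emb, teacher_emb, is_positive)
    return loss / total_weight


if __name__ == "__main__":
    # Toy 2-D embeddings; a real model would produce these with networks.
    student = [1.0, 0.0]
    teachers = {"image_classification": [1.0, 0.0],
                "image_action_recognition": [0.0, 1.0]}
    weights = {"image_classification": 0.25,
               "image_action_recognition": 0.75}
    print(weighted_teacher_loss(student, teachers, weights, is_positive=True))
```

In this sketch the per-teacher weights are fixed numbers chosen by hand. The abstract says only that different weightings of the teacher tasks are explored, so whether they are fixed or learned is left open here.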