Paper Title
Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?
Paper Authors
Paper Abstract
How can we collect and use video datasets to further improve spatiotemporal 3D Convolutional Neural Networks (3D CNNs)? To answer this open question in video recognition positively, we conducted an exploratory study using several large-scale video datasets and 3D CNNs. In the early era of deep neural networks, 2D CNNs outperformed 3D CNNs in the context of video recognition. Recent studies revealed that 3D CNNs can outperform 2D CNNs when trained on a large-scale video dataset. However, prior work has relied heavily on architecture exploration rather than on dataset considerations. Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs, with the following findings: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy. We reveal that a carefully annotated dataset (e.g., Kinetics-700) effectively pre-trains a video representation for a video classification task. (ii) We confirm the relationship between #category/#instance and video classification accuracy. The results show that, when constructing a dataset, #category should initially be fixed and #instance then increased. (iii) To extend a video dataset in practice, we simply concatenate publicly available datasets, such as the Kinetics-700 and Moments in Time (MiT) datasets. Compared with Kinetics-700 pre-training, the merged dataset further enhances spatiotemporal 3D CNNs, e.g., by +0.9, +3.4, and +1.1 points on the UCF-101, HMDB-51, and ActivityNet datasets, respectively, after fine-tuning. (iv) In terms of recognition architecture, the Kinetics-700 and merged-dataset pre-trained models scale recognition performance up to a 200-layer Residual Network (ResNet), whereas the Kinetics-400 pre-trained model cannot successfully optimize the 200-layer architecture.
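The dataset concatenation described in point (iii) amounts to merging two labeled datasets into a single, non-overlapping label space. A minimal sketch of that idea, assuming each dataset is a list of `(video_id, class_index)` pairs and that Kinetics-700 has 700 classes and MiT 339 (the sample ids and helper name below are hypothetical, not from the paper):

```python
# Sketch: merge two video datasets for joint pre-training by offsetting
# the class indices of the second dataset past those of the first,
# yielding one combined label space (e.g., 700 + 339 = 1039 classes).

def merge_datasets(ds_a, num_classes_a, ds_b):
    """Each dataset is a list of (video_id, class_index) pairs.

    Labels of ds_b are shifted by num_classes_a so the merged dataset
    has a single, non-overlapping label space.
    """
    merged = list(ds_a)
    merged += [(vid, num_classes_a + label) for vid, label in ds_b]
    return merged

# Toy example with hypothetical sample ids:
kinetics = [("k_000", 0), ("k_001", 699)]   # labels in [0, 700)
mit = [("m_000", 0), ("m_001", 338)]        # labels in [0, 339)

merged = merge_datasets(kinetics, 700, mit)
# MiT labels now occupy [700, 1039); the classifier head of the 3D CNN
# would then be sized to the merged number of classes.
```

A model pre-trained on such a merged set is later fine-tuned on the target dataset (UCF-101, HMDB-51, or ActivityNet) with a freshly initialized classification layer.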