Paper Title
Self-Emphasizing Network for Continuous Sign Language Recognition
Paper Authors
Paper Abstract
Hand and face play an important role in expressing sign language, and their features are often explicitly leveraged to improve system performance. However, to effectively extract visual representations and capture the trajectories of hands and face, previous methods incur high computational costs and increased training complexity: they typically employ extra heavy pose-estimation networks to locate human body keypoints, or rely on additional pre-extracted heatmaps for supervision. To relieve this problem, we propose a self-emphasizing network (SEN) that emphasizes informative spatial regions in a self-motivated way, with few extra computations and without additional expensive supervision. Specifically, SEN first employs a lightweight subnetwork to incorporate local spatial-temporal features and identify informative regions, and then dynamically augments the original features via attention maps. We also observe that not all frames contribute equally to recognition, so we present a temporal self-emphasizing module to adaptively emphasize discriminative frames and suppress redundant ones. A comprehensive comparison with previous methods equipped with hand and face features demonstrates the superiority of our approach, even though those methods require heavy computation and rely on expensive extra supervision. Remarkably, with few extra computations, SEN achieves new state-of-the-art accuracy on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. Visualizations verify the effect of SEN in emphasizing informative spatial and temporal features. Code is available at https://github.com/hulianyuyy/SEN_CSLR
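To make the described mechanism concrete, below is a minimal PyTorch sketch of the two ideas from the abstract: a lightweight subnetwork that produces spatial attention maps from local spatial-temporal context, and a temporal module that reweights frames. All module names, kernel sizes, and the residual re-weighting form are illustrative assumptions, not the authors' exact design; their implementation is in the linked repository.

```python
import torch
import torch.nn as nn


class SpatialSelfEmphasizing(nn.Module):
    """Sketch of spatial self-emphasizing: a lightweight 3D-conv subnetwork
    gathers local spatial-temporal context, predicts a per-location attention
    map, and rescales the original features (residual form assumed)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.subnet = nn.Sequential(
            # 3x3x3 conv over (time, height, width) collects local context
            nn.Conv3d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),  # attention weights in [0, 1] per location
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        attn = self.subnet(x)  # (batch, 1, time, height, width)
        return x * (1 + attn)  # emphasize informative spatial regions


class TemporalSelfEmphasizing(nn.Module):
    """Sketch of temporal self-emphasizing: score each frame from spatially
    pooled features and reweight frames, emphasizing discriminative ones
    and suppressing redundant ones."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(
            # 1D conv over time captures local temporal context
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        pooled = x.mean(dim=(3, 4))          # (batch, channels, time)
        w = self.score(pooled)               # per-frame channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 16, 28, 28)   # toy video feature tensor
    feats = SpatialSelfEmphasizing(64)(feats)
    feats = TemporalSelfEmphasizing(64)(feats)
    print(feats.shape)                        # torch.Size([2, 64, 16, 28, 28])
```

Because both modules only add a few small convolutions and elementwise rescaling, they add little overhead on top of a standard visual backbone, which is consistent with the abstract's "few extra computations" claim.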