Paper Title
Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
Paper Authors
Paper Abstract
The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.
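The abstract describes the objective only at a high level: a cross-modal term that aligns paired audio and visual embeddings, combined with a term that enforces separation between distinct samples within each modality. The sketch below is one plausible PyTorch realisation of such a combined objective, not the paper's actual formulation; the InfoNCE-style cross-modal term, the cosine-margin separation term, and all names and hyper-parameters (`temperature`, `margin`, `weight`) are illustrative assumptions.

```python
# A minimal sketch (assumed, not the authors' released code) of a combined
# self-supervised objective: a cross-modal contrastive term plus an
# intra-modal separation term applied to each modality.

import torch
import torch.nn.functional as F


def cross_modal_loss(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style loss: the i-th audio clip should match the i-th video clip."""
    audio_emb = F.normalize(audio_emb, dim=1)
    video_emb = F.normalize(video_emb, dim=1)
    logits = audio_emb @ video_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(audio_emb.size(0), device=logits.device)
    # Symmetric: audio-to-video and video-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def intra_modal_separation(emb, margin=0.5):
    """Hinge penalty pushing apart distinct samples within one modality
    (each sample treated as its own class, as in self-supervised training)."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t()                                       # (B, B) cosine similarities
    sim = sim - torch.diag_embed(sim.diagonal())              # zero out self-similarity
    # Penalise off-diagonal pairs whose similarity exceeds the margin.
    return F.relu(sim - margin).sum() / (emb.size(0) * (emb.size(0) - 1))


def total_loss(audio_emb, video_emb, weight=1.0):
    """Combined objective: cross-modal alignment + per-modality separation."""
    return (cross_modal_loss(audio_emb, video_emb)
            + weight * (intra_modal_separation(audio_emb)
                        + intra_modal_separation(video_emb)))
```

The `weight` factor balances how strongly within-modality separation is enforced relative to cross-modal alignment; in this reading of the abstract, the separation terms are what make the resulting embeddings more discriminative for uni-modal downstream tasks such as lip reading and speaker recognition.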