Paper Title

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

Paper Authors

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, Bo Zheng

Paper Abstract

Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem, due to the lack of available datasets, models, and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), which consists of the above six modalities modeled in a cascaded architecture for gesture synthesis. To evaluate semantic relevance, we introduce a metric, Semantic Relevance Gesture Recall (SRGR). Qualitative and quantitative experiments demonstrate the metric's validity, the ground-truth data quality, and the baseline's state-of-the-art performance. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating human gestures, which may contribute to a number of different research fields, including controllable gesture synthesis, cross-modality analysis, and emotional gesture recognition. The data, code, and model are available at https://pantomatrix.github.io/BEAT/.
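The abstract only names SRGR without defining it. As a rough illustration of the idea, the sketch below treats SRGR as a semantic-score-weighted variant of PCK-style keypoint recall, so that frames annotated as semantically relevant contribute more to the score. The function name `srgr`, the array shapes, and the `threshold` default are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def srgr(pred, gt, semantic_scores, threshold=0.3):
    """Sketch of Semantic Relevance Gesture Recall (SRGR).

    pred, gt: (T, J, 3) predicted / ground-truth joint positions.
    semantic_scores: (T,) frame-level semantic relevance annotations.
    threshold: distance below which a joint counts as recalled (PCK-style).
    """
    # Per-joint hit: 1 if the predicted joint lies within `threshold` of ground truth.
    dist = np.linalg.norm(pred - gt, axis=-1)        # (T, J)
    hits = (dist < threshold).astype(np.float32)     # (T, J)
    per_frame_recall = hits.mean(axis=-1)            # (T,)

    # Weight each frame's recall by its normalized semantic relevance score,
    # so frames carrying semantic gestures dominate the metric.
    weights = semantic_scores / (semantic_scores.sum() + 1e-8)
    return float((weights * per_frame_recall).sum())
```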
