Paper Title
Detailed 2D-3D Joint Representation for Human-Object Interaction
Paper Authors
Paper Abstract
Human-Object Interaction (HOI) detection lies at the core of action understanding. Besides 2D information such as human/object appearance and locations, 3D pose is also usually utilized in HOI learning due to its view independence. However, coarse 3D body joints carry only sparse body information and are not sufficient for understanding complex interactions. Thus, we need detailed 3D body shape to go further. Meanwhile, the interacting object in 3D has also not been fully studied in HOI learning. In light of these, we propose a detailed 2D-3D joint representation learning method. First, we utilize a single-view human body capture method to obtain detailed 3D body, face, and hand shapes. Next, we estimate the 3D object location and size with reference to the 2D human-object spatial configuration and object category priors. Finally, a joint learning framework and cross-modal consistency tasks are proposed to learn the joint HOI representation. To better evaluate the 2D ambiguity handling capacity of models, we propose a new benchmark named Ambiguous-HOI, consisting of hard ambiguous images. Extensive experiments on a large-scale HOI benchmark and Ambiguous-HOI show the impressive effectiveness of our method. Code and data are available at https://github.com/DirtyHarryLYL/DJ-RN.
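To make the object-lifting step in the abstract concrete, below is a minimal sketch of how a rough 3D object location and size could be recovered from a 2D detection box using a pinhole camera model and a category-level size prior. The function name, the intrinsics, and the 1.8 m size prior are assumptions for illustration only; the paper's actual estimation additionally references the 2D human-object spatial configuration rather than relying on camera intrinsics alone.

```python
import numpy as np

def estimate_object_3d(bbox_2d, prior_size_m, focal_px, principal_point):
    """Back-project a 2D object box to a rough 3D location and size.

    A minimal pinhole-camera sketch (not the paper's exact procedure):
    depth is chosen so that the category-level size prior (in meters)
    projects to the observed box width in pixels.

    bbox_2d:         (x1, y1, x2, y2) in pixels
    prior_size_m:    assumed real-world object extent for this category
    focal_px:        assumed focal length in pixels
    principal_point: assumed (cx, cy) in pixels
    """
    x1, y1, x2, y2 = bbox_2d
    box_w = max(x2 - x1, 1e-6)                  # pixel extent of the object
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0     # box center in pixels
    cx, cy = principal_point

    z = focal_px * prior_size_m / box_w         # depth implied by the size prior
    x = (u - cx) * z / focal_px                 # back-project the box center
    y = (v - cy) * z / focal_px
    return np.array([x, y, z]), prior_size_m

# Usage example: a 100-px-wide box with a hypothetical ~1.8 m size prior
center_3d, size_3d = estimate_object_3d((300, 200, 400, 260), 1.8, 1000.0, (320, 240))
```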