Paper Title

T3VIP: Transformation-based 3D Video Prediction

Authors

Iman Nematollahi, Erick Rosete-Beas, Seyed Mahdi B. Azad, Raghu Rajan, Frank Hutter, Wolfram Burgard

Abstract

For autonomous skill acquisition, robots have to learn about the physical rules governing the 3D world dynamics from their own past experience to predict and reason about plausible future outcomes. To this end, we propose a transformation-based 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals. To fully leverage all the 2D and 3D observational signals, we equip our model with automatic hyperparameter optimization (HPO) to interpret the best way of learning from them. To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera. Our extensive evaluation with simulated and real-world datasets demonstrates that our formulation leads to interpretable 3D models that predict future depth videos while achieving on-par performance with 2D models on RGB video prediction. Moreover, we demonstrate that our model outperforms 2D baselines on visuomotor control. Videos, code, dataset, and pre-trained models are available at http://t3vip.cs.uni-freiburg.de.
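
To illustrate the core idea described in the abstract, the following is a minimal, hypothetical sketch (not the authors' released code) of how per-object-part rigid transformations and soft segmentation masks could be combined to warp the current point cloud into a prediction of the next frame. The function name, tensor shapes, and the mask-weighted blending scheme are assumptions made for illustration only.

# Hypothetical sketch of transformation-based point cloud prediction:
# decompose the scene into K object parts via soft masks and move each part
# with its own rigid transform (rotation + translation) to predict the next frame.
import torch

def predict_next_point_cloud(points, masks, rotations, translations):
    # points:       (N, 3) current point cloud
    # masks:        (K, N) soft assignment of each point to one of K object parts
    #               (columns sum to 1 over the K parts)
    # rotations:    (K, 3, 3) per-part rotation matrices
    # translations: (K, 3) per-part translation vectors
    # returns:      (N, 3) predicted point cloud for the next time step

    # Apply every rigid transform to all points: result has shape (K, N, 3)
    transformed = torch.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    # Blend the K rigidly moved copies per point using the soft object masks
    return torch.einsum('kn,kni->ni', masks, transformed)

# Toy usage with random tensors (shapes only; values are meaningless)
N, K = 1024, 4
points = torch.randn(N, 3)
masks = torch.softmax(torch.randn(K, N), dim=0)   # soft part assignment per point
rotations = torch.eye(3).repeat(K, 1, 1)          # identity rotations for the toy example
translations = torch.zeros(K, 3)
next_points = predict_next_point_cloud(points, masks, rotations, translations)
print(next_points.shape)  # torch.Size([1024, 3])

In such a formulation the 3D motion stays interpretable: each of the K parts moves as a rigid body, and the soft masks determine which transform governs each point of the scene.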
