Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views

Wentian Qu1,2
Chenyu Meng1,2
Cuixia Ma1,2
Hongan Wang1,2*

1Institute of Software, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
3State Key Lab of CAD&CG, Zhejiang University

We propose a neural rendering and pose estimation system for hand-object interaction using sparse-view images. (a) During the offline stage, we learn hand and object models that enable rendering and shape reconstruction. During the online stage, we initialize the pose from sparse camera views (b) and then conduct online fitting to improve pose estimation, which enables photo-realistic free-viewpoint rendering (c). Our framework also naturally supports hand-object interaction editing.


Hand-object interaction understanding and the barely addressed novel view synthesis are highly desired in immersive communication, yet both are challenging due to the high deformation of the hand and heavy occlusions between hand and object. In this paper, we propose a neural rendering and pose estimation system for hand-object interaction from sparse views, which also enables 3D hand-object interaction editing. We draw inspiration from recent scene understanding work showing that a scene-specific model built beforehand can significantly improve and unblock vision tasks, especially when inputs are sparse. We extend this idea to the dynamic hand-object interaction scenario and solve the problem in two stages. At the offline stage, we first learn the shape and appearance priors of hands and objects separately with a neural representation. During the online stage, we design a rendering-based joint model fitting framework to understand the dynamic hand-object interaction using the pre-built hand and object models together with interaction priors, which overcomes penetration and separation issues between hand and object and also enables novel view synthesis. To obtain stable contact during hand-object interaction over a sequence, we propose a stable contact loss that keeps the contact region consistent. Experiments demonstrate that our method outperforms state-of-the-art methods.
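The text above does not give the formula of the stable contact loss, so the following is only a minimal sketch of one plausible form, not the authors' definition. It assumes that contact points on the hand are tracked across frames and that the object pose in each frame is a rotation matrix and a translation vector. The names `to_object_frame` and `stable_contact_loss` are hypothetical. The idea illustrated is that a contact point should stay at the same place on the object, so its coordinates in the object frame should not change between consecutive frames.

```python
def to_object_frame(point, rotation, translation):
    """Express a world-space point in object coordinates: R^T (x - t)."""
    d = [p - t for p, t in zip(point, translation)]
    # Dot products with the columns of R implement multiplication by R^T.
    return [sum(rotation[r][c] * d[r] for r in range(3)) for c in range(3)]


def stable_contact_loss(contact_points_per_frame, object_poses):
    """Hypothetical temporal consistency term (an assumption, not the paper's loss).

    contact_points_per_frame[f][k] is the world position of contact point k
    in frame f; object_poses[f] is a (rotation, translation) pair.
    Returns the summed squared drift of contact points in the object frame
    between consecutive frames.
    """
    loss = 0.0
    previous = None
    for points, (rotation, translation) in zip(contact_points_per_frame, object_poses):
        local = [to_object_frame(p, rotation, translation) for p in points]
        if previous is not None:
            for a, b in zip(previous, local):
                loss += sum((x - y) ** 2 for x, y in zip(a, b))
        previous = local
    return loss
```

With this form, a fingertip that moves rigidly with the object contributes zero loss, while a fingertip that slides over the surface is penalized in proportion to the squared sliding distance.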



Given sparse-view observations of hand-object interaction, we aim to synthesize free-viewpoint renderings of the scene and estimate the hand skeleton pose and the object 6D pose. Our framework is divided into two stages: offline model building and online model fitting.

Offline stage to learn hand and object models. Top: Hand model. Each sampling point on the ray is converted to the local coordinate system of each hand part through the bone transformation. We then encode the point into embedding vectors and feed them to the hand model to get the SDF and color values. Bottom: Object model. We convert the sampling point to model coordinates using the object pose, and get the SDF and color values.
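The coordinate conversion described in this caption can be illustrated with a small sketch. It is not the released implementation: the learned networks are replaced by an analytic sphere SDF, the embedding is assumed to be a sin/cos positional encoding (the caption only says "embedding vectors"), and all function names are hypothetical. It shows how a world-space sample point is moved into model coordinates with the inverse rigid transform before the model is queried.

```python
import math


def rotation_z(angle):
    """3x3 rotation matrix about the z-axis, as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def world_to_model(point, rotation, translation):
    """Inverse rigid transform: x_model = R^T (x_world - t)."""
    d = [p - t for p, t in zip(point, translation)]
    # Dot products with the columns of R implement multiplication by R^T.
    return [sum(rotation[r][c] * d[r] for r in range(3)) for c in range(3)]


def positional_encoding(point, num_freqs=4):
    """Assumed sin/cos embedding of a 3D point at frequencies 1, 2, 4, ..."""
    features = []
    for k in range(num_freqs):
        for p in point:
            features.append(math.sin((2 ** k) * p))
            features.append(math.cos((2 ** k) * p))
    return features


def sphere_sdf(point, radius=1.0):
    """Stand-in for a learned model: signed distance to a sphere."""
    return math.sqrt(sum(p * p for p in point)) - radius


def query_object(point_world, rotation, translation):
    """Return (SDF value, embedding) for a world-space sample point."""
    local = world_to_model(point_world, rotation, translation)
    return sphere_sdf(local), positional_encoding(local)
```

For the hand model, the same inverse transform would be applied once per hand part, using that part's bone transformation in place of the object pose.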
Online stage for joint model fitting. We use hand and object pose estimation networks for initialization, and refine the poses with a single-frame loss function for single frames and a video-based loss function for video sequences.
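The initialize-then-refine pattern of the online stage can be sketched on a toy problem. The example below is a simplified assumption, not the paper's rendering-based objective: it refines only an object translation, uses an analytic sphere SDF in place of the pre-built object model, and minimizes the mean squared SDF of observed surface points with plain gradient descent. The function names, learning rate, and step count are illustrative.

```python
import math


def fitting_loss_and_grad(translation, observed_points, radius=1.0):
    """Mean squared SDF of observed surface points, and its gradient w.r.t. t."""
    n = len(observed_points)
    loss = 0.0
    grad = [0.0, 0.0, 0.0]
    for p in observed_points:
        d = [pi - ti for pi, ti in zip(p, translation)]
        norm = math.sqrt(sum(x * x for x in d))
        residual = norm - radius  # SDF of the point in the object frame
        loss += residual * residual / n
        for i in range(3):
            # d(residual)/dt_i = -d_i / norm
            grad[i] += 2.0 * residual * (-d[i] / norm) / n
    return loss, grad


def refine_translation(initial, observed_points, lr=0.5, steps=200):
    """Gradient descent starting from a coarse initial estimate."""
    t = list(initial)
    for _ in range(steps):
        _, grad = fitting_loss_and_grad(t, observed_points)
        t = [ti - lr * gi for ti, gi in zip(t, grad)]
    return t


# Surface points of a unit sphere centred at the true translation.
true_t = [0.3, -0.2, 0.1]
axes = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
observed = [[c + a for c, a in zip(true_t, axis)] for axis in axes]
refined = refine_translation([0.0, 0.0, 0.0], observed)
```

In the actual system the initial estimate comes from the pose estimation networks, and the loss compares renderings with the input views. The toy loss here only plays the role of a differentiable objective.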

Paper and Code

W. Qu, Z. Cui, Y. Zhang, C. Meng, C. Ma, X. Deng, H. Wang

Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views.

ICCV, 2023.

[Paper]     [Bibtex]     [Code]     [Dataset]    


Novel view synthesis and reconstruction of hand and hand-object interaction scenes. Our models contain realistic appearance and geometry details and can achieve full 360-degree free-viewpoint rendering.
(a) Effect of pose optimization in the online stage with joint model fitting. Optimization improves pose accuracy. (b) Effect of the interaction loss. The interaction loss helps achieve plausible hand-object interaction. (c) Editing of hand-object interaction scenes. We can replace the hand and object models and change the poses to get realistic rendering results.
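The caption does not state how the interaction loss is computed. As a hedged illustration of how an SDF-based object model can discourage the penetration and separation issues mentioned in the abstract, the sketch below penalizes hand points that fall inside the object (negative SDF) and contact points that float away from its surface (positive SDF). Both penalty functions and the sphere stand-in are assumptions made for this example.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Stand-in for the learned object SDF (negative inside, positive outside)."""
    return math.sqrt(sum(p * p for p in point)) - radius


def penetration_penalty(hand_points, object_sdf):
    """Sum of squared penetration depths of hand points inside the object."""
    return sum(max(0.0, -object_sdf(p)) ** 2 for p in hand_points)


def separation_penalty(contact_points, object_sdf):
    """Sum of squared gaps between intended contact points and the surface."""
    return sum(max(0.0, object_sdf(p)) ** 2 for p in contact_points)


hand_points = [[0.5, 0.0, 0.0], [2.0, 0.0, 0.0]]
pen = penetration_penalty(hand_points, sphere_sdf)
sep = separation_penalty(hand_points, sphere_sdf)
```

Here the first point lies 0.5 inside the unit sphere and the second lies 1.0 outside it, so each penalty is driven by exactly one of the two points.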


This work was supported in part by the National Key R&D Program of China (2022ZD0117900). The website is modified from this template.