Universal Features Guided Zero-Shot Category-Level Object Pose Estimation

Wentian Qu1,2,3
Chenyu Meng1,2
Heng Li3
Jian Cheng1,2
Cuixia Ma1,2
Hongan Wang1,2
Xiao Zhou4

1Institute of Software, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
3Hong Kong University of Science and Technology
4Aerospace Information Research Institute, Chinese Academy of Sciences


(a) We propose a zero-shot pose estimation method for unseen categories using universal features and obtain accurate results in multi-category scenes. Our method is more cost-efficient and generalizes better than traditional instance-level and category-level methods. (b) Correspondences from universal features degrade when there is a large gap between the initial and target poses. (c) Shape differences between objects cause pose ambiguity during optimization. Both challenges affect the accuracy of pose estimation.



Abstract

Object pose estimation, crucial in computer vision and robotics applications, faces challenges from the diversity of unseen categories. We propose a zero-shot method for category-level 6-DOF object pose estimation that exploits both 2D and 3D universal features of the input RGB-D image to establish semantic similarity-based correspondences, and that extends to unseen categories without additional model fine-tuning. Our method begins by combining efficient 2D universal features to find sparse correspondences between intra-category objects and obtains an initial coarse pose. To handle the degradation of 2D universal-feature correspondences when the pose deviates far from the target pose, we use an iterative strategy to optimize the pose. Subsequently, to resolve pose ambiguities caused by shape differences between intra-category objects, the coarse pose is refined by optimizing with a dense alignment constraint on 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.




Overview

We exploit multi-modal (both 2D and 3D) universal features to estimate object pose for unseen categories, in a coarse-to-fine framework for accurate 6-DOF pose estimation. The coarse stage identifies sparse correspondences to solve for an initial coarse object pose. Given an input RGB-D image, we use a reference model of the category of interest to render reference images and extract 2D universal features from both the target and rendered reference images. We then compute the cosine similarity map between the 2D features and use cyclical distance to select the Top-k correspondences. Combining the depth map and camera intrinsics, we lift the Top-k keypoints into the camera coordinate system and compute the transformation from the reference to the target space by a least-squares solution, which yields the initial coarse 6-DOF object pose. Because the correspondences from 2D universal features degrade when the initial pose deviates far from the target pose, we apply an iterative strategy to jointly optimize the correspondences and the coarse pose. After coarse pose estimation, we map the reference model into the target image space and perform pose refinement with pixel-wise optimization. To resolve pose ambiguities caused by shape differences between intra-category objects during optimization, we employ 3D universal features extracted from the point cloud to iteratively refine the 6-DOF object pose and the reference model by dense pixel-level registration.
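The coarse stage described above can be sketched in NumPy. This is a simplified illustration, not the paper's implementation: cyclical distance is approximated here by mutual (cycle-consistent) nearest neighbors, the feature sets are assumed to be already extracted per keypoint, and the least-squares solver is the standard closed-form similarity alignment (Umeyama-style SVD). All function names are illustrative.

```python
import numpy as np

def topk_correspondences(feat_ref, feat_tgt, k=32):
    """Select Top-k cycle-consistent matches from a cosine similarity map."""
    fr = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    ft = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    sim = fr @ ft.T                           # (Nr, Nt) cosine similarity map
    fwd = sim.argmax(axis=1)                  # reference -> target best match
    bwd = sim.argmax(axis=0)                  # target -> reference best match
    # Keep matches whose round trip returns to the starting point
    # (a simple stand-in for the cyclical-distance criterion).
    cycle_ok = bwd[fwd] == np.arange(len(fwd))
    idx_ref = np.where(cycle_ok)[0]
    idx_tgt = fwd[idx_ref]
    order = np.argsort(-sim[idx_ref, idx_tgt])[:k]   # Top-k by similarity
    return idx_ref[order], idx_tgt[order]

def least_squares_pose(P_ref, P_tgt):
    """Closed-form scale/rotation/translation aligning P_ref onto P_tgt."""
    mu_r, mu_t = P_ref.mean(0), P_tgt.mean(0)
    Xr, Xt = P_ref - mu_r, P_tgt - mu_t
    U, S, Vt = np.linalg.svd(Xt.T @ Xr)       # SVD of the cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # reflection fix
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (Xr ** 2).sum()
    t = mu_t - s * R @ mu_r
    return s, R, t
```

In the full pipeline this solve would be wrapped in the iterative loop: re-render the reference at the current pose estimate, re-extract features, re-match, and re-solve until the correspondences stabilize.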

Overview. Our framework includes a keypoint-level coarse pose estimation module and a pixel-level pose refinement module. In the first module, we establish correspondences between image pairs based on 2D universal features and compute the coarse pose with least squares in an iterative manner. In the second module, we use pixel-level optimization combined with 3D universal features to refine the pose and shape of the reference model and obtain the fine pose.
(a) Pose Refinement. Using the coarse pose as initialization, the reference model is warped into the target space to obtain the initial mask and extract 3D universal features. We then optimize the coarse pose and shape by minimizing the loss function. (b) After the pose refinement stage, the pose and shape of the reference model are more accurately aligned with the target object.
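The refinement idea can be illustrated with a minimal feature-guided registration loop. This is a sketch under simplifying assumptions, not the paper's method: instead of rendering-based pixel-wise losses and shape optimization, it alternates dense matching (blending geometric distance with 3D-feature distance) with a closed-form pose update. All names and the blending weight `alpha` are illustrative.

```python
import numpy as np

def refine_pose(P_ref, F_ref, P_tgt, F_tgt, s, R, t, iters=10, alpha=0.5):
    """Refine (s, R, t) by iterating dense feature-guided matching and
    a closed-form (Umeyama-style) pose update."""
    Fr = F_ref / np.linalg.norm(F_ref, axis=1, keepdims=True)
    Ft = F_tgt / np.linalg.norm(F_tgt, axis=1, keepdims=True)
    feat_d = 1.0 - Fr @ Ft.T                    # (Nr, Nt) 3D-feature distance
    for _ in range(iters):
        W = (s * (R @ P_ref.T)).T + t           # warp reference into target space
        geo_d = np.linalg.norm(W[:, None] - P_tgt[None], axis=2)
        match = (geo_d + alpha * feat_d).argmin(axis=1)   # dense registration
        # Closed-form similarity update against the matched target points.
        Q = P_tgt[match]
        mu_r, mu_q = P_ref.mean(0), Q.mean(0)
        Xr, Xq = P_ref - mu_r, Q - mu_q
        U, S, Vt = np.linalg.svd(Xq.T @ Xr)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
        R = U @ D @ Vt
        s = (S * np.diag(D)).sum() / (Xr ** 2).sum()
        t = mu_q - s * R @ mu_r
    return s, R, t
```

The feature term is what distinguishes this loop from plain ICP: when two intra-category shapes differ, purely geometric nearest neighbors produce ambiguous matches, while semantically similar 3D features pull corresponding parts together.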



Paper and Code

W. Qu, C. Meng, H. Li, J. Cheng, C. Ma, H. Wang, X. Zhou, X. Deng, P. Tan

Universal Features Guided Zero-Shot Category-Level Object Pose Estimation

AAAI, 2025

[Paper]     [Code, Coming Soon]    



Results

Qualitative results on REAL275 and Wild6D. The red box represents the ground truth, and the green box represents the estimation. Previous methods exhibit large errors on unseen categories due to significant texture and shape differences, while our method demonstrates strong generalization to unseen categories with accurate pose estimation.



Acknowledgements

This work was supported in part by the National Science and Technology Major Project (2022ZD0117904), the National Natural Science Foundation of China (62473356, 62373061), the Beijing Natural Science Foundation (L232028), and the CAS Major Project (RCJJ-145-24-14). Heng Li and Ping Tan are supported by project HKPC22EG01-E from the Hong Kong Industrial Artificial Intelligence & Robotics Centre (FLAIR). The website is modified from this template.