Wentian Qu1,2,3, Chenyu Meng1,2, Heng Li3, Jian Cheng1,2, Cuixia Ma1,2, Hongan Wang1,2, Xiao Zhou4, Xiaoming Deng1,2*, Ping Tan3*

1Institute of Software, Chinese Academy of Sciences; 2University of Chinese Academy of Sciences; 3Hong Kong University of Science and Technology; 4Aerospace Information Research Institute, Chinese Academy of Sciences
(a) We propose a zero-shot pose estimation method for unseen categories using universal features and obtain accurate
results in multi-category scenes. Our method is cost-efficient and generalizes better than traditional instance-level
and category-level methods. (b) Correspondences from 2D universal features degrade when there is a large gap between the
current and target poses. (c) The shape gap between intra-category objects causes pose ambiguity during optimization.
Both challenges affect the accuracy of pose estimation.
Object pose estimation, crucial in computer vision and robotics applications, struggles with the diversity of unseen categories. We propose a zero-shot method for category-level 6-DOF object pose estimation that exploits both 2D and 3D universal features of the input RGB-D image to establish semantic similarity-based correspondences, and that extends to unseen categories without additional model fine-tuning. Our method first combines efficient 2D universal features to find sparse correspondences between intra-category objects and obtains an initial coarse pose. Because the correspondences from 2D universal features degrade when the pose deviates significantly from the target pose, we optimize the pose with an iterative strategy. Subsequently, to resolve pose ambiguities caused by shape differences between intra-category objects, the coarse pose is refined by optimizing with a dense alignment constraint on 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.
We exploit multi-modal (both 2D and 3D) universal features to estimate object pose for unseen categories, within a coarse-to-fine framework for accurate 6-DOF pose estimation.

At the coarse stage, the framework identifies sparse correspondences to solve for an initial coarse object pose. Given an input RGB-D image, we use a reference model of the category of interest to render reference images and extract 2D universal features from both the target and rendered reference images. We then calculate the cosine similarity map between the 2D features and use the cyclical distance to select the Top-k correspondences. Combined with the depth map and camera intrinsics, we lift the Top-k keypoints into camera coordinates and compute the transformation from the reference to the target space with a least-squares solution, which yields the initial coarse 6-DOF object pose. Because the feature correspondences from 2D universal features degrade when the initial pose deviates significantly from the target pose, we use an iterative strategy to optimize the correspondences and the coarse pose.

After the coarse pose estimation, we map the reference model into the target image space and perform pose refinement with pixel-wise optimization. To resolve pose ambiguities caused by shape differences between intra-category objects during the optimization, we employ 3D universal features extracted from the point cloud to refine the 6-DOF object pose and the reference model iteratively by dense pixel-level registration.
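The coarse stage described above can be sketched in a few lines of NumPy. The feature/point shapes, helper names, and the use of the closed-form Kabsch solution for the least-squares transform are our illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

def topk_cyclic_matches(feat_tgt, feat_ref, px_tgt, k=50):
    """Pick Top-k target/reference matches from L2-normalized 2D universal
    features (N, C), using cosine similarity and cyclical pixel distance."""
    sim = feat_tgt @ feat_ref.T            # cosine similarity map (N_tgt, N_ref)
    fwd = sim.argmax(axis=1)               # best reference match per target pixel
    bwd = sim.argmax(axis=0)               # best target match per reference pixel
    # Cyclical distance: pixel offset after the round trip target -> ref -> target
    cyc = np.linalg.norm(px_tgt[bwd[fwd]] - px_tgt, axis=1)
    idx = np.argsort(cyc)[:k]              # keep the k most cycle-consistent pixels
    return idx, fwd[idx]

def backproject(uv, depth, K):
    """Lift integer pixel coordinates (N, 2) into camera space using the
    depth map and camera intrinsics K."""
    z = depth[uv[:, 1], uv[:, 0]]
    x = (uv[:, 0] - K[0, 2]) * z / K[0, 0]
    y = (uv[:, 1] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def least_squares_pose(P_ref, P_tgt):
    """Closed-form rigid transform (Kabsch) with P_tgt ~= P_ref @ R.T + t."""
    mu_r, mu_t = P_ref.mean(0), P_tgt.mean(0)
    H = (P_ref - mu_r).T @ (P_tgt - mu_t)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_r
    return R, t
```

In the iterative variant, the reference is re-rendered under the current pose estimate and these three steps are repeated until the Top-k correspondences stabilize.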
Overview. Our framework includes a keypoint-level coarse pose estimation module and a pixel-level pose refinement
module. In the first module, we establish correspondences between image pairs based on the 2D universal features and
calculate the coarse pose with least squares in an iterative manner. In the second module, we use pixel-level optimization
combined with 3D universal features to refine the pose and shape of the reference model and obtain the fine pose.
(a) Pose Refinement. With the coarse pose as
initialization, the reference model is warped into the
target space to obtain the initial mask and extract 3D
universal features. We then optimize the coarse pose and shape
by minimizing the loss function. (b) After the pose refinement
stage, the pose and shape of the reference model are aligned
much more accurately with the target object.
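The iterative refinement in (a) can be illustrated with a feature-guided dense registration sketch. The joint cost weighting, the point/feature shapes, and the closed-form pose update are our assumptions for illustration; the paper minimizes a loss over both pose and shape, whereas this sketch only updates the rigid pose:

```python
import numpy as np

def kabsch(P_ref, P_tgt):
    """Closed-form rigid transform with P_tgt ~= P_ref @ R.T + t."""
    mu_r, mu_t = P_ref.mean(0), P_tgt.mean(0)
    H = (P_ref - mu_r).T @ (P_tgt - mu_t)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_t - R @ mu_r

def refine_pose(P_ref, F_ref, P_tgt, F_tgt, R, t, iters=20, lam=0.1):
    """Dense, feature-guided registration: match every reference point to the
    target point minimizing a joint cost of spatial distance (under the current
    pose) and 3D universal-feature dissimilarity, then re-solve the rigid pose
    in closed form. Repeat until the pose stops changing."""
    for _ in range(iters):
        Q = P_ref @ R.T + t                         # warp reference to target space
        dist = np.linalg.norm(Q[:, None, :] - P_tgt[None, :, :], axis=2)
        cost = lam * dist - F_ref @ F_tgt.T         # geometry + feature terms
        nn = cost.argmin(axis=1)                    # dense correspondences
        R_new, t_new = kabsch(P_ref, P_tgt[nn])
        if np.allclose(R_new, R, atol=1e-7) and np.allclose(t_new, t, atol=1e-7):
            return R_new, t_new
        R, t = R_new, t_new
    return R, t
```

The feature term is what distinguishes this from plain ICP: points with similar 3D universal features attract each other even when the current pose places them far apart, which is how shape-induced pose ambiguity is counteracted.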
W. Qu, C. Meng, H. Li, J. Cheng, C. Ma, H. Wang, X. Zhou, X. Deng, P. Tan
Universal Features Guided Zero-Shot Category-Level Object Pose Estimation. AAAI, 2025. [Paper] [Code, Coming Soon]
Qualitative results on REAL275 and Wild6D. The red box represents the ground truth,
and the green box represents the estimate. Previous methods exhibit large errors
on unseen categories due to significant texture and shape differences, while
our method generalizes well to unseen categories with accurate pose estimation.
Acknowledgements
This work was supported in part by the National Science and Technology Major Project (2022ZD0117904),
the National Natural Science Foundation of China (62473356, 62373061), the Beijing Natural Science Foundation (L232028),
and the CAS Major Project (RCJJ-145-24-14). Heng Li and Ping Tan are supported by project HKPC22EG01-E from the
Hong Kong Industrial Artificial Intelligence & Robotics Centre (FLAIR).
This website is modified from this template.