| Submission name | |
| --- | --- |
| Submission time (UTC) | Nov. 20, 2024, 5:07 p.m. |
| User | morozart |
| Task | Model-based 6D localization of unseen objects |
| Dataset | IC-BIN |
| Description | |
| Evaluation scores | |
| User | morozart |
| --- | --- |
| Publication | Not yet |
| Implementation | |
| Training image modalities | RGB-D |
| Test image modalities | RGB |
| Description | See the details below the table. |
| Computer specifications | NVIDIA A40 (inference), NVIDIA A100 (training) |

**Submitted to:** BOP Challenge 2024

**Training data:** MegaPose-ShapeNet and MegaPose-GSO synthetic datasets

**Onboarding data:**
- Model-based: 42 templates are rendered from the CAD model using BlenderProc (see the rendering sketch below)
- Model-free: 42 templates are rendered from a NeRF trained on a set of multi-view images with known camera poses

**Used 3D models:** Default models for all datasets in the model-based challenge.

**Notes:** The methodology addresses 6D object pose estimation for novel objects in both model-based and model-free scenarios. It assumes that 2D detections are available in the form of bounding boxes and object categories. Patch descriptors are extracted from cropped test images and object templates using the frozen DINOv2 [A] feature extractor. A transformer encoder aggregates the template patch descriptors and applies a 3D positional embedding to generate an enhanced object representation. A transformer decoder then establishes correspondences between template and query-image patch descriptors. Finally, the method refines the correspondences, selects the top-k templates, builds 2D-3D correspondences, and estimates the 6D pose via RANSAC-PnP. Hedged code sketches of these steps follow below.

[A] Maxime Oquab et al.: DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research.
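The onboarding step mentions rendering 42 templates from the CAD model with BlenderProc. Below is a minimal sketch of how such templates could be produced; the model path, resolution, camera distance, lighting, and the golden-spiral viewpoint placement are assumptions for illustration, not details from the submission.

```python
# Hedged sketch (not the submission's code): render templates of a CAD model
# with BlenderProc from viewpoints placed on a sphere around the object.
import blenderproc as bproc  # run via `blenderproc run script.py`
import numpy as np

bproc.init()
objs = bproc.loader.load_obj("model.ply")  # hypothetical CAD model path

light = bproc.types.Light()
light.set_type("POINT")
light.set_location([1.0, 1.0, 1.0])
light.set_energy(200)

bproc.camera.set_resolution(224, 224)  # assumed template size

n = 42  # template count from the submission description
for i in range(n):
    # Golden-spiral sampling: roughly uniform viewpoints on a sphere.
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z * z)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))
    location = 0.5 * np.array([r * np.cos(phi), r * np.sin(phi), z])
    # Camera pose looking at the object centered at the origin.
    rotation = bproc.camera.rotation_from_forward_vec(-location)
    cam2world = bproc.math.build_transformation_mat(location, rotation)
    bproc.camera.add_camera_pose(cam2world)

data = bproc.renderer.render()  # data["colors"]: one RGB template per pose
```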
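Patch descriptors come from a frozen DINOv2 backbone. The sketch below shows one plausible way to obtain them through the public `facebookresearch/dinov2` torch.hub entry point; the 224x224 crop size and the ViT-S/14 variant are assumptions, not submission details.

```python
# Sketch: per-patch descriptors from a frozen DINOv2 backbone.
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def patch_descriptors(crop: Image.Image) -> torch.Tensor:
    """Return patch descriptors for a cropped detection or a template."""
    x = preprocess(crop).unsqueeze(0)       # (1, 3, 224, 224)
    feats = model.forward_features(x)       # dict of normalized token tensors
    return feats["x_norm_patchtokens"]      # (1, 256, 384): 16x16 patch tokens
```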
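The submission's transformer encoder/decoder is not public. Purely for illustration, the following uses PyTorch's stock `nn.TransformerDecoder` to show how query patch tokens could cross-attend to aggregated template tokens and how patch-level matches could be read off a similarity map; the dimensions, layer counts, and argmax matching rule are invented stand-ins, not the submission's architecture.

```python
# Illustration only: cross-attention between query and template patch tokens,
# followed by a naive argmax readout of patch-level correspondences.
import torch
import torch.nn as nn

dim = 384  # matches the DINOv2 ViT-S/14 descriptor size (assumed)

decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

query = torch.randn(1, 256, dim)           # patch descriptors of the query crop
templates = torch.randn(1, 42 * 256, dim)  # aggregated template descriptors

decoded = decoder(query, templates)        # query tokens attend to template tokens

# Similarity between decoded query tokens and template tokens; the argmax
# proposes a candidate template patch for every query patch.
sim = torch.einsum("bqd,btd->bqt", decoded, templates)
matches = sim.argmax(dim=-1)               # shape (1, 256)
```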
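The final step, estimating a 6D pose from 2D-3D correspondences via RANSAC-PnP, maps directly onto OpenCV's `solvePnPRansac`. In the sketch below, the reprojection threshold, iteration count, and EPnP solver flag are assumptions rather than reported settings.

```python
# Sketch of the final pose step: RANSAC-PnP over 2D-3D correspondences.
import numpy as np
import cv2

def estimate_pose(points_3d, points_2d, K):
    """points_3d: (N, 3) model points; points_2d: (N, 2) pixels; K: (3, 3) intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        iterationsCount=1000,
        reprojectionError=3.0,   # inlier threshold in pixels (assumed)
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # axis-angle vector -> 3x3 rotation matrix
    return R, tvec, inliers      # 6D pose as (R, t) plus RANSAC inlier indices
```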