BOP: Benchmark for 6D Object Pose Estimation

Submission: CNOS (SAM)/ITODD/SAM

Submission name

SAM

Submission time (UTC)

Aug. 2, 2023, 12:49 p.m.

User

nvnguyen

Task

Model-based 2D detection of unseen objects

Dataset

ITODD

Description

Evaluation scores

AP:	0.313
AP50:	0.433
AP75:	0.342
AP_large:	0.319
AP_medium:	0.160
AP_small:	-1.000
AR1:	0.194
AR10:	0.496
AR100:	0.500
AR_large:	0.499
AR_medium:	0.377
AR_small:	-1.000
average_time_per_image:	2.136

Method: CNOS (SAM)

User	nvnguyen
Publication	https://arxiv.org/abs/2307.11067
Implementation	https://github.com/nv-nguyen/cnos
Training image modalities	None
Test image modalities	RGB
Description	A simple baseline for unseen object detection/segmentation with Segment Anything (SAM) and DINOv2. This three-stage approach can work for any object without retraining:
Onboarding stage: For each object in the test dataset, we select 42 reference images from the "PBR-BlenderProc4BOP" training images and crop the object from these images using the ground-truth modal 2D bounding box. Then we calculate the CLS-token descriptors of the crops using DINOv2. This process generates a set of reference descriptors of size "num_objects x 42 x C" for the test dataset, where "num_objects" is the number of test objects and "C" is the descriptor size.
Proposal stage: We generate object proposals using SAM (as in the SAM paper). Each proposal is defined by a binary mask and the 2D bounding box of that mask.
Matching stage: We calculate the CLS-token DINOv2 descriptors of the SAM proposals and compare them with the reference descriptors using cosine similarity. For each proposal, this yields a similarity matrix of size "num_objects x 42". We then average the similarity scores over the 42 views to obtain a score of the proposal with respect to each test object. Finally, we assign an object ID to each proposal by selecting the highest score using argmax.
Important: The method predicts modal masks (covering just the visible object parts). The 2D bounding boxes are not explicitly predicted but calculated from the modal masks. The predicted boxes are therefore modal, while the GT boxes used in the BOP evaluation are amodal (covering also the invisible object parts), which yields lower detection scores.
Additional notes: Although CNOS achieved remarkable results, its matching stage can sometimes predict incorrect object IDs, resulting in missed detections in a few images, particularly on the TUD-L dataset. To improve accuracy and reliability, we recommend that users integrate the (num_instances, object_ID) information provided by Task 4 when using CNOS's results as the default detections for the TUD-L dataset.
Computer specifications	V100 16GB
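
The matching stage and the mask-derived boxes described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the CNOS implementation: the function names (`match_proposals`, `mask_to_modal_box`), the array layouts, and the inclusive pixel-coordinate box convention are all hypothetical, and the descriptors are assumed to have been computed beforehand (for example with DINOv2).

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def match_proposals(proposal_desc, reference_desc):
    """Assign an object ID to each proposal (hypothetical helper).

    proposal_desc:  (num_proposals, C) descriptors of the proposals.
    reference_desc: (num_objects, num_views, C) reference descriptors,
                    e.g. num_views = 42 as in the description.
    Returns (object_ids, scores), each of shape (num_proposals,).
    """
    p = l2_normalize(proposal_desc)
    r = l2_normalize(reference_desc)
    # Cosine similarity of every proposal to every reference view:
    # shape (num_proposals, num_objects, num_views).
    sim = np.einsum("pc,ovc->pov", p, r)
    # Average over the views to get one score per (proposal, object) pair.
    per_object = sim.mean(axis=2)
    # The best-scoring object gives the ID; its score is kept as confidence.
    object_ids = per_object.argmax(axis=1)
    scores = per_object.max(axis=1)
    return object_ids, scores


def mask_to_modal_box(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the visible mask pixels.

    Coordinates are inclusive pixel indices (an assumption of this sketch).
    Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Because the box is derived only from visible pixels, it shrinks whenever the object is occluded, which is why such boxes are modal rather than amodal.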