Submission: F3DT2D/ITODD/PBR/SAM(H)

Submission name PBR/SAM(H)
Submission time (UTC) Oct. 16, 2024, 1:13 p.m.
User csm8167
Task Model-based 2D segmentation of unseen objects
Dataset ITODD
Description
Evaluation scores
AP:0.376
AP50:0.584
AP75:0.409
AP_large:0.399
AP_medium:0.214
AP_small:-1.000
AR1:0.178
AR10:0.495
AR100:0.501
AR_large:0.528
AR_medium:0.264
AR_small:-1.000
average_time_per_image:1.033

Method: F3DT2D

User csm8167
Publication Not yet
Implementation
Training image modalities None
Test image modalities RGB
Description

Submitted to: BOP Challenge 2024

Training data: None

Note: Our F3DT2D (from 3D to 2D) consists of four steps: (1) extracting 2D templates from the 3D model, (2) extracting 2D detections, (3) template/feature matching using the vMF distribution, and (4) confidence-aware scoring.

(1) Extracting 2D templates from 3D model:

We follow an approach similar to SAM-6D [1] and CNOS [2], utilizing 42 reference templates for each 3D object. These images were selected from the 'PBR-BlenderProc4BOP' training dataset, and the object crops were obtained using the ground-truth modal 2D bounding boxes.
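
The cropping step can be sketched as follows, assuming the standard BOP 'train_pbr' directory layout, where scene_gt.json lists object instances and scene_gt_info.json provides the modal 'bbox_visib' boxes. Paths are illustrative, and the selection of the 42 viewpoints is omitted:

import json
from pathlib import Path
from PIL import Image

def crop_templates(scene_dir: Path, obj_id: int, out_dir: Path):
    """Crop modal (visible-region) templates for one object from a BOP
    PBR scene, using ground truth from scene_gt.json / scene_gt_info.json."""
    scene_gt = json.loads((scene_dir / "scene_gt.json").read_text())
    gt_info = json.loads((scene_dir / "scene_gt_info.json").read_text())
    out_dir.mkdir(parents=True, exist_ok=True)

    for im_id, gts in scene_gt.items():
        image = Image.open(scene_dir / "rgb" / f"{int(im_id):06d}.jpg")
        for inst_idx, gt in enumerate(gts):
            if gt["obj_id"] != obj_id:
                continue
            # bbox_visib is the modal box: [x, y, width, height]
            x, y, w, h = gt_info[im_id][inst_idx]["bbox_visib"]
            if w <= 0 or h <= 0:
                continue  # instance fully occluded in this view
            crop = image.crop((x, y, x + w, y + h))
            crop.save(out_dir / f"{int(im_id):06d}_{inst_idx}.png")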

(2) Extracting 2D detections:

In the detection-extraction stage, we first obtain object bounding boxes using Grounding DINO [3] with the generic prompt 'objects', and then generate a segmentation mask for each object using Segment Anything [4], prompted with the detected bounding box.
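
A minimal sketch of this two-stage pipeline with the public Grounding DINO and Segment Anything APIs; the checkpoint paths and thresholds below are placeholders, not the submission's actual settings:

import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Load both models (config/checkpoint paths are placeholders).
gdino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_src, image = load_image("test_image.png")  # numpy RGB + normalized tensor
H, W = image_src.shape[:2]

# Stage 1: class-agnostic boxes from the generic 'objects' prompt.
boxes, logits, phrases = predict(
    model=gdino, image=image, caption="objects",
    box_threshold=0.3, text_threshold=0.25,
)

# Grounding DINO returns normalized (cx, cy, w, h); convert to pixel xyxy.
boxes = boxes * torch.tensor([W, H, W, H])
xyxy = torch.stack([
    boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2,
    boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2,
], dim=1).numpy()

# Stage 2: box-prompted segmentation with SAM, one mask per detection.
predictor.set_image(image_src)
masks = [
    predictor.predict(box=box, multimask_output=False)[0][0]
    for box in xyxy
]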

(3) Template/feature matching using the vMF:

We extract features from the templates and the detections with the DINOv2 feature descriptor [5] and compute matching scores from their similarities. We then employ the von Mises-Fisher (vMF) distribution, based on the assumption that high scores concentrate near the target template view. The vMF distribution lets us assign adaptive weights to individual templates, improving matching accuracy.
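
The submission does not spell out the vMF weighting, so the following is one plausible reading: cosine similarities between the detection descriptor and the 42 template descriptors are re-weighted by a vMF kernel centered on the best-matching template's viewing direction. The concentration parameter kappa and the softmax aggregation are our assumptions:

import torch
import torch.nn.functional as F

# DINOv2 backbone via torch.hub (model name from the public repo).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

@torch.no_grad()
def descriptor(batch: torch.Tensor) -> torch.Tensor:
    """L2-normalized CLS-token descriptors. `batch` is (N, 3, 224, 224),
    ImageNet-normalized; preprocessing is omitted here."""
    return F.normalize(dinov2(batch), dim=-1)

def vmf_matching_score(det_feat, tmpl_feats, view_dirs, kappa=20.0):
    """det_feat: (D,) detection descriptor; tmpl_feats: (42, D) template
    descriptors; view_dirs: (42, 3) unit viewing directions of the
    templates. kappa is an assumed concentration parameter."""
    sims = tmpl_feats @ det_feat                  # cosine similarities
    mu = view_dirs[sims.argmax()]                 # best-matching viewpoint
    # vMF kernel on the view sphere: weight proportional to
    # exp(kappa * <v_i, mu>), so templates near the estimated
    # viewpoint dominate the aggregated score.
    weights = torch.softmax(kappa * (view_dirs @ mu), dim=0)
    return (weights * sims).sum()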

(4) Confidence-aware scoring:

Finally, we update the matching score by incorporating the detection confidence, accounting for the uncertainty of the detection results.
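
The exact fusion rule is not given; a simple multiplicative combination is one natural choice:

def confidence_aware_score(match_score: float, det_conf: float) -> float:
    """Fuse the vMF-weighted matching score with the detector confidence
    (e.g. the Grounding DINO box logit). The multiplicative form is an
    assumption; the submission does not specify the rule."""
    return match_score * det_conf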

[1] Lin, Jiehong, et al. "SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Nguyen, Van Nguyen, et al. "CNOS: A Strong Baseline for CAD-Based Novel Object Segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[3] Liu, Shilong, et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." arXiv preprint arXiv:2303.05499 (2023).
[4] Kirillov, Alexander, et al. "Segment Anything." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[5] Oquab, Maxime, et al. "DINOv2: Learning Robust Visual Features without Supervision." arXiv preprint arXiv:2304.07193 (2023).

Authors: Sungmin Cho (cho_sm@netmarble.com), Sungbum Park (spark0916@netmarble.com), and Insoo Oh (ioh@netmarble.com)

If you have any questions, please feel free to contact us. Thank you!

Computer specifications RTX 4090