Submission name | PBR/SAM(H) |
---|---|
Submission time (UTC) | Oct. 16, 2024, 1:13 p.m. |
User | csm8167 |
Task | Model-based 2D segmentation of unseen objects |
Dataset | ITODD |
Description | |
Evaluation scores | |
User | csm8167 |
---|---|
Publication | Not yet |
Implementation | |
Training image modalities | None |
Test image modalities | RGB |
Description | Submitted to: BOP Challenge 2024. Training data: None. Note: Our F3DT2D (from 3D to 2D) pipeline consists of 4 steps: (1) extracting 2D templates from the 3D model, (2) extracting 2D detections, (3) template/feature matching using the vMF distribution, and (4) confidence-aware scoring. (1) Extracting 2D templates from the 3D model: Following approaches similar to SAM-6D [1] and CNOS [2], we use 42 reference templates for each 3D object. These images are selected from the 'PBR-BlenderProc4BOP' training dataset, and the objects are cropped using the ground-truth modal 2D bounding boxes. (2) Extracting 2D detections: We first obtain object bounding boxes with Grounding DINO [3] using the 'objects' prompt, and then generate a segmentation mask for each object with Segment Anything [4], prompted by the corresponding Grounding DINO bounding box. (3) Template/feature matching using the vMF distribution: We extract features from templates and detections with the DINOv2 [5] feature descriptor and compute similarity scores between them. We then employ the von Mises-Fisher (vMF) distribution, based on the assumption that high scores concentrate near the target template view; the vMF distribution assigns adaptive weights to individual templates, improving matching accuracy. (4) Confidence-aware scoring: Finally, we update the matching score by incorporating the detection confidence, accounting for the uncertainty of the detection results. (Illustrative code sketches of steps (2)-(4) are given after the table below.) [1] Lin, Jiehong, et al. "SAM-6D: Segment Anything Model meets zero-shot 6D object pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. [2] Nguyen, Van Nguyen, et al. "CNOS: A strong baseline for CAD-based novel object segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. [3] Liu, Shilong, et al. "Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection." arXiv preprint arXiv:2303.05499 (2023). [4] Kirillov, Alexander, et al. "Segment Anything." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. [5] Oquab, Maxime, et al. "DINOv2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023). Authors: Sungmin Cho (cho_sm@netmarble.com), Sungbum Park (spark0916@netmarble.com), and Insoo Oh (ioh@netmarble.com). If you have any questions, please feel free to contact us. Thank you. |
Computer specifications | RTX 4090 |
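
The sketch below is a rough illustration of step (2), chaining the publicly released Grounding DINO and Segment Anything inference APIs: boxes are obtained with the generic 'objects' prompt and each box then prompts SAM for a mask. The config/checkpoint paths, thresholds, and post-processing are placeholder assumptions, not the authors' actual settings.

```python
import torch
from segment_anything import sam_model_registry, SamPredictor
from groundingdino.util.inference import load_model, load_image, predict

# Placeholder config/checkpoint paths (assumptions, not the submission's files).
gdino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image_source, image = load_image("test_image.png")
H, W = image_source.shape[:2]

# Class-agnostic detection with the generic 'objects' prompt (thresholds are guesses).
boxes, logits, phrases = predict(
    model=gdino, image=image, caption="objects",
    box_threshold=0.3, text_threshold=0.25,
)

# Grounding DINO returns normalised (cx, cy, w, h); convert to pixel XYXY.
boxes = boxes * torch.tensor([W, H, W, H])
xyxy = torch.stack([boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2,
                    boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2], dim=1)

# Prompt SAM with each detected box to obtain a segmentation mask per object.
predictor.set_image(image_source)
masks, confidences = [], []
for box, conf in zip(xyxy.numpy(), logits.numpy()):
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])
    confidences.append(float(conf))  # kept for confidence-aware scoring later
```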
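Steps (3) and (4) can be read, under assumptions, as a vMF-weighted aggregation of DINOv2 cosine similarities followed by modulation with the detection confidence. The minimal sketch below makes those assumptions explicit: the vMF mean is centred on the viewing direction of the best-matching template, the concentration `kappa` is a hypothetical hyperparameter, template viewing directions are assumed to be known, features are assumed to be precomputed DINOv2 embeddings, and the product form of the confidence-aware score is an assumption, since the description does not give the exact formulas.

```python
import torch
import torch.nn.functional as F

def vmf_weighted_score(det_feat, tpl_feats, tpl_views, det_conf, kappa=20.0):
    """Score one detection against the templates of one object (sketch).

    det_feat : (D,)   DINOv2 feature of the detection crop
    tpl_feats: (N, D) DINOv2 features of the N (e.g. 42) templates
    tpl_views: (N, 3) unit viewing directions of the templates (assumed available)
    det_conf : float  detection confidence from the detector
    kappa    : float  assumed vMF concentration hyperparameter
    """
    det_feat = F.normalize(det_feat, dim=0)
    tpl_feats = F.normalize(tpl_feats, dim=1)

    # Cosine similarities between the detection and every template.
    sims = tpl_feats @ det_feat                     # (N,)

    # Assumption: centre the vMF distribution on the best-matching template view,
    # so templates from nearby viewpoints receive higher weights (scores are
    # expected to concentrate around the target view).
    mu = tpl_views[sims.argmax()]                   # (3,)
    log_w = kappa * (tpl_views @ mu)                # unnormalised vMF log-density
    weights = torch.softmax(log_w, dim=0)           # normalised adaptive weights

    matching_score = (weights * sims).sum()

    # Confidence-aware scoring (assumed form): modulate the matching score by the
    # detection confidence to reflect detection uncertainty.
    return matching_score * det_conf
```

In this reading, each detection would be scored against every object's templates and assigned to the highest-scoring object; the exact aggregation used in the submission may differ.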