Submission: SAM6D-BOP2023-CNOSmask/TUD-L

Download submission
Submission name
Submission time (UTC) Sept. 28, 2023, 4:34 a.m.
User jiehonglin
Task 6D localization of unseen objects
Dataset TUD-L
Description
Evaluation scores
AR: 0.794
AR_MSPD: 0.824
AR_MSSD: 0.833
AR_VSD: 0.726
average_time_per_image: 2.530

Method: SAM6D-BOP2023-CNOSmask

User jiehonglin
Publication Not yet.
Implementation PyTorch. We will release the code at https://github.com/JiehongLin/SAM6D.
Training image modalities RGB-D
Test image modalities RGB-D
Description

Submitted to: BOP Challenge 2023

Training data: Provided MegaPose-GSO and MegaPose-ShapeNetCore

Onboarding data: No.

Used 3D models: Default, CAD.

Notes:

  • Authors: Jiehong Lin, Lihua Liu, Dekun Lu and Kui Jia.

  • Used 2D detection/segmentation method: the default method of CNOS-FastSAM (https://bop.felk.cvut.cz/method_info/370/).

  • Method: Our model consists of three main modules.

    • Description Extraction based on ViT [1]. We render 8 image templates for each object CAD model, use a ViT to extract per-pixel descriptions for the templates and the target scene image, and sample 2048 point descriptions each for the object observation and its CAD model (a schematic sketch of this step follows this list).

    • Coarse Matching based on GeoTransformer [2]. For efficiency, we sample M subsets, each consisting of 196 points, for both the object observation and its CAD model. We then employ M parallel Geometric Transformers to search for feature correspondences within each pair of subsets, from which transformation hypotheses are computed. The transformation that best aligns the observation with the CAD model is selected as the initial pose. In this submission, we set M = 4 (a schematic coarse-to-fine sketch follows the references below).

    • Finer Matching based on cross-attention modules. The architecture of this module is similar to that of the coarse one. It employs parallel cross-attention modules to handle N pairs of point subsets for the object observation and its CAD model. In this module, we transform the observed points using the initial pose and use them as positional encodings for the point descriptions, which enables the module to iteratively refine the point correspondences over T iterations. In this submission, we set N = 4 and T = 4.
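
The following is a minimal, hypothetical sketch of the description-extraction step, not the released SAM-6D code: it assumes a DINOv2 ViT backbone from timm, replaces the rendered templates and the cropped scene image with placeholder tensors, and uses illustrative helper names (patch_descriptions, sample_point_descriptions).

```python
# Schematic sketch of the description-extraction step (illustrative only, not the
# released SAM-6D code). Assumes a DINOv2 ViT backbone from timm; rendered templates
# and the cropped scene image are replaced here by placeholder tensors.
import torch
import timm

N_TEMPLATES = 8   # image templates rendered per object CAD model
N_POINTS = 2048   # point descriptions sampled for the observation and for the CAD model

# ViT backbone returning per-patch tokens, which serve as dense (per-pixel) descriptions.
# Pretrained weights are downloaded on first use.
vit = timm.create_model("vit_base_patch14_dinov2.lvd142m", pretrained=True, num_classes=0)
vit.eval()

def patch_descriptions(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, 518, 518) -> patch descriptions (B, num_patches, C)."""
    with torch.no_grad():
        tokens = vit.forward_features(images)  # (B, 1 + num_patches, C), CLS token first
    return tokens[:, 1:, :]                    # keep only the patch tokens

def sample_point_descriptions(desc: torch.Tensor, n: int = N_POINTS) -> torch.Tensor:
    """Sample n point descriptions (with replacement if the pool is smaller than n)."""
    idx = torch.randint(0, desc.shape[1], (n,))
    return desc[:, idx, :]

# Placeholders for 8 rendered templates of one CAD model and one cropped scene image.
templates = torch.rand(N_TEMPLATES, 3, 518, 518)
scene_crop = torch.rand(1, 3, 518, 518)

cad_desc = sample_point_descriptions(patch_descriptions(templates).flatten(0, 1).unsqueeze(0))
obs_desc = sample_point_descriptions(patch_descriptions(scene_crop))
print(cad_desc.shape, obs_desc.shape)  # (1, 2048, C) for both sides
```

In the actual pipeline the per-pixel descriptions are paired with 3D points (back-projected from depth for the observation, taken from the CAD model for the templates); that association is omitted in this sketch.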

[1] Dosovitskiy et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

[2] Qin et al. "GeoTransformer: Fast and Robust Point Cloud Registration With Geometric Transformer." IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
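
Below is a schematic, hypothetical sketch of the coarse-to-fine matching flow, again not the released implementation: the learned GeoTransformer and cross-attention matchers are abstracted as a nearest-neighbour stand-in (nn_match), pose hypotheses are solved in closed form from correspondences, the best-aligning hypothesis becomes the initial pose, and the fine stage simply re-matches under the current pose for T iterations. The N parallel fine subsets and the use of transformed points as positional encodings are collapsed into a single matcher call for brevity, and the direction of the transform (CAD points moved into the observation frame) is a simplification of the description above.

```python
# Schematic sketch of the coarse-to-fine matching described above (illustrative only).
# The learned matchers (GeoTransformer / cross-attention modules) are replaced by a
# nearest-neighbour stand-in; pose hypotheses are solved with a least-squares (Kabsch) fit.
import torch

M = 4              # parallel point subsets in the coarse stage
T_ITERS = 4        # refinement iterations in the fine stage
SUBSET_SIZE = 196  # points per subset

def kabsch(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Least-squares rigid transform (4x4) aligning src -> dst, both (K, 3)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = torch.linalg.svd(src_c.T @ dst_c)
    d = torch.sign(torch.det(Vt.T @ U.T)).item()
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d])) @ U.T
    pose = torch.eye(4)
    pose[:3, :3], pose[:3, 3] = R, dst.mean(0) - R @ src.mean(0)
    return pose

def apply_pose(pts: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    return pts @ pose[:3, :3].T + pose[:3, 3]

def alignment_error(cad: torch.Tensor, obs: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    """Mean nearest-neighbour distance between the posed CAD points and the observation."""
    return torch.cdist(apply_pose(cad, pose), obs).min(dim=1).values.mean()

def nn_match(src: torch.Tensor, dst: torch.Tensor):
    """Stand-in for the learned matcher: nearest-neighbour correspondences by position."""
    return src, dst[torch.cdist(src, dst).argmin(dim=1)]

def coarse_pose(cad_pts, obs_pts, match_fn):
    """Sample M subset pairs, form one pose hypothesis each, keep the best-aligning one."""
    best_pose, best_err = torch.eye(4), float("inf")
    for _ in range(M):
        idx_c = torch.randint(0, cad_pts.shape[0], (SUBSET_SIZE,))
        idx_o = torch.randint(0, obs_pts.shape[0], (SUBSET_SIZE,))
        src, dst = match_fn(cad_pts[idx_c], obs_pts[idx_o])
        hyp = kabsch(src, dst)
        err = alignment_error(cad_pts, obs_pts, hyp)
        if err < best_err:
            best_pose, best_err = hyp, err
    return best_pose

def fine_pose(cad_pts, obs_pts, match_fn, init_pose):
    """Iteratively re-match under the current pose estimate and compose the correction."""
    pose = init_pose
    for _ in range(T_ITERS):
        src, dst = match_fn(apply_pose(cad_pts, pose), obs_pts)
        pose = kabsch(src, dst) @ pose
    return pose

# Toy usage with random point clouds standing in for the 2048 sampled points per side.
cad_pts, obs_pts = torch.rand(2048, 3), torch.rand(2048, 3)
init = coarse_pose(cad_pts, obs_pts, nn_match)
refined = fine_pose(cad_pts, obs_pts, nn_match, init)
print(refined)
```

The hypothesis scoring above uses the mean nearest-neighbour distance of the posed CAD points to the observation only as a placeholder for whatever alignment criterion the method actually applies when selecting the initial pose.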

  • Evaluation: We use a single model for all core datasets. The reported time for each dataset is the average processing time per image, computed with a batch of 16 segmented instances.
Computer specifications GeForce RTX 3090 24 GB; Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz