Submission: SAM6D-BOP2023-CNOSmask/TUD-L

Download submission
Submission name
Submission time (UTC) Sept. 28, 2023, 4:34 a.m.
User jiehonglin
Task 6D localization of unseen objects
Dataset TUD-L
Description
Evaluation scores
AR: 0.794
AR_MSPD: 0.824
AR_MSSD: 0.833
AR_VSD: 0.726
average_time_per_image: 2.530

Method: SAM6D-BOP2023-CNOSmask

User jiehonglin
Publication Not yet.
Implementation PyTorch. We will release the code at https://github.com/JiehongLin/SAM6D.
Training image modalities RGB-D
Test image modalities RGB-D
Description

Submitted to: BOP Challenge 2023

Training data: Provided MegaPose-GSO and MegaPose-ShapeNetCore

Onboarding data: No.

Used 3D models: Default, CAD.

Notes:

  • Authors: Jiehong Lin, Lihua Liu, Dekun Lu and Kui Jia.

  • Used 2D detection/segmentation method: the default method of CNOS-FastSAM (https://bop.felk.cvut.cz/method_info/370/).

  • Method: Our model consists of three main modules.

    • Description Extraction based on ViT [1]. We render 8 image templates for each object CAD model, use a ViT to extract per-pixel descriptions for the templates and the target scene image, and sample 2048 point descriptions each for the object observation and its CAD model (a schematic sketch of this step follows this list).

    • Coarse Matching based on GeoTransformer [2]. For efficiency, we sample M subsets, each consisting of 196 points, for both the object observation and its CAD model. We then employ M parallel Geometric Transformers to search for feature correspondences within each pair of subsets, from which transformation hypotheses are computed. The transformation that best aligns the observation with the CAD model is selected as the initial pose. In this submission, we set M = 4 (a schematic coarse-to-fine sketch follows the references below).

    • Finer Matching based on cross-attention modules. The architecture of this module is similar to that of the coarse one. It employs parallel cross-attention modules to handle N pairs of point subsets for the object observation and its CAD model. In this module, we transform the observed points using the initial pose and use them as positional encodings for the point descriptions, which enables the module to iteratively refine the point correspondences over T iterations. In this submission, we set N = 4 and T = 4.
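
The following is a minimal, hypothetical sketch of the description-extraction step, not the released SAM-6D code: it assumes a DINOv2 ViT backbone from timm, replaces the rendered templates and the cropped scene image with placeholder tensors, and uses illustrative helper names (patch_descriptions, sample_point_descriptions).

```python
# Schematic sketch of the description-extraction step (illustrative only, not the
# released SAM-6D code). Assumes a DINOv2 ViT backbone from timm; rendered templates
# and the cropped scene image are replaced here by placeholder tensors.
import torch
import timm

N_TEMPLATES = 8   # image templates rendered per object CAD model
N_POINTS = 2048   # point descriptions sampled for the observation and for the CAD model

# ViT backbone returning per-patch tokens, which serve as dense (per-pixel) descriptions.
# Pretrained weights are downloaded on first use.
vit = timm.create_model("vit_base_patch14_dinov2.lvd142m", pretrained=True, num_classes=0)
vit.eval()

def patch_descriptions(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, 518, 518) -> patch descriptions (B, num_patches, C)."""
    with torch.no_grad():
        tokens = vit.forward_features(images)  # (B, 1 + num_patches, C), CLS token first
    return tokens[:, 1:, :]                    # keep only the patch tokens

def sample_point_descriptions(desc: torch.Tensor, n: int = N_POINTS) -> torch.Tensor:
    """Sample n point descriptions (with replacement if the pool is smaller than n)."""
    idx = torch.randint(0, desc.shape[1], (n,))
    return desc[:, idx, :]

# Placeholders for 8 rendered templates of one CAD model and one cropped scene image.
templates = torch.rand(N_TEMPLATES, 3, 518, 518)
scene_crop = torch.rand(1, 3, 518, 518)

cad_desc = sample_point_descriptions(patch_descriptions(templates).flatten(0, 1).unsqueeze(0))
obs_desc = sample_point_descriptions(patch_descriptions(scene_crop))
print(cad_desc.shape, obs_desc.shape)  # (1, 2048, C) for both sides
```

In the actual pipeline the per-pixel descriptions are paired with 3D points (back-projected from depth for the observation, taken from the CAD model for the templates); that association is omitted in this sketch.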

[1] Dosovitskiy et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

[2] Qin et al. "GeoTransformer: Fast and Robust Point Cloud Registration With Geometric Transformer." IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
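
Below is a schematic, hypothetical sketch of the coarse-to-fine matching flow, again not the released implementation: the learned GeoTransformer and cross-attention matchers are abstracted as a nearest-neighbour stand-in (nn_match), pose hypotheses are solved in closed form from correspondences, the best-aligning hypothesis becomes the initial pose, and the fine stage simply re-matches under the current pose for T iterations. The N parallel fine subsets and the use of transformed points as positional encodings are collapsed into a single matcher call for brevity, and the direction of the transform (CAD points moved into the observation frame) is a simplification of the description above.

```python
# Schematic sketch of the coarse-to-fine matching described above (illustrative only).
# The learned matchers (GeoTransformer / cross-attention modules) are replaced by a
# nearest-neighbour stand-in; pose hypotheses are solved with a least-squares (Kabsch) fit.
import torch

M = 4              # parallel point subsets in the coarse stage
T_ITERS = 4        # refinement iterations in the fine stage
SUBSET_SIZE = 196  # points per subset

def kabsch(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Least-squares rigid transform (4x4) aligning src -> dst, both (K, 3)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = torch.linalg.svd(src_c.T @ dst_c)
    d = torch.sign(torch.det(Vt.T @ U.T)).item()
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d])) @ U.T
    pose = torch.eye(4)
    pose[:3, :3], pose[:3, 3] = R, dst.mean(0) - R @ src.mean(0)
    return pose

def apply_pose(pts: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    return pts @ pose[:3, :3].T + pose[:3, 3]

def alignment_error(cad: torch.Tensor, obs: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    """Mean nearest-neighbour distance between the posed CAD points and the observation."""
    return torch.cdist(apply_pose(cad, pose), obs).min(dim=1).values.mean()

def nn_match(src: torch.Tensor, dst: torch.Tensor):
    """Stand-in for the learned matcher: nearest-neighbour correspondences by position."""
    return src, dst[torch.cdist(src, dst).argmin(dim=1)]

def coarse_pose(cad_pts, obs_pts, match_fn):
    """Sample M subset pairs, form one pose hypothesis each, keep the best-aligning one."""
    best_pose, best_err = torch.eye(4), float("inf")
    for _ in range(M):
        idx_c = torch.randint(0, cad_pts.shape[0], (SUBSET_SIZE,))
        idx_o = torch.randint(0, obs_pts.shape[0], (SUBSET_SIZE,))
        src, dst = match_fn(cad_pts[idx_c], obs_pts[idx_o])
        hyp = kabsch(src, dst)
        err = alignment_error(cad_pts, obs_pts, hyp)
        if err < best_err:
            best_pose, best_err = hyp, err
    return best_pose

def fine_pose(cad_pts, obs_pts, match_fn, init_pose):
    """Iteratively re-match under the current pose estimate and compose the correction."""
    pose = init_pose
    for _ in range(T_ITERS):
        src, dst = match_fn(apply_pose(cad_pts, pose), obs_pts)
        pose = kabsch(src, dst) @ pose
    return pose

# Toy usage with random point clouds standing in for the 2048 sampled points per side.
cad_pts, obs_pts = torch.rand(2048, 3), torch.rand(2048, 3)
init = coarse_pose(cad_pts, obs_pts, nn_match)
refined = fine_pose(cad_pts, obs_pts, nn_match, init)
print(refined)
```

The hypothesis scoring above uses the mean nearest-neighbour distance of the posed CAD points to the observation only as a placeholder for whatever alignment criterion the method actually applies when selecting the initial pose.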

  • Evaluation: We use a single model for all core datasets. The reported time for each dataset is the average processing time per image, computed with a batch of 16 segmented instances.
Computer specifications GeForce RTX 3090 24 GB; Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz