Submission: mask matcher/IPD/Flanders Make

Submission name Flanders Make
Submission time (UTC) Oct. 1, 2025, 11:59 p.m.
User luxmanramamoorthy
Task Model-based 6D detection of unseen objects
Dataset IPD
Description
Evaluation scores
AP:0.496
AP_25:0.364
AP_25_mm:0.082
AP_MSPD:0.628
AP_MSSD:0.364
AP_MSSD_mm:0.082
average_time_per_image:41.170

Method: mask matcher

User luxmanramamoorthy
Publication
Implementation python
Training image modalities RGB
Test image modalities RGB-D
Description

Methodology: Instance Segmentation + Feature Matching for 6D Object Pose Estimation

This section describes our pipeline for 6D object pose estimation in the context of the BOP Challenge and the BOP IPD dataset.

Our approach combines instance segmentation, feature matching, and 3D correspondence mapping to estimate the full 6D pose of objects in unseen scenes.


1. Overview

The core idea of our method is to leverage instance segmentation and robust feature matching to find correspondences between detected objects in a scene and reference objects with known poses.
Using depth information, we map these correspondences into 3D space, enabling the estimation of the object's pose via rigid transformation alignment.

The pipeline consists of the following stages:

  1. Instance Segmentation — detect and isolate object instances.
  2. Mask Extraction — extract individual object masks for feature matching.
  3. Feature Matching — match detected masks with reference masks using SuperGlue.
  4. Pixel-to-3D Mapping — convert pixel correspondences into 3D points.
  5. Pose Transfer — compute the full 6D pose from 3D correspondences.

2. Detailed Pipeline

2.1 Instance Segmentation

We train a Mask R-CNN model on the dataset to perform instance segmentation.
Given an RGB test image, Mask R-CNN outputs for each detected object:

  • Bounding box
  • Class label
  • Polygon mask

Output: segmented polygon masks for each detected object.


2.2 Mask Extraction

From the segmentation results, we extract individual object masks.
Each mask isolates the object region from the background, allowing targeted feature extraction and matching.
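The mask extraction step can be sketched as follows. This is a minimal illustration, assuming the polygon mask has already been rasterized into a boolean array of the same height and width as the image; the function name `extract_masked_crop` is hypothetical and not part of the submission.

```python
import numpy as np

def extract_masked_crop(rgb, mask):
    """Crop the object region and zero out background pixels.

    rgb:  (H, W, 3) image array.
    mask: (H, W) boolean array, True on object pixels (assumed already rasterized).
    Returns the crop and the (x0, y0) offset of the crop in the full image.
    """
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    # Tight bounding box around the mask.
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = rgb[y0:y1, x0:x1].copy()
    # Suppress background inside the bounding box.
    crop[~mask[y0:y1, x0:x1]] = 0
    return crop, (int(x0), int(y0))
```

Returning the crop offset keeps it possible to map keypoints found in the crop back to full-image pixel coordinates, which the depth lookup in a later stage needs.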


2.3 Feature Matching

For each extracted mask:

  1. Extract the corresponding pixel region from the RGB image.
  2. Perform point-to-point feature matching between the detected mask and reference masks using SuperGlue.
    SuperGlue provides robust correspondences under varying viewpoints and occlusion conditions.
  3. The result is a set of matched pixel pairs between the detected object and reference object masks.
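Turning the matcher output into pixel pairs could look like the sketch below. It assumes the matcher returns, for each test keypoint, the index of its reference keypoint or -1 when unmatched (the convention used by the public SuperGlue implementation), and that keypoints are given in crop coordinates. The function `matched_pixel_pairs` and its offset parameters are hypothetical; the matcher itself is not called here.

```python
import numpy as np

def matched_pixel_pairs(kpts_test, kpts_ref, matches,
                        offset_test=(0, 0), offset_ref=(0, 0)):
    """Convert a match index array into paired full-image pixel coordinates.

    kpts_test: (N, 2) keypoints (x, y) in the test crop.
    kpts_ref:  (M, 2) keypoints (x, y) in the reference crop.
    matches:   (N,) integer array; matches[i] indexes kpts_ref, or is -1.
    offset_*:  (x0, y0) position of each crop in its full image.
    """
    kpts_test = np.asarray(kpts_test, dtype=float)
    kpts_ref = np.asarray(kpts_ref, dtype=float)
    matches = np.asarray(matches, dtype=int)
    valid = matches >= 0  # drop unmatched keypoints
    px_test = kpts_test[valid] + np.asarray(offset_test, dtype=float)
    px_ref = kpts_ref[matches[valid]] + np.asarray(offset_ref, dtype=float)
    return px_test, px_ref
```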

2.4 Pixel-to-3D Mapping

We use the depth image to map matched pixel locations into 3D coordinates in the camera coordinate frame.

\[ \mathbf{p} = d(u, v) \, K^{-1} (u, v, 1)^\top \]

Where:

  • \(K\) is the camera intrinsic matrix
  • \((u, v)\) are pixel coordinates
  • \(d(u, v)\) is the depth value at that pixel

Result: a set of 3D point correspondences:

\[ \{ (\mathbf{P}_\text{test}^i, \mathbf{P}_\text{ref}^i) \} \]
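The back-projection above can be sketched in NumPy as follows. This is an illustration under stated assumptions: the depth image is already converted to the desired metric unit, all pixels lie inside the image, and a depth of zero marks a missing measurement. The function name `backproject` is hypothetical.

```python
import numpy as np

def backproject(pixels, depth, K):
    """Map pixel coordinates to 3D points in the camera frame.

    pixels: (N, 2) array of (u, v) pixel coordinates.
    depth:  (H, W) depth image (assumed already scaled to metric units).
    K:      (3, 3) camera intrinsic matrix.
    Returns (N, 3) points and an (N,) boolean array flagging valid depth.
    """
    pixels = np.asarray(pixels, dtype=float)
    K = np.asarray(K, dtype=float)
    # Nearest-pixel depth lookup; rows are v, columns are u.
    u = np.rint(pixels[:, 0]).astype(int)
    v = np.rint(pixels[:, 1]).astype(int)
    d = depth[v, u].astype(float)
    valid = d > 0  # assumption: zero depth means no measurement
    # Homogeneous pixel coordinates (u, v, 1), one per row.
    uv1 = np.column_stack([pixels, np.ones(len(pixels))])
    rays = uv1 @ np.linalg.inv(K).T  # row-wise K^{-1} (u, v, 1)^T
    points = rays * d[:, None]       # scale each ray by its depth
    return points, valid
```

Correspondences whose test or reference pixel has no valid depth would be discarded before pose estimation.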


2.5 Pose Transfer

Given:

  • Known 6D pose of the reference mask \((\mathbf{R}_\text{ref}, \mathbf{t}_\text{ref})\)
  • Matched 3D point correspondences

We compute the rigid transformation \((\mathbf{R}, \mathbf{t})\) that aligns the reference object to the detected object:

\[ \mathbf{R}, \mathbf{t} = \arg\min_{\mathbf{R}, \mathbf{t}} \sum_i \| \mathbf{P}_\text{test}^i - (\mathbf{R} \mathbf{P}_\text{ref}^i + \mathbf{t}) \|^2 \]

This least-squares problem is typically solved using the Kabsch algorithm.

Output:
Estimated 6D pose \((\mathbf{R}, \mathbf{t})\) of the detected object.
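A minimal NumPy sketch of the Kabsch solution to the least-squares problem above is given below. The helper `compose_pose` reflects one plausible reading of how the known reference pose enters, assuming the reference 3D points are expressed in the reference camera frame; the write-up does not spell this step out, so treat it as an assumption. Both function names are hypothetical.

```python
import numpy as np

def kabsch(P_ref, P_test):
    """Least-squares rigid transform (R, t) with P_test ≈ R @ P_ref + t.

    P_ref, P_test: (N, 3) arrays of corresponding 3D points, N >= 3.
    """
    P_ref = np.asarray(P_ref, dtype=float)
    P_test = np.asarray(P_test, dtype=float)
    c_ref = P_ref.mean(axis=0)
    c_test = P_test.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (P_ref - c_ref).T @ (P_test - c_test)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_test - R @ c_ref
    return R, t

def compose_pose(R, t, R_ref, t_ref):
    """Assumed pose transfer: apply (R, t) on top of the reference pose."""
    return R @ R_ref, R @ np.asarray(t_ref, dtype=float) + t
```

With noisy depth or outlier matches, a plain least-squares fit like this is sensitive to bad correspondences, which ties in with the limitations listed later in the description.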


3. Pipeline Diagram

The overall pipeline can be summarized as:

Input RGB Image + Depth Image
          │
          ▼
┌────────────────────┐
│ 1. Instance        │
│    Segmentation    │
│    (Mask R-CNN)    │
└────────────────────┘
          │
          ▼
┌────────────────────┐
│ 2. Mask Extraction │
└────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────┐
│ 3. Feature Matching                         │
│    - Use SuperGlue to match mask features   │
│      between detected object and reference  │
│      masks with known poses                 │
└─────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────┐
│ 4. Pixel-to-3D Mapping                      │
│    - Map matched pixels to 3D points using  │
│      depth images and camera intrinsics     │
└─────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────┐
│ 5. Pose Transfer                            │
│    - Use 3D point correspondences and known │
│      reference pose to compute the full 6D  │
│      pose of the detected object            │
└─────────────────────────────────────────────┘
          │
          ▼
Output: Estimated 6D pose (Rotation + Translation)

4. Advantages of the Approach

  • Generalization: Can handle unseen objects by leveraging geometry rather than direct regression.
  • Robustness: SuperGlue provides high-quality feature correspondences even under occlusion and viewpoint changes.
  • Integration: Combines segmentation, feature matching, and depth mapping into a unified pipeline.
  • Reusability: Known object poses can be reused without retraining for new detections.

5. Limitations and Future Work

  • Segmentation Quality: Errors in mask generation propagate to pose estimation.
  • Computational Cost: SuperGlue and depth mapping add processing overhead.
  • Depth Noise Sensitivity: Depth sensor noise can affect accuracy of 3D correspondences.

Future directions include:

  • Incorporating pose refinement networks to improve accuracy.
  • Using multi-view fusion for improved robustness.
  • Adding uncertainty estimation to quantify pose confidence.


6. Summary

Our approach introduces a modular pipeline for 6D pose estimation by combining instance segmentation, SuperGlue feature matching, and 3D correspondence mapping.
This method is particularly suited to datasets such as the BOP IPD dataset where generalization to unseen objects and robustness in industrial environments are key challenges.

Computer specifications Dell Pro Max T2 | Ubuntu 24.04 LTS

  • Base: Dell Pro Max Tower T2 (FCT2250) CTO Base
  • Processor: Intel Core Ultra 7 265K (30 MB cache, 20 cores, 20 threads, 3.3 GHz to 5.5 GHz, 125 W)
  • Operating system language pack: No Factory Install Language
  • Systems management software: Intel vPro Disabled
  • Intel performance/capabilities: iRST not selected
  • Chassis options: Dell Pro Max Tower T2 with 1500 W (80 Plus Platinum) PSU
  • Memory: 64 GB (2 x 32 GB), DDR5, 5600 MT/s, non-ECC
  • Graphics card: NVIDIA RTX 4000 ADA, 20 GB GDDR6, 4 DP
  • Graphics holder: T2 Graphic Card Holder
  • Thermal cooling: No Fans Included
  • Thermal cooling: Premium CPU Air Cooler
  • Storage device bay 9: 2TB SSD TLC with DRAM