Submission: 3PT-Pose-Industrial (f.k.a. IPT)/IPD

Submission name 3PT-Pose-Industrial (f.k.a. IPT)/IPD
Submission time (UTC) Oct. 1, 2025, 8:56 p.m.
User IPT
Task Model-based 6D detection of unseen objects
Dataset IPD
Description
Evaluation scores
AP: 0.923
AP_25: 0.914
AP_25_mm: 0.822
AP_MSPD: 0.933
AP_MSSD: 0.914
AP_MSSD_mm: 0.822
average_time_per_image: 160.481

Method: 3PT-Pose-Industrial (f.k.a. IPT)

User IPT
Publication Anonymous
Implementation PyTorch
Training image modalities RGB
Test image modalities RGB
Description

IPT-Pose-Industrial: A Two-Stage Transformer for Pose Estimation

We are submitting IPT-Pose-Industrial to the BOP Challenge 2025.

This is a two-stage foundation model composed of an object detector (IPT-Detection) and a pose refinement network (IPT-Pose). IPT is a one-shot, image- and CAD-prompted object detection network. It employs a vision transformer backbone to simultaneously regress 2D bounding boxes, coarse object orientations, and object classifications. Initial poses are then estimated from IPT's outputs and passed to the pose refinement network, which uses point-to-point correspondences across multiple views to refine the pose.
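At a high level, inference proceeds as sketched below. This is only an illustrative outline of the two-stage flow described above; the module interfaces (`detector`, `refiner`, the `coarse_pose` field) are hypothetical placeholders, not the actual IPT API.

```python
import torch

@torch.no_grad()
def estimate_poses(views, cad_model, templates, detector, refiner):
    """Illustrative two-stage flow; all interfaces are hypothetical placeholders."""
    # Stage 1: image- and CAD-prompted detection on a reference view.
    # Each detection carries a 2D box, a coarse orientation, and a class score.
    detections = detector(views[0], templates=templates)

    # Lift boxes and coarse orientations into initial 6D poses (one per instance).
    initial_poses = [det["coarse_pose"] for det in detections]

    # Stage 2: refine each initial pose using point-to-point correspondences
    # gathered across all available views.
    return [refiner(views, cad_model, pose) for pose in initial_poses]
```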

Dataset and Training Strategy

Our model is trained exclusively on large-scale synthetic data, generated by rendering scenes in Blender with a diverse set of over 100,000 unique CAD models gathered from public CAD model collections and other sources.

The network is trained on over 500,000 synthetically rendered images to ensure robustness and generalization across a wide range of object instances and environmental conditions.
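Only the use of Blender for rendering is stated; as a loose illustration, a minimal rendering loop with the BlenderProc toolkit (our assumption, not necessarily the authors' pipeline; paths and parameters are placeholders) might look like this:

```python
import blenderproc as bproc  # assumption: BlenderProc as the Blender wrapper
import numpy as np

bproc.init()

# Load one CAD model (placeholder path) and add a simple point light.
objs = bproc.loader.load_obj("cad_models/part_000001.obj")
light = bproc.types.Light()
light.set_type("POINT")
light.set_location([2.0, -2.0, 2.0])
light.set_energy(300)

# Sample a few camera poses that look at the object from random positions.
bproc.camera.set_resolution(640, 480)
for _ in range(5):
    location = np.random.uniform([-1.0, -1.0, 0.5], [1.0, 1.0, 1.5])
    rotation = bproc.camera.rotation_from_forward_vec(-location)  # look at the origin
    bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(location, rotation))

# Render RGB frames and write them out for training.
data = bproc.renderer.render()
bproc.writer.write_hdf5("output/", data)
```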

Onboarding Procedure (less than 5 minutes per object)

For each new CAD model, a set of reference templates is generated, showing the model in various canonical orientations. These templates serve as references for IPT during inference.
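The exact template set is not specified; one common way to obtain such canonical orientations is to place virtual cameras roughly uniformly on a sphere around the object, as in the sketch below (`render_template` is a hypothetical rendering call, and the viewpoint count and radius are placeholders).

```python
import numpy as np

def fibonacci_viewpoints(n_views: int, radius: float) -> np.ndarray:
    """Roughly uniform camera positions on a sphere centred on the object."""
    i = np.arange(n_views)
    z = 1.0 - 2.0 * (i + 0.5) / n_views             # uniform in height
    phi = np.pi * (3.0 - 5.0 ** 0.5) * i            # golden-angle steps in azimuth
    r = np.sqrt(1.0 - z ** 2)
    return radius * np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def look_at_rotation(cam_pos: np.ndarray) -> np.ndarray:
    """World-to-camera rotation for a camera at cam_pos looking at the origin
    (OpenCV convention: +z forward, +y down in the image)."""
    forward = -cam_pos / np.linalg.norm(cam_pos)
    up = np.array([0.0, 0.0, 1.0])
    if abs(forward @ up) > 0.999:                   # viewpoint at a pole: pick another up axis
        up = np.array([0.0, 1.0, 0.0])
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    return np.stack([right, down, forward], axis=0)

def build_templates(cad_model, n_views: int = 42, radius: float = 0.5):
    """Render one reference template per canonical viewpoint (hypothetical renderer)."""
    templates = []
    for cam_pos in fibonacci_viewpoints(n_views, radius):
        R = look_at_rotation(cam_pos)
        t = -R @ cam_pos                            # object origin in camera coordinates
        templates.append(render_template(cad_model, R, t))  # hypothetical call
    return templates
```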

Specifics

We use 4 views, except on IPD, where we use only 3. We use only RGB from each view, so each sensor is treated as RGB-only. Pose accuracy comes from pixel-accurate, multi-view pose refinement.
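The refinement solver is not described in detail; a common closed-form way to fuse point-to-point correspondences from several calibrated views (a sketch of the general technique, not necessarily the authors' implementation) is to express all matched scene points in a shared world frame and estimate the rigid transform with the Kabsch/SVD solution:

```python
import numpy as np

def kabsch(model_pts: np.ndarray, scene_pts: np.ndarray):
    """Closed-form least-squares rigid transform (R, t) with scene ≈ R @ model + t."""
    mu_m, mu_s = model_pts.mean(axis=0), scene_pts.mean(axis=0)
    H = (model_pts - mu_m).T @ (scene_pts - mu_s)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_s - R @ mu_m

def refine_pose_multiview(correspondences):
    """correspondences: per view, a pair (model_pts Nx3, scene_pts Nx3), where the
    scene points have already been lifted to a shared world frame using each view's
    known camera extrinsics (e.g. triangulated from pixel-level matches across views)."""
    model = np.concatenate([m for m, _ in correspondences], axis=0)
    scene = np.concatenate([s for _, s in correspondences], axis=0)
    return kabsch(model, scene)
```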

Authors: Anonymous (temporarily withheld)

Computer specifications V100