Method: 3PT-Detection (f.k.a IPT)

User IPT
Publication Anonymous
Implementation PyTorch
Views Single
Test image modalities RGB
Description

IPT-Detection: A Pretrained Transformer for CAD-Prompted Detection.

We are submitting IPT-Detection to the BOP Challenge 2025.

This foundation model is designed for one-shot, image- and CAD-prompted object detection. It uses a vision transformer backbone to jointly regress 2D bounding boxes, object classes, and a coarse orientation estimate from a single RGB image.
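As a rough illustration of the three outputs listed above, one detection could be held in a small record like the one below. This is only a sketch: the class name, the field names, the pixel box convention, and the quaternion encoding of the coarse orientation are all assumptions made for illustration and are not taken from the submission.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Detection:
    """One detected instance. All names here are hypothetical."""
    # 2D box in pixels as (x_min, y_min, x_max, y_max); convention assumed.
    bbox_xyxy: Tuple[float, float, float, float]
    # Identifier of the prompted object this detection was matched to.
    object_id: int
    # Coarse orientation, assumed here to be a unit quaternion (w, x, y, z).
    coarse_rotation_wxyz: Tuple[float, float, float, float]

    def area(self) -> float:
        """Box area in square pixels (0 for degenerate boxes)."""
        x0, y0, x1, y1 = self.bbox_xyxy
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)


example = Detection(
    bbox_xyxy=(10.0, 20.0, 110.0, 220.0),
    object_id=5,
    coarse_rotation_wxyz=(1.0, 0.0, 0.0, 0.0),
)
```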

Dataset and Training Strategy

Our model is trained exclusively on large-scale synthetic data. Scenes are rendered in Blender using a diverse collection of over 100,000 unique CAD models drawn from public CAD model collections and other sources.

The training set comprises over 500,000 synthetically rendered images, intended to promote robustness and generalization across a wide range of object instances and environmental conditions.

Onboarding Procedure (less than 5 minutes per object)

For each new CAD model, a set of reference templates is rendered, showing the model in various canonical orientations. These templates are embedded into tokens that are then used for matching. The process takes 7-15 s per CAD model. No training or fine-tuning is required.
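The description does not say how the template tokens are compared against image features. As a minimal sketch, assuming each template view is reduced to a single embedding vector and that cosine similarity is the matching score (both are assumptions, and the helper names are hypothetical), matching could look like this:

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def best_template(query_token, template_tokens):
    """Return (index, cosine similarity) of the template closest to the query."""
    q = normalize(query_token)
    scores = [sum(a * b for a, b in zip(q, normalize(t))) for t in template_tokens]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


# Toy 3-D tokens standing in for embedded template views (made-up values).
templates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.1, 0.9, 0.2]
index, score = best_template(query, templates)
```

Because the tokens are computed once per object during onboarding, a scheme of this kind needs no training or fine-tuning when a new CAD model is added.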

For the CAD-model-free approach, we used real masked photographs of the object in place of the rendered templates.

Note on runtime:

The foundation model was designed for industrial applications in which a single CAD model is searched for within a fixed depth range. Since we cannot make those assumptions in this challenge, we ran the model at 3 different depth ranges per CAD model. So for a dataset with 40 potential CAD models, we had to run the model ~120 times per scene. A single forward pass takes only 0.23-0.9 s on a V100, depending on image resolution and depth range.
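The arithmetic behind these figures can be written out as follows. The sketch assumes the forward passes run one after another with no batching and no other overhead, which the description does not state; the function names are hypothetical.

```python
def passes_per_scene(num_cad_models, depth_ranges=3):
    """Forward passes needed when every CAD model is tried at every depth range."""
    return num_cad_models * depth_ranges


def scene_time_bounds(num_cad_models, depth_ranges=3, t_min=0.23, t_max=0.9):
    """Lower and upper bound on per-scene time in seconds, assuming sequential passes."""
    n = passes_per_scene(num_cad_models, depth_ranges)
    return n * t_min, n * t_max


n = passes_per_scene(40)          # 40 CAD models x 3 depth ranges
lo, hi = scene_time_bounds(40)    # bounds from the 0.23-0.9 s per-pass range
print(f"{n} passes, roughly {lo:.1f}-{hi:.1f} s per scene")
```

Under those assumptions, a 40-object dataset costs 120 passes and roughly 27.6-108 s per scene, compared with a single pass of 0.23-0.9 s in the single-object, fixed-depth setting the model was designed for.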

Authors: Temporary Anonymity

Computer specifications V100

Public submissions

Date Submission name Dataset
2025-10-01 20:23 - YCB-V
2025-10-01 20:25 - IC-BIN
2025-10-01 20:27 - TUD-L
2025-10-01 20:28 - T-LESS
2025-10-01 20:28 - LM-O
2025-10-01 20:30 - ITODD
2025-10-01 20:30 - HB
2025-10-01 20:30 - IPD
2025-10-01 20:30 - XYZ-IBD
2025-10-01 20:31 - ITODD-MV
2025-10-01 20:43 - HOT3D
2025-10-01 20:45 - HANDAL
2025-10-01 20:45 - HOPEv2
2025-10-01 20:48 - IPD
2025-10-01 20:48 - ITODD-MV
2025-10-01 20:48 - XYZ-IBD
2025-10-01 21:18 - HOPEv2
2025-10-01 21:49 - HANDAL
2025-10-01 23:26 - HOT3D