Submission: 3PT-Detection (f.k.a. IPT)/HOT3D

Download submission
Submission name
Submission time (UTC) Oct. 1, 2025, 8:43 p.m.
User IPT
Task Model-based 2D detection of unseen objects
Dataset HOT3D
Description
Evaluation scores
AP: 0.481
AP50: 0.583
AP75: 0.560
AP_large: 0.542
AP_medium: 0.411
AP_small: 0.040
AR1: 0.610
AR10: 0.656
AR100: 0.657
AR_large: 0.764
AR_medium: 0.561
AR_small: 0.099
average_time_per_image: 112.256

Method: 3PT-Detection (f.k.a. IPT)

User IPT
Publication Anonymous
Implementation PyTorch
Training image modalities RGB
Test image modalities RGB
Description

IPT-Detection: A Pretrained Transformer for CAD-Prompted Detection.

We are submitting IPT-Detection to the BOP Challenge 2025.

This foundation model is designed for one-shot, image- and CAD-prompted object detection. It employs a vision transformer backbone to simultaneously regress 2D bounding boxes, object classifications, and a coarse orientation estimate from a single RGB image.

Dataset and Training Strategy

Our model is trained exclusively on large-scale synthetic datasets. This data is generated by rendering scenes in Blender, using a diverse collection of over 100,000 unique CAD models gathered from public CAD model collections and other sources.

The network is trained on a substantial dataset comprising over 500,000 synthetically rendered images to ensure robustness and generalization across a wide range of object instances and environmental conditions.

Onboarding Procedure (less than 5 minutes per object)

For each new CAD model, a set of reference templates is rendered, showing the CAD model in various canonical orientations. These templates are embedded into tokens that are then used for matching. The process takes 7-15 s per CAD model. No training or fine-tuning is required.
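The onboarding step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`render_template`, `embed`), the set of canonical views, and the toy embedding are all placeholder assumptions standing in for the real Blender renderer and ViT encoder.

```python
import math

# Hypothetical grid of canonical orientations: 8 yaw steps x 3 pitch levels.
# The real system's view sampling is not specified in the description.
CANONICAL_VIEWS = [(yaw, pitch) for yaw in range(0, 360, 45)
                   for pitch in (-30, 0, 30)]

def render_template(cad_model, yaw, pitch):
    # Stand-in for rendering the CAD model at this orientation (Blender in
    # the real pipeline); here we just record the requested view.
    return {"model": cad_model, "yaw": yaw, "pitch": pitch}

def embed(template):
    # Stand-in for the ViT encoder that turns a rendered view into a token.
    return (math.cos(math.radians(template["yaw"])),
            math.sin(math.radians(template["pitch"])))

def onboard(cad_model):
    # One-time, training-free step: ~7-15 s per CAD model in the real system.
    # The resulting tokens are cached and used for matching at test time.
    return [embed(render_template(cad_model, y, p))
            for y, p in CANONICAL_VIEWS]

tokens = onboard("obj_000001")
print(len(tokens))  # 8 yaws x 3 pitches = 24 reference tokens
```

The key property, reflected in the sketch, is that onboarding is pure feed-forward rendering and embedding, with no gradient updates per object.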

For the CAD-model-free approach, we use real, masked photographs of the object instead of the rendered templates.

Note on runtime:

The foundation model was designed for industrial applications where a single CAD model is searched for within a fixed depth range. Since we cannot make those assumptions in this challenge, we run the model at 3 different depth ranges per CAD model. So if a dataset has 40 candidate CAD models, we end up running the model ~120 times per scene. The inference time of a single forward pass is only 0.23-0.9 s on a V100, depending on image resolution and depth range.
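The arithmetic behind the runtime note is worth making explicit, since it explains the ~112 s average time per image reported above. The per-pass timings are taken from the text; the totals below are a back-of-the-envelope estimate, not a measured breakdown.

```python
# Without the single-object / fixed-depth-range assumption, the model runs
# once per (CAD model, depth range) pair.
num_cad_models = 40      # example dataset size from the text
depth_ranges = 3         # depth sweeps per CAD model
passes_per_scene = num_cad_models * depth_ranges
print(passes_per_scene)  # -> 120 forward passes per scene

# With 0.23-0.9 s per forward pass on a V100, total scene time spans roughly:
low, high = 0.23, 0.9
print(passes_per_scene * low, passes_per_scene * high)  # ~27.6 s to ~108 s
```

The upper end of this range is consistent with the reported average of 112.256 s per image.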

Authors: Temporary Anonymity

Computer specifications V100