Submission: 3PT-Detection (f.k.a. IPT)/HOT3D

Download submission
Submission name
Submission time (UTC) Oct. 1, 2025, 8:43 p.m.
User IPT
Task Model-based 2D detection of unseen objects
Dataset HOT3D
Description
Evaluation scores
AP: 0.481
AP50: 0.583
AP75: 0.560
AP_large: 0.542
AP_medium: 0.411
AP_small: 0.040
AR1: 0.610
AR10: 0.656
AR100: 0.657
AR_large: 0.764
AR_medium: 0.561
AR_small: 0.099
average_time_per_image: 112.256

Method: 3PT-Detection (f.k.a. IPT)

User IPT
Publication Anonymous
Implementation PyTorch
Training image modalities RGB
Test image modalities RGB
Description

IPT-Detection: A Pretrained Transformer for CAD-Prompted Detection.

We are submitting IPT-Detection to the BOP Challenge 2025.

This foundation model is designed for one-shot, image- and CAD-prompted object detection. It employs a vision transformer backbone to simultaneously regress 2D bounding boxes, object classifications, and a coarse orientation estimate from a single RGB image.

Dataset and Training Strategy

Our model is trained exclusively on large-scale synthetic datasets. This data is generated by rendering scenes in Blender, using a diverse collection of over 100,000 unique CAD models gathered from public CAD model collections and other sources.

The network is trained on a substantial dataset comprising over 500,000 synthetically rendered images to ensure robustness and generalization across a wide range of object instances and environmental conditions.

Onboarding Procedure (less than 5 minutes per object)

For each new CAD model, a set of reference templates is rendered, showing the CAD model in various canonical orientations. These templates are embedded into tokens that are then used for matching. The process takes 7-15 s per CAD model. No training or fine-tuning is required.
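The onboarding step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`render_template`, `embed`), the set of canonical views, and the toy embedding are all placeholder assumptions standing in for the real Blender renderer and ViT encoder.

```python
import math

# Hypothetical grid of canonical orientations: 8 yaw steps x 3 pitch levels.
# The real system's view sampling is not specified in the description.
CANONICAL_VIEWS = [(yaw, pitch) for yaw in range(0, 360, 45)
                   for pitch in (-30, 0, 30)]

def render_template(cad_model, yaw, pitch):
    # Stand-in for rendering the CAD model at this orientation (Blender in
    # the real pipeline); here we just record the requested view.
    return {"model": cad_model, "yaw": yaw, "pitch": pitch}

def embed(template):
    # Stand-in for the ViT encoder that turns a rendered view into a token.
    return (math.cos(math.radians(template["yaw"])),
            math.sin(math.radians(template["pitch"])))

def onboard(cad_model):
    # One-time, training-free step: ~7-15 s per CAD model in the real system.
    # The resulting tokens are cached and used for matching at test time.
    return [embed(render_template(cad_model, y, p))
            for y, p in CANONICAL_VIEWS]

tokens = onboard("obj_000001")
print(len(tokens))  # 8 yaws x 3 pitches = 24 reference tokens
```

The key property, reflected in the sketch, is that onboarding is pure feed-forward rendering and embedding, with no gradient updates per object.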

For the CAD-model-free approach, we use real, masked photographs of the object instead of the rendered templates.

Note on runtime:

The foundation model was designed for industrial applications where a single CAD model is searched for within a fixed depth range. Since we cannot make those assumptions in this challenge, we run the model at 3 different depth ranges per CAD model. So if a dataset has 40 candidate CAD models, we end up running the model ~120 times per scene. The inference time of a single forward pass is only 0.23-0.9 s on a V100, depending on image resolution and depth range.
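The arithmetic behind the runtime note is worth making explicit, since it explains the ~112 s average time per image reported above. The per-pass timings are taken from the text; the totals below are a back-of-the-envelope estimate, not a measured breakdown.

```python
# Without the single-object / fixed-depth-range assumption, the model runs
# once per (CAD model, depth range) pair.
num_cad_models = 40      # example dataset size from the text
depth_ranges = 3         # depth sweeps per CAD model
passes_per_scene = num_cad_models * depth_ranges
print(passes_per_scene)  # -> 120 forward passes per scene

# With 0.23-0.9 s per forward pass on a V100, total scene time spans roughly:
low, high = 0.23, 0.9
print(passes_per_scene * low, passes_per_scene * high)  # ~27.6 s to ~108 s
```

The upper end of this range is consistent with the reported average of 112.256 s per image.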

Authors: Temporary Anonymity

Computer specifications V100