| Submission name | |
|---|---|
| Submission time (UTC) | Oct. 1, 2025, 8:43 p.m. |
| User | IPT |
| Task | Model-based 2D detection of unseen objects |
| Dataset | HOT3D |
| Description | |
| Evaluation scores | |
| User | IPT |
|---|---|
| Publication | Anonymous |
| Implementation | PyTorch |
| Training image modalities | RGB |
| Test image modalities | RGB |
| Description | **IPT-Detection: A Pretrained Transformer for CAD-Prompted Detection.** We are submitting IPT-Detection to the BOP Challenge 2025. This foundation model is designed for one-shot, image- and CAD-prompted object detection. It employs a vision-transformer backbone to simultaneously regress 2D bounding boxes, object classifications, and a coarse orientation estimate from a single RGB image.<br><br>**Dataset and training strategy.** Our model is trained exclusively on large-scale synthetic data generated by rendering scenes in Blender, using a diverse collection of over 100,000 unique CAD models gathered from public CAD model collections and other sources. The network is trained on over 500,000 synthetically rendered images to ensure robustness and generalization across a wide range of object instances and environmental conditions.<br><br>**Onboarding procedure (less than 5 minutes per object).** For each new CAD model, a set of reference templates is rendered, showing the CAD model in various canonical orientations. These templates are embedded into tokens that are then used for matching. The process takes 7-15 s per CAD model; no training or fine-tuning is required. For the CAD-model-free approach, we used real masked photographs of the object instead of the rendered templates.<br><br>**Note on runtime.** The foundation model was designed for industrial applications where a single CAD model is searched for within a fixed depth range. Since we cannot make those assumptions in this challenge, we ran the model at 3 different depth ranges per CAD model. So if a dataset had 40 candidate CAD models, we had to run the model ~120 times per scene. The inference time of a single forward pass is only 0.23-0.9 s on a V100, depending on image resolution and depth range.<br><br>Authors: Temporary Anonymity |
| Computer specifications | V100 |
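The runtime note above can be made concrete with a small back-of-the-envelope sketch. The function names and structure below are illustrative (not from the actual IPT-Detection code); only the numbers — 3 depth ranges, 40 candidate CAD models, 0.23-0.9 s per forward pass — come from the description.

```python
# Hypothetical per-scene inference-cost estimate for the setup described above.
# One forward pass is needed per (CAD model, depth range) combination.

def runs_per_scene(num_cad_models: int, depth_ranges: int = 3) -> int:
    """Number of forward passes required for one scene."""
    return num_cad_models * depth_ranges


def scene_runtime_seconds(num_cad_models: int,
                          seconds_per_pass: float,
                          depth_ranges: int = 3) -> float:
    """Single-GPU wall-clock time, assuming passes run sequentially."""
    return runs_per_scene(num_cad_models, depth_ranges) * seconds_per_pass


# Example from the description: 40 candidate CAD models x 3 depth ranges.
print(runs_per_scene(40))  # 120 forward passes per scene

# At 0.23-0.9 s per pass on a V100, that is roughly 28-108 s per scene.
print(scene_runtime_seconds(40, 0.23))
print(scene_runtime_seconds(40, 0.9))
```

In practice the passes could be batched or parallelized across GPUs, so this sequential estimate is an upper bound on per-scene latency for a single V100.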