Preprint available on arXiv.
- We train a single model for all objects in a dataset with a ResNet34 backbone.
- Training uses a symmetry-aware loss function.
- We train on every dataset for 250k iterations with the Ranger optimizer, with cosine annealing of the learning rate starting at 85% of training iterations. A batch size of 32 is used. Optimization of the refinement modules starts at 20% of training iterations.
- Augmentations: color jittering, blur, noise, in-plane rotations and background/foreground replacement cropping.
- We use real training images where available, except for TLESS.
- Standard detections from CosyPose are used.
- Time measurements refer to the average time for pose estimation and refinement of all objects in an image. Differences between datasets are due to the different numbers of objects per image. Time measurements exclude the detection stage.
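
The training schedule described above can be illustrated with a minimal sketch. It only uses the numbers stated in the list (250k iterations, annealing from 85%, refinement from 20%). The base learning rate, the decay to zero, the constant learning rate before annealing, and the function names are assumptions made for illustration and are not taken from the actual training code; the Ranger optimizer itself is not reproduced here.

```python
import math

TOTAL_ITERS = 250_000        # stated: 250k training iterations
ANNEAL_START_FRAC = 0.85     # stated: cosine annealing starts at 85%
REFINE_START_FRAC = 0.20     # stated: refinement modules start at 20%
BASE_LR = 1e-3               # ASSUMPTION: the base learning rate is not given
MIN_LR = 0.0                 # ASSUMPTION: annealing decays to zero

ANNEAL_START = round(ANNEAL_START_FRAC * TOTAL_ITERS)   # iteration 212500
REFINE_START = round(REFINE_START_FRAC * TOTAL_ITERS)   # iteration 50000


def learning_rate(it: int) -> float:
    """Constant learning rate (assumed), then cosine annealing until the end."""
    if it < ANNEAL_START:
        return BASE_LR
    # Fraction of the annealing phase completed, clamped to [0, 1].
    progress = min((it - ANNEAL_START) / (TOTAL_ITERS - ANNEAL_START), 1.0)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def refinement_active(it: int) -> bool:
    """Whether the refinement modules are being optimized at iteration `it`."""
    return it >= REFINE_START
```

Under these assumptions, a training loop would query `learning_rate(it)` once per iteration to set the optimizer's learning rate, and would add the refinement modules' loss terms only when `refinement_active(it)` is true.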