Submission: CDPNv2_BOP20 (PBR-only & RGB-only)/LM-O/Zhigang-CDPNv2 (MODE 2, FCOS)

Submission name Zhigang-CDPNv2 (MODE 2, FCOS)
Submission time (UTC) Aug. 19, 2020, 12:15 p.m.
User ZhigangLi
Task Model-based 6D localization of seen objects
Dataset LM-O
Training model type Default
Training image type Synthetic (only PBR images provided for BOP Challenge 2020 were used)
Evaluation scores
AR: 0.624
AR_MSPD: 0.815
AR_MSSD: 0.612
AR_VSD: 0.445
average_time_per_image: 0.163 s

Method: CDPNv2_BOP20 (PBR-only & RGB-only)

User ZhigangLi
Publication CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation
Implementation https://github.com/LZGMatrix/BOP19_CDPN_2019ICCV/tree/bop2020
Training image modalities RGB
Test image modalities RGB
Description

In the PBR-only setting, all models were trained using only the provided PBR synthetic data. For each dataset, we trained a separate CDPN model for each object.

For detection, unlike the BOP19 version of CDPN, we used FCOS with a VoVNet-V2-57-FPN backbone [1]. We trained one detector per dataset. Each detector was trained for 8 epochs with a batch size of 4 on a single GPU, using 4 workers and a learning rate of 1e-3. During training we applied color augmentation similar to AAE [2].
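For a concrete picture of this style of augmentation, here is a minimal NumPy sketch of AAE-like color jitter (random brightness shift, per-channel multiplication, contrast change, occasional blur). The operations and all parameter ranges are illustrative assumptions, not the exact values used in this submission:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def aae_style_color_aug(img, rng=np.random):
    """AAE-style color augmentation on an HxWx3 uint8 image.
    Parameter ranges are illustrative assumptions."""
    img = img.astype(np.float32)

    # Random additive brightness shift, shared across channels.
    img += rng.uniform(-25, 25)

    # Random per-channel multiplication (color jitter).
    img *= rng.uniform(0.8, 1.2, size=(1, 1, 3))

    # Random contrast change around the mean intensity.
    mean = img.mean()
    img = (img - mean) * rng.uniform(0.5, 1.5) + mean

    # Occasional Gaussian blur on the spatial axes only.
    if rng.rand() < 0.2:
        sigma = rng.uniform(0.5, 1.5)
        img = gaussian_filter(img, sigma=(sigma, sigma, 0))

    return np.clip(img, 0, 255).astype(np.uint8)
```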

For pose estimation, the main differences between our CDPNv2 and the BOP19 version of CDPN are the following:

  • Domain Randomization

Besides the color augmentation similar to AAE [2], we also used the truncation domain randomization from [3] to improve the robustness of the system to occlusion.
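One simple way to realize truncation domain randomization is to blank out a random strip of the RoI crop from a random side, so the network sees partially visible objects during training. The sketch below is an assumption about the mechanism, not the released implementation; in the full pipeline the coordinate and mask labels for the blanked pixels would have to be excluded from the loss as well:

```python
import numpy as np

def truncate_crop(crop, max_frac=0.5, p=0.5, rng=np.random):
    """Randomly blank a strip of the RoI crop from one side,
    simulating truncation. `max_frac` and `p` are illustrative."""
    if rng.rand() > p:
        return crop
    h, w = crop.shape[:2]
    side = rng.randint(4)  # 0: top, 1: bottom, 2: left, 3: right
    frac = rng.uniform(0.0, max_frac)
    out = crop.copy()
    if side == 0:
        out[: int(h * frac)] = 0
    elif side == 1:
        out[h - int(h * frac):] = 0
    elif side == 2:
        out[:, : int(w * frac)] = 0
    else:
        out[:, w - int(w * frac):] = 0
    return out
```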

  • Network Architecture

Since the organizers provide high-quality PBR synthetic training data in BOP20, we adopted a deeper 34-layer ResNet as the backbone instead of the 18-layer ResNet used in the BOP19 version of CDPN. We also removed the fancy concat structures of the BOP19 version. The input and output resolutions are 256×256 and 64×64, respectively.
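Shape-wise, this amounts to a ResNet-34 encoder followed by an upsampling head that maps a 256×256 crop to a 64×64 output map. A minimal PyTorch sketch is below; the head layout, channel counts, and the 3+1 output channels (object coordinates plus a confidence/mask channel) are assumptions for illustration, not the released architecture:

```python
import torch
import torch.nn as nn
import torchvision

class CoordRegressor(nn.Module):
    """ResNet-34 encoder + deconv head: 3x256x256 in, 4x64x64 out.
    Layer sizes are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)
        # Keep everything up to the last residual stage (stride 32).
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])

        # Three stride-2 deconvs: 8x8 -> 16x16 -> 32x32 -> 64x64.
        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 4, stride=2,
                                   padding=1, bias=False),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.head = nn.Sequential(up(512, 256), up(256, 256),
                                  up(256, 256), nn.Conv2d(256, 4, 1))

    def forward(self, x):                   # x: (B, 3, 256, 256)
        return self.head(self.encoder(x))   # (B, 4, 64, 64)

out = CoordRegressor()(torch.zeros(1, 3, 256, 256))
assert out.shape == (1, 4, 64, 64)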

  • Training

During training, the initial learning rate was 1e-4 and the batch size was 6. We used RMSProp with alpha = 0.99 and epsilon = 1e-8 to optimize the network. The model was trained for 160 epochs in total, and the learning rate was divided by 10 every 50 epochs.
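These settings map directly onto PyTorch's built-in optimizer and scheduler. A minimal sketch, assuming `model`, `train_loader`, and `compute_loss` exist:

```python
import torch

optimizer = torch.optim.RMSprop(model.parameters(),
                                lr=1e-4, alpha=0.99, eps=1e-8)
# Divide the learning rate by 10 every 50 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=50, gamma=0.1)

for epoch in range(160):
    for batch in train_loader:   # batch size 6
        loss = compute_loss(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```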

  • Other Implementation Details

For example, the coordinate labels were computed by back-projecting the rendered depth, instead of by forward projection with a z-buffer.
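The back-projection itself is straightforward: each foreground pixel is lifted into the camera frame with the intrinsics K and then mapped into the model frame with the ground-truth pose (R, t), so every rendered pixel gets a label exactly consistent with the rendered depth. A minimal NumPy sketch under these conventions:

```python
import numpy as np

def coords_from_depth(depth, K, R, t):
    """Per-pixel 3D model coordinates from a rendered depth map.

    depth: (H, W) rendered depth, same units as t (0 = background)
    K:     (3, 3) camera intrinsics
    R, t:  object pose with X_cam = R @ X_model + t
    Returns an (H, W, 3) coordinate map (zeros on background).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Lift pixels to camera-frame points: X_cam = z * K^{-1} [u, v, 1]^T.
    rays = pix @ np.linalg.inv(K).T
    X_cam = rays * depth[..., None]

    # Move into the model frame: X_model = R^T (X_cam - t).
    X_model = (X_cam - t.reshape(1, 1, 3)) @ R
    return np.where(depth[..., None] > 0, X_model, 0.0)
```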

[1] https://github.com/aim-uofa/AdelaiDet/tree/master/configs/FCOS-Detection/vovnet

[2] https://github.com/DLR-RM/AugmentedAutoencoder

[3] https://arxiv.org/abs/2008.08391

Computer specifications CPU: Intel i7-7700; GPU: GTX 1070; Memory: 16 GB