Submission: CDPNv2_BOP20 (PBR-only & ICP)/YCB-V

Submission time (UTC) Aug. 19, 2020, 8:21 p.m.
User wangg16
Task 6D localization of seen objects
Dataset YCB-V
Training model type Default
Training image type Synthetic (only PBR images provided for BOP Challenge 2020 were used)
Evaluation scores
AR: 0.532
AR_MSPD: 0.483
AR_MSSD: 0.603
AR_VSD: 0.511
Average time per image: 1.034 s

Method: CDPNv2_BOP20 (PBR-only & ICP)

User wangg16
Publication Li et al.: CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation, ICCV 2019
Implementation https://github.com/LZGMatrix/BOP19_CDPN_2019ICCV/tree/bop2020
Training image modalities RGB
Test image modalities RGB-D
Description

In the PBR-only setting, all models are trained using only the provided PBR synthetic data and tested with depth-based ICP refinement (an illustrative sketch follows). For each dataset, we trained one CDPN model per object.
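
As an illustration of the depth/ICP refinement step, the sketch below refines an estimated pose against the observed depth using Open3D's point-to-point ICP. The function name, the back-projection helper, and the 1 cm correspondence threshold are assumptions for illustration; the submission's exact ICP variant and parameters are not specified here.

    import numpy as np
    import open3d as o3d

    def refine_with_icp(est_R, est_t, model_points, depth, K, dist_thresh=0.01):
        """Refine a model-to-camera pose by ICP against the observed depth.

        model_points: (N, 3) object model points; depth: (H, W) observed depth
        (same units as est_t); K: (3, 3) intrinsics. Returns a refined 4x4 pose.
        """
        # back-project the observed depth into a camera-space point cloud
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        valid = depth > 0
        pix = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=-1)
        scene = depth[valid][:, None] * (pix @ np.linalg.inv(K).T)

        source = o3d.geometry.PointCloud()
        source.points = o3d.utility.Vector3dVector(np.asarray(model_points, float))
        target = o3d.geometry.PointCloud()
        target.points = o3d.utility.Vector3dVector(scene)

        # initialize ICP from the estimated pose (X_cam = R @ X_model + t)
        init = np.eye(4)
        init[:3, :3], init[:3, 3] = est_R, est_t

        result = o3d.pipelines.registration.registration_icp(
            source, target, dist_thresh, init,
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        return result.transformation

In practice the scene cloud would be cropped to the detected region before registration; the sketch registers against the full depth image for brevity.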

For detection, unlike the BOP19 version of CDPN, we used FCOS with a VoVNet-V2-57-FPN backbone [1]. We trained one detector per dataset, for 8 epochs with a batch size of 4 on a single GPU, 4 data-loading workers, and a learning rate of 1e-3. During training we applied color augmentation similar to AAE [2]; a sketch of this setup follows.
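
The sketch below mirrors the stated detector settings (8 epochs, batch size 4, 4 workers, learning rate 1e-3) around a toy stand-in model, with torchvision's ColorJitter as a rough stand-in for AAE-style color augmentation. The jitter magnitudes, the SGD optimizer, and the toy model/data are assumptions, not the submission's actual values.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from torchvision import transforms

    # rough stand-in for AAE-style color augmentation; magnitudes are assumptions
    color_aug = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                       saturation=0.4, hue=0.1)

    # toy stand-in for FCOS with a VoVNet-V2-57-FPN backbone (see [1])
    detector = nn.Conv2d(3, 1, 3)
    dataset = TensorDataset(torch.rand(32, 3, 64, 64), torch.rand(32, 1, 62, 62))

    # settings stated above: batch size 4, 4 workers, learning rate 1e-3, 8 epochs
    loader = DataLoader(dataset, batch_size=4, num_workers=4, shuffle=True)
    optimizer = torch.optim.SGD(detector.parameters(), lr=1e-3, momentum=0.9)
    loss_fn = nn.MSELoss()  # placeholder; FCOS uses its own detection losses

    for epoch in range(8):
        for images, targets in loader:
            images = color_aug(images)  # color jitter applied on-the-fly
            loss = loss_fn(detector(images), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()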

For pose estimation, the main differences between our CDPNv2 and the BOP19 version of CDPN are:

  • Domain Randomization: We used stronger domain-randomization operations than in BOP19. The details will be provided after the deadline.

  • Network Architecture: Since the organizers provide high-quality PBR synthetic training data in BOP20, we adopted a deeper 34-layer ResNet backbone instead of the 18-layer ResNet used in the BOP19 version of CDPN, and removed the extra concatenation structures of that version. The input and output resolutions are 256x256 and 64x64, respectively (see the architecture sketch after this list).

  • Training: The initial learning rate was 1e-4 and the batch size was 6. We used RMSProp with alpha = 0.99 and epsilon = 1e-8 to optimize the network. The model was trained for 160 epochs in total, and the learning rate was divided by 10 every 50 epochs (see the optimizer sketch after this list).

  • Other implementation details: for example, the coordinate labels were computed by back-projecting the rendered depth instead of by forward projection with a z-buffer (see the back-projection sketch after this list).
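
For the Network Architecture item, a minimal PyTorch sketch of a ResNet-34 backbone with an upsampling head mapping a 256x256 crop to 64x64 output maps might look as follows. The head layout (three stride-2 deconvolutions) and the 4 output channels (3 coordinate channels plus a mask channel) are assumptions for illustration, not the exact CDPNv2 head.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet34

    class CDPNv2Like(nn.Module):
        """ResNet-34 backbone + upsampling head: 256x256 RGB in, 64x64 maps out."""

        def __init__(self, out_channels=4):  # 3 coord channels + 1 mask (assumed)
            super().__init__()
            r = resnet34(weights=None)
            # backbone up to the last residual stage: (B, 512, 8, 8) for 256x256 input
            self.backbone = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                          r.layer1, r.layer2, r.layer3, r.layer4)

            def up(cin, cout):  # one 2x upsampling stage
                return nn.Sequential(
                    nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                    nn.BatchNorm2d(cout),
                    nn.ReLU(inplace=True))

            # three 2x stages: 8 -> 16 -> 32 -> 64, then a 1x1 output conv
            self.head = nn.Sequential(up(512, 256), up(256, 128), up(128, 64),
                                      nn.Conv2d(64, out_channels, 1))

        def forward(self, x):                    # x: (B, 3, 256, 256)
            return self.head(self.backbone(x))   # (B, out_channels, 64, 64)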
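For the Training item, the stated optimizer and schedule translate directly to PyTorch as below; the toy model, data, and loss are placeholders so the loop runs, and in practice a CDPNv2Like-style module as in the previous sketch would take their place.

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 4, 3, padding=1)  # placeholder for the pose network

    # as stated: RMSProp(alpha=0.99, eps=1e-8), initial lr 1e-4
    optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4,
                                    alpha=0.99, eps=1e-8)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    loss_fn = nn.L1Loss()  # placeholder; the actual CDPN losses differ

    for epoch in range(160):                # 160 epochs in total
        for _ in range(10):                 # placeholder for a batch-size-6 DataLoader
            x = torch.rand(6, 3, 64, 64)
            target = torch.rand(6, 4, 64, 64)
            loss = loss_fn(model(x), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                    # lr divided by 10 every 50 epochs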
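For the last item, the sketch below computes coordinate labels by back-projecting a rendered depth map, assuming the pose convention X_cam = R @ X_model + t; the function name and interface are illustrative.

    import numpy as np

    def coord_labels_from_depth(depth, K, R, t):
        """Back-project a rendered depth map into model-space coordinate labels.

        depth: (H, W) rendered depth, same units as t; K: (3, 3) intrinsics;
        R, t: object pose with the convention X_cam = R @ X_model + t.
        Returns an (H, W, 3) map of model coordinates (zeros where depth == 0).
        """
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
        cam = depth[..., None] * (pix @ np.linalg.inv(K).T)  # X_cam per pixel
        model = (cam - t.reshape(1, 1, 3)) @ R   # row form of R^T (X_cam - t)
        model[depth == 0] = 0.0                  # no label where nothing rendered
        return model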

[1] https://github.com/aim-uofa/AdelaiDet/tree/master/configs/FCOS-Detection/vovnet

[2] https://github.com/DLR-RM/AugmentedAutoencoder

Computer specifications CPU: Intel i7-7700; GPU: GTX 1070; RAM: 16 GB