BOP: Benchmark for 6D Object Pose Estimation

Method: CDPNv2_BOP20 (RGB-only & ICP)

User	wangg16
Publication	Li et al.: CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation, ICCV 2019
Implementation	https://github.com/LZGMatrix/BOP19_CDPN_2019ICCV/tree/bop2020
Training image modalities	RGB
Test image modalities	RGB-D
Description	In this setting, the models are trained in the same manner as in the RGB track of the BOP19 challenge and tested with depth-based ICP refinement. Concretely, for the LMO, HB, ICBIN and ITODD datasets, we use only the provided synthetic (PBR) training data. For YCBV, TUDL and TLESS, we use both the provided real data and the synthetic (PBR) data. For each dataset, we trained one CDPN model per object.

Detection: Unlike CDPN in BOP19, we used FCOS with a VoVNet-V2-57-FPN backbone [1]. We trained one detector per dataset. Each detector was trained for 8 epochs with a batch size of 4 on a single GPU, with 4 workers and a learning rate of 1e-3. We used color augmentation similar to AAE [2] during training.

Pose estimation: The main differences between our CDPNv2 and the BOP19-version CDPN are as follows.

Domain Randomization: We used stronger domain randomization operations than in BOP19. The details will be provided after the deadline.

Network Architecture: Because the organizers provide high-quality PBR synthetic training data in BOP20, we adopted a deeper 34-layer ResNet as the backbone instead of the 18-layer ResNet used in the BOP19-version CDPN. The concat structures of the BOP19-version CDPN were also removed. The input and output resolutions are 256×256 and 64×64, respectively.

Training: The initial learning rate was 1e-4 and the batch size was 6. We used RMSProp with alpha 0.99 and epsilon 1e-8 to optimize the network. The model was trained for 160 epochs in total, and the learning rate was divided by 10 every 50 epochs.

Other implementation details: The coordinate labels were computed by back-projection from the rendered depth instead of forward-projection with a z-buffer.

[1] https://github.com/aim-uofa/AdelaiDet/tree/master/configs/FCOS-Detection/vovnet
[2] https://github.com/DLR-RM/AugmentedAutoencoder
Computer specifications	Intel i7-7700; GPU: GTX 1070; Memory: 16G
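
The pose-network learning-rate schedule given in the description (initial rate 1e-4, divided by 10 every 50 epochs, 160 epochs in total) can be written out as a small sketch. The function name learning_rate and the zero-based epoch indexing are assumptions made for illustration; the description does not say whether the first drop happens at the start of epoch 50 or 51, and this is not taken from the authors' code.

```python
def learning_rate(epoch, base_lr=1e-4, step=50, factor=0.1):
    """Step decay: multiply the rate by `factor` once every `step` epochs.

    `epoch` is assumed to be zero-based (an assumption, not stated in the source).
    """
    return base_lr * factor ** (epoch // step)


# Learning rate for each of the 160 training epochs.
schedule = [learning_rate(epoch) for epoch in range(160)]

# Print the rate at the first epoch of each decay stage.
for epoch in (0, 50, 100, 150):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.0e}")
```

Under these assumptions the rate takes four values over the run, with the last ten epochs trained at a rate three orders of magnitude below the initial one.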

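The description states that coordinate labels were computed by back-projection from the rendered depth. A minimal sketch of that idea for a single pixel follows, assuming a pinhole camera with intrinsics fx, fy, cx, cy and a ground-truth pose (R, t) that maps object coordinates to camera coordinates as P_cam = R * P_obj + t. The function name and argument layout are hypothetical and are not taken from the authors' implementation.

```python
def backproject_to_object(u, v, depth, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) with rendered depth to a 3D point in the object frame.

    Assumes a pinhole camera and the pose convention P_cam = R * P_obj + t,
    where R is a 3x3 rotation (list of rows) and t a 3-vector.
    """
    # Back-project the pixel into the camera frame using the depth value.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    z = depth

    # Undo the translation, then apply R^T (the inverse of a rotation).
    p = (x - t[0], y - t[1], z - t[2])
    return tuple(sum(R[r][c] * p[r] for r in range(3)) for c in range(3))


# Example: object rotated 90 degrees about the camera z-axis, 2 units away.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 2.0)
point = backproject_to_object(50, 100, 2.0, 100.0, 100.0, 50.0, 50.0, R, t)
print(point)
```

Applying this to every pixel inside the object mask of a rendered depth map would give a dense coordinate label per visible pixel, without the per-vertex z-buffer test that forward projection needs.
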
Public submissions

	Date	Submission name	Dataset
	2020-08-19 18:32	bbox	TUD-L
	2020-08-19 19:25	bbox	YCB-V
	2020-08-19 21:23	icpv4	LM-O
	2020-08-19 22:03	icpv4	HB
	2020-08-19 22:12	cxt	T-LESS
	2020-08-19 23:21	icpv4	IC-BIN
	2020-08-19 23:38	bbox	ITODD