## BOP Challenge 2019

• 27/Jan/2020 - Submissions to the BOP Challenge 2019 have been re-evaluated.
• 28/Oct/2019 - The winners of the BOP Challenge 2019 have been announced! The submission form stays open and snapshots of the leaderboard will be presented and discussed at the next R6D workshops. A report with an analysis of the BOP Challenge 2019 results is in preparation.

The challenge is on the task of 6D localization of a varying number of instances of a varying number of objects in a single RGB-D image (the ViVo task).

Training Input: At training time, method $M$ learns using a training set, $T = \{T_o\}$, where $o$ is an object identifier. Training data $T_o$ may have different forms – a 3D mesh model of the object or a set of RGB-D images (synthetic or real) showing object instances in known 6D poses.

Test Input: At test time, method $M$ is provided with image $I$ and list $L = [(o_1, n_1), ..., (o_m, n_m)]$, where $n_i$ is the number of instances of object $o_i$ present in image $I$.

Test Output: Method $M$ produces list $E = [E_1, \dots, E_m]$, where $E_i$ is a list of $n_i$ pose estimates for instances of object $o_i$. Each estimate is given by a 3x3 rotation matrix, $\mathbf{R}$, a 3x1 translation vector, $\mathbf{t}$, and a confidence score, $s$.
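
As an illustration of the test input and output described above, the following minimal sketch represents $L$ and $E$ as plain Python structures and checks that $E$ contains exactly $n_i$ estimates for each object $o_i$. The names `PoseEstimate` and `check_output` are hypothetical and are not part of the BOP Toolkit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PoseEstimate:
    """One pose estimate (hypothetical container, not a BOP Toolkit type)."""
    R: List[List[float]]  # 3x3 rotation matrix
    t: List[float]        # 3x1 translation vector
    score: float          # confidence score s


def check_output(targets: List[Tuple[int, int]],
                 estimates: List[List[PoseEstimate]]) -> bool:
    """Return True if E matches L: one list E_i per (o_i, n_i), with n_i estimates."""
    if len(estimates) != len(targets):
        return False
    for (_, n_i), E_i in zip(targets, estimates):
        if len(E_i) != n_i:
            return False
        for est in E_i:
            rows_ok = len(est.R) == 3 and all(len(row) == 3 for row in est.R)
            if not rows_ok or len(est.t) != 3:
                return False
    return True
```

For example, with $L = [(5, 2), (8, 1)]$ the output must hold two estimates for object 5 and one estimate for object 8.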

The ViVo task is referred to as the 6D localization problem in [2]. In the BOP paper [1], methods were evaluated on a different task – 6D localization of a single instance of a single object (the SiSo task), which was chosen because it allowed all relevant methods to be evaluated out of the box. Since then, the state of the art has advanced and we have moved to the more challenging ViVo task.

### 2. Datasets

#### 2.1 Content of the Datasets

Multiple datasets are used for the evaluation. Every dataset includes 3D object models and training and test RGB-D images annotated with ground-truth 6D object poses and intrinsic camera parameters. Some datasets also include validation images – in this case, the ground-truth 6D object poses are publicly available only for the validation images, not for the test images. The 3D object models were created manually or using KinectFusion-like systems for 3D surface reconstruction [6]. The training images show individual objects from different viewpoints and are either captured by an RGB-D/Gray-D sensor or obtained by rendering the 3D object models. The test images were captured in scenes with graded complexity, often with clutter and occlusion. The datasets are provided in the BOP format.

#### 2.2 Training Data

For training, method $M$ can use the provided object models and training images and can render extra training images using the object models. Not a single pixel of test images may be used in training, nor the individual ground-truth poses or object masks provided for the test images. The range of all ground-truth poses in the test images, which is provided in file dataset_params.py in the BOP Toolkit, is the only information about the test set that can be used during training.

#### 2.3 List of Datasets

Core Datasets: LM-O, T-LESS, TUD-L, IC-BIN, ITODD, HB, YCB-V. Method $M$ needs to be evaluated on these 7 datasets to be considered for the main awards.
Other Datasets: LM, RU-APC, IC-MI, TYO-L.

Only subsets of the original datasets are used to speed up the evaluation. The subsets are defined in files test_targets_bop19.json provided with the datasets.

### 3. Awards

The following awards will be presented at the 5th International Workshop on Recovering 6D Object Pose at ICCV 2019:

1. The Overall Best Method – The top-performing method on the 7 core datasets.
2. The Best RGB-Only Method – The top-performing RGB-only method on the 7 core datasets.
3. The Best Fast Method – The top-performing method on the 7 core datasets with an average running time per image below 1 s.
4. The Best Open Source Method – The top-performing method on the 7 core datasets whose source code is publicly available.
5. The Best Method on Dataset D – The top-performing method on each of the 11 available datasets.

To be considered for the awards, authors need to provide an implementation of the method (source code or a binary file with instructions) which will be validated.

### 4. Evaluation Methodology

#### 4.1 Pose Error Functions

The error of an estimated pose $\hat{\textbf{P}}$ w.r.t. the ground-truth pose $\bar{\textbf{P}}$ of an object model $O$ is measured by three pose error functions defined below. Their implementation is available in the BOP Toolkit.

Visible Surface Discrepancy (VSD) [1,2]:

$e_\mathrm{VSD}(\hat{S}, \bar{S}, S_I, \hat{V}, \bar{V}, \tau) = \mathrm{avg}_{p \in \hat{V} \cup \bar{V}} \begin{cases} 0 & \text{if} \, p \in \hat{V} \cap \bar{V} \, \wedge \, |\hat{S}(p) - \bar{S}(p)| < \tau \\ 1 & \text{otherwise} \end{cases}$

$\hat{S}$ and $\bar{S}$ are distance maps obtained by rendering the object model $O$ in the estimated pose $\hat{\textbf{P}}$ and the ground-truth pose $\bar{\textbf{P}}$ respectively. As in [1,2], the distance maps are compared with the distance map $S_I$ of the test image $I$ to obtain the visibility masks $\hat{V}$ and $\bar{V}$, i.e. the sets of pixels where the model $O$ is visible in image $I$. Estimation of the visibility masks has been modified – at pixels with no depth measurements, an object is now considered visible (it was considered not visible in [1,2]). This modification allows evaluating poses of glossy objects from the ITODD dataset whose surface is not always captured by the depth sensors. $\tau$ is a misalignment tolerance. See Section 2.2 of [1] for details.
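
The VSD formula above can be sketched as follows, assuming the distance maps and the visibility masks have already been computed. Rendering and the estimation of the masks from $S_I$ are not shown. The function name `vsd_error`, the dictionary and set representation of maps and masks, and the value returned for an empty union are assumptions made for illustration; the reference implementation is the one in the BOP Toolkit.

```python
from typing import Dict, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def vsd_error(S_est: Dict[Pixel, float], S_gt: Dict[Pixel, float],
              V_est: Set[Pixel], V_gt: Set[Pixel], tau: float) -> float:
    """Average per-pixel cost over the union of the two visibility masks."""
    union = V_est | V_gt
    if not union:
        # Assumption: with no visible pixels at all, report the maximum error.
        return 1.0
    both = V_est & V_gt
    cost = 0
    for p in union:
        # Cost 0 only if the pixel is visible in both masks and the
        # rendered distances agree within the tolerance tau.
        matched = p in both and abs(S_est[p] - S_gt[p]) < tau
        cost += 0 if matched else 1
    return cost / len(union)
```

A pixel that is visible in only one of the two masks always contributes a cost of 1, so the error grows both with depth misalignment and with disagreement of the visible silhouettes.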

Maximum Symmetry-Aware Surface Distance (MSSD) [3]:

$e_{\text{MSSD}} = \text{min}_{\textbf{S} \in S_O} \text{max}_{\textbf{x} \in O} \Vert \hat{\textbf{P}}\textbf{x} - \bar{\textbf{P}}\textbf{S}\textbf{x} \Vert_2$

$S_O$ is a set of symmetry transformations of object model $O$ (see Section 4.2). The maximum distance is relevant for robotic manipulation, where the maximum surface deviation strongly indicates the chance of a successful grasp. The maximum distance is also less dependent on the sampling strategy of the model vertices than the average distance used in ADD/ADI [2, 5], which tends to be dominated by higher-frequency surface parts.
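
A minimal sketch of the MSSD formula in plain Python follows. It assumes that poses and symmetry transformations are given as pairs of a 3x3 rotation matrix and a translation vector, and that the list of symmetries contains the identity. The helper names `transform` and `mssd` are hypothetical and do not mirror the BOP Toolkit API.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]
Pose = Tuple[Mat, Vec]  # (R, t): rotation matrix and translation vector


def transform(pose: Pose, x: Vec) -> List[float]:
    """Apply the rigid transformation R x + t to a 3D point."""
    R, t = pose
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]


def mssd(pose_est: Pose, pose_gt: Pose,
         symmetries: Sequence[Pose], vertices: Sequence[Vec]) -> float:
    """Minimum over symmetries of the maximum vertex distance."""
    best = math.inf
    for sym in symmetries:
        worst = 0.0
        for x in vertices:
            a = transform(pose_est, x)                  # estimated pose applied to x
            b = transform(pose_gt, transform(sym, x))   # ground truth applied to S x
            worst = max(worst, math.dist(a, b))
        best = min(best, worst)
    return best
```

For an object that is symmetric under a 180° rotation about its Z axis, an estimate that is flipped by 180° is penalized only for its residual misalignment once that rotation is included in the set of symmetries.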

Maximum Symmetry-Aware Projection Distance (MSPD):

$e_{\text{MSPD}} = \text{min}_{\textbf{S} \in S_O} \text{max}_{\textbf{x} \in O} \Vert \text{proj}( \hat{\textbf{P}}\textbf{x} ) - \text{proj}( \bar{\textbf{P}}\textbf{S}\textbf{x} ) \Vert_2$

$\text{proj}$ is the 2D projection operation (the result is in pixels) and the meaning of the other symbols is as in MSSD. Compared to the 2D projection [4], MSPD considers object symmetries and replaces the average by the maximum distance to increase robustness against the model sampling. As MSPD does not evaluate the alignment along the optical axis (Z axis) and measures only the perceivable discrepancy, it is relevant for augmented reality applications and suitable for the evaluation of RGB-only methods.
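
The MSPD formula can be sketched in the same way. The sketch assumes a pinhole camera with intrinsic matrix $\mathbf{K}$ and points with positive depth; the helper names `transform`, `proj` and `mspd` are hypothetical and do not mirror the BOP Toolkit API.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]
Pose = Tuple[Mat, Vec]  # (R, t): rotation matrix and translation vector


def transform(pose: Pose, x: Vec) -> List[float]:
    """Apply the rigid transformation R x + t to a 3D point."""
    R, t = pose
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]


def proj(K: Mat, X: Vec) -> Tuple[float, float]:
    """Project a 3D point in camera coordinates to pixels (assumes X[2] > 0)."""
    u = K[0][0] * X[0] / X[2] + K[0][2]
    v = K[1][1] * X[1] / X[2] + K[1][2]
    return (u, v)


def mspd(pose_est: Pose, pose_gt: Pose, K: Mat,
         symmetries: Sequence[Pose], vertices: Sequence[Vec]) -> float:
    """Minimum over symmetries of the maximum 2D projection distance in pixels."""
    best = math.inf
    for sym in symmetries:
        worst = 0.0
        for x in vertices:
            a = proj(K, transform(pose_est, x))
            b = proj(K, transform(pose_gt, transform(sym, x)))
            worst = max(worst, math.dist(a, b))
        best = min(best, worst)
    return best
```

A point on the optical axis projects to the principal point at any depth, so for such a point a pure translation error along the Z axis leaves MSPD at zero, in line with the remark above that MSPD does not evaluate the alignment along the optical axis.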

#### 4.2 Identifying Object Symmetries

The set of potential symmetry transformations (used in MSSD and MSPD) is defined as $S'_O = \{\textbf{S}: h(O, \textbf{S}O) < \varepsilon \}$, where $h$ is the Hausdorff distance calculated between the vertices of object model $O$. The allowed deviation is bounded by $\varepsilon = \text{max}(15\,\text{mm}, 0.1d)$, where $d$ is the diameter of model $O$ (the largest distance between any pair of model vertices) and the truncation at $15\,\text{mm}$ prevents very small details from breaking the symmetries. The set of symmetry transformations $S_O$ is a subset of $S'_O$ and consists of those symmetry transformations which cannot be resolved by the model texture (decided subjectively).

Set $S_O$ covers both discrete and continuous rotational symmetries. The continuous rotational symmetries are discretized such that the vertex which is the furthest from the axis of symmetry travels no more than $1\%$ of the object diameter between two consecutive rotations. The symmetry transformations are stored in files models_info.json provided with the datasets.
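
The tolerance $\varepsilon$ and the discretization rule can be illustrated with a short sketch. It assumes that the distance travelled by the furthest vertex is measured as arc length along its circular path; this interpretation, as well as the function names, is an assumption made here and may differ from the implementation in the BOP Toolkit.

```python
import math
from typing import List


def symmetry_tolerance(diameter_mm: float) -> float:
    """Allowed Hausdorff deviation: epsilon = max(15 mm, 0.1 d)."""
    return max(15.0, 0.1 * diameter_mm)


def continuous_symmetry_steps(max_axis_dist_mm: float, diameter_mm: float,
                              max_step_fraction: float = 0.01) -> int:
    """Smallest number of equal rotations around the axis such that the
    furthest vertex moves at most max_step_fraction * d per step.

    Assumption: the travelled distance is the arc length r * delta_theta.
    """
    max_arc = max_step_fraction * diameter_mm
    circumference = 2.0 * math.pi * max_axis_dist_mm
    return max(1, math.ceil(circumference / max_arc))


def rotation_angles(n_steps: int) -> List[float]:
    """Angles (in radians) of the discretized rotations around the axis."""
    return [2.0 * math.pi * k / n_steps for k in range(n_steps)]
```

Under this assumption, an object with a diameter of 100 mm whose furthest vertex lies 50 mm from the axis needs $\lceil 2\pi \cdot 50 / 1 \rceil = 315$ discrete rotations.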

#### 4.3 Performance Score

The performance of method $M$ w.r.t. pose error function $e_{\text{VSD}}$ is measured by the average recall $\text{AR}_{\text{VSD}}$ defined as the average of the recall rates for $\tau$ ranging from $5\%$ to $50\%$ of the object diameter with a step of $5\%$, and for the threshold of correctness $\theta_{\text{VSD}}$ ranging from $0.05$ to $0.5$ with a step of $0.05$. The recall rate is the fraction of annotated object instances for which a correct object pose was estimated. A pose estimate is considered correct if $e_{\text{VSD}} < \theta_{\text{VSD}}$.

Similarly, the performance w.r.t. $e_{\text{MSSD}}$ is measured by the average recall $\text{AR}_{\text{MSSD}}$ defined as the average of the recall rates for the threshold of correctness $\theta_{\text{MSSD}}$ ranging from $5\%$ to $50\%$ of the object diameter with a step of $5\%$. The performance w.r.t. $e_{\text{MSPD}}$ is measured by $\text{AR}_{\text{MSPD}}$ defined as the average of the recall rates for $\theta_{\text{MSPD}}$ ranging from $5r\,px$ to $50r\,px$ with a step of $5r\,px$, where $r = w/640$ and $w$ is the width of the image.

The performance on a dataset is measured by the average recall $\text{AR} = (\text{AR}_{\text{VSD}} + \text{AR}_{\text{MSSD}} + \text{AR}_{\text{MSPD}}) / 3$. The overall performance on the core datasets is measured by $\text{AR}_{\text{Core}}$ defined as the average of the per-dataset average recalls $\text{AR}$. In this way, each dataset is treated as a separate sub-challenge which avoids the overall score being dominated by larger datasets.
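
The scores defined in this section can be sketched as follows, assuming the pose errors have already been computed for every annotated object instance (with an infinite error for instances without an estimate). The function names and the input layout are hypothetical. The sketch also assumes that the strict inequality $e < \theta$ stated for VSD applies to MSSD and MSPD as well, and that MSSD errors are divided by the object diameter beforehand.

```python
from typing import Dict, List, Sequence

# Ten thresholds 0.05, 0.10, ..., 0.50 (fractions of the diameter or VSD thresholds).
STEPS: List[float] = [0.05 * k for k in range(1, 11)]


def recall(errors: Sequence[float], theta: float) -> float:
    """Fraction of annotated instances with error below the threshold."""
    return sum(1 for e in errors if e < theta) / len(errors)


def average_recall(errors: Sequence[float], thresholds: Sequence[float]) -> float:
    """Mean of the recall rates over all thresholds of correctness."""
    return sum(recall(errors, th) for th in thresholds) / len(thresholds)


def ar_vsd(errors_by_tau: Dict[float, Sequence[float]]) -> float:
    """errors_by_tau maps each tolerance tau to the per-instance e_VSD values."""
    per_tau = [average_recall(errs, STEPS) for errs in errors_by_tau.values()]
    return sum(per_tau) / len(per_tau)


def ar_mssd(errors_over_diameter: Sequence[float]) -> float:
    """Per-instance e_MSSD values, each divided by the object diameter."""
    return average_recall(errors_over_diameter, STEPS)


def ar_mspd(errors_px: Sequence[float], image_width: int) -> float:
    """Per-instance e_MSPD values in pixels; thresholds scale with r = w / 640."""
    r = image_width / 640.0
    return average_recall(errors_px, [5.0 * k * r for k in range(1, 11)])


def dataset_ar(vsd: float, mssd: float, mspd: float) -> float:
    """AR of one dataset: the mean of the three average recalls."""
    return (vsd + mssd + mspd) / 3.0


def core_ar(per_dataset_ar: Sequence[float]) -> float:
    """AR_Core: the mean of the per-dataset AR scores."""
    return sum(per_dataset_ar) / len(per_dataset_ar)
```

Because `core_ar` averages the per-dataset scores rather than pooling all instances, a dataset with many test instances has the same weight as a small one.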

### 5. How to Participate

To have your method evaluated, run it on the ViVo task and submit the results in the format described below to the BOP evaluation system (the evaluation script used is publicly available in the BOP Toolkit). Each method has to use an identical set of hyper-parameters across all objects and datasets.

The list of object instances for which the pose is to be estimated can be found in files test_targets_bop19.json provided with the datasets. For each object instance in the list, at least $10\%$ of the object surface is visible in the respective image [1].

#### 5.1 Format of Results

Results for all test images from one dataset are saved in one CSV file, with one pose estimate per line in the following format:

scene_id,im_id,obj_id,score,R,t,time

• scene_id, im_id, and obj_id are the IDs of the scene, image, and object, respectively.
• score is a confidence of the estimate (the range of confidence values is not restricted).
• R is a 3x3 rotation matrix whose elements are saved row-wise and separated by a white space (i.e. r11 r12 r13 r21 r22 r23 r31 r32 r33, where rij is an element from the i-th row and the j-th column of the matrix).
• t is a 3x1 translation vector (in mm) whose elements are separated by a white space (i.e. t1 t2 t3).
• time is the time method $M$ took to make estimates for all objects in image im_id from scene scene_id. All estimates with the same scene_id and im_id must have the same value of time. Report the wall time from the point right after the raw data (the image, 3D object models etc.) is loaded to the point when the final pose estimates are available (a single real number in seconds, -1 if not available).
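
A minimal sketch of how one pose estimate could be serialized into a line of this format is given below. The function name `format_result_line` is hypothetical; the BOP Toolkit provides its own routines for saving results, which should be preferred for actual submissions.

```python
from typing import Sequence


def format_result_line(scene_id: int, im_id: int, obj_id: int, score: float,
                       R: Sequence[Sequence[float]], t: Sequence[float],
                       time: float = -1.0) -> str:
    """Build one line 'scene_id,im_id,obj_id,score,R,t,time'.

    R is written row-wise (r11 r12 ... r33) and t as 't1 t2 t3' in mm,
    with elements separated by a single space. time is in seconds
    (-1 if not available).
    """
    R_str = " ".join(repr(float(v)) for row in R for v in row)
    t_str = " ".join(repr(float(v)) for v in t)
    return f"{scene_id},{im_id},{obj_id},{score},{R_str},{t_str},{time}"
```

All lines written for the same scene_id and im_id should carry the same time value, as required above.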

$\mathbf{P} = \mathbf{K} [\mathbf{R} \, \mathbf{t}]$ is the camera matrix which transforms 3D point $\mathbf{x}_m = [x, y, z, 1]'$ in the model coordinates to 2D point $\mathbf{x}_i = [u, v, 1]'$ in the image coordinates: $s\mathbf{x}_i = \mathbf{P} \mathbf{x}_m$. The camera coordinate system is defined as in OpenCV with the camera looking along the $Z$ axis. Camera intrinsic matrix $\mathbf{K}$ is provided with the test images and might be different for each image.

Example results can be found here.

#### 5.2 Terms & Conditions

After the results are evaluated, the authors can decide whether to make the evaluation scores visible to the public. To be considered for the awards, authors need to provide an implementation of the method (source code or a binary file with instructions) which will be validated. For the results to be included in a publication about the challenge, documentation of the method, including the specifications of the computer used, needs to be provided through the online submission form. Without the documentation, the scores will be listed on the website but will not be considered for inclusion in the publication.

### 6. Organizers

Tomáš Hodaň, Czech Technical University in Prague
Eric Brachmann, Heidelberg University
Bertram Drost, MVTec
Frank Michel, Technical University Dresden
Martin Sundermeyer, DLR German Aerospace Center
Jiří Matas, Czech Technical University in Prague
Carsten Rother, Heidelberg University

### References

[1] Hodaň, Michel et al.: BOP: Benchmark for 6D Object Pose Estimation, ECCV'18.

[2] Hodaň et al.: On Evaluation of 6D Object Pose Estimation, ECCVW'16.

[3] Drost et al.: Introducing MVTec ITODD - A Dataset for 3D Object Recognition in Industry, ICCVW'17.

[4] Brachmann et al.: Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image, CVPR'16.

[5] Hinterstoisser et al.: Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes, ACCV'12.

[6] Newcombe et al.: KinectFusion: Real-time dense surface mapping and tracking, ISMAR'11.