- 2019-10-22 10:16 - Sundermeyer-IJCV19+ICP - ITODD by MartinSmeyer
- 2019-10-22 10:00 - Sundermeyer-IJCV19+ICP - T-LESS by MartinSmeyer
- 2019-10-22 09:09 - Sundermeyer-IJCV19+ICP - YCB-V by MartinSmeyer
- 2019-10-22 07:57 - Vidal-Sensors18 - YCB-V by jolvid
- 2019-10-22 07:56 - Vidal-Sensors18 - T-LESS by jolvid
- 2019-10-22 07:53 - Vidal-Sensors18 - IC-BIN by jolvid
- 2019-10-22 07:52 - Vidal-Sensors18 - TUD-L by jolvid
- 2019-10-22 07:43 - Vidal-Sensors18 - HB by jolvid
- 2019-10-22 07:34 - Vidal-Sensors18 - ITODD by jolvid
- 2019-10-22 07:27 - Vidal-Sensors18 - LM-O by jolvid

Deadline for submission of results: ~~October 14~~, **October 21, 2019**
(11:59PM PST)

Presentation of awards: **October 28, 2019**
(at the ICCV 2019
workshop)

The challenge is on the task of 6D localization of a *varying*
number of instances of a *varying* number of objects in a single image (the ViVo task).

**Training Input:**
At training time, method $M$
learns using a training set,
$T = \{T_o\}$, where $o$ is an object identifier.
Training data
$T_o$ may have different forms – a 3D mesh model of the
object or a set of RGB-D images (synthetic or real) showing object
instances in known 6D poses.

**Test Input:**
At test time, method $M$ is provided
with image $I$ and list
$L = [(o_1, n_1), \dots, (o_m, n_m)]$,
where $n_i$ is the number of instances of object $o_i$
present in image $I$.

**Test Output:**
Method $M$ produces list
$E = [E_1, \dots, E_m]$, where $E_i$
is a list of $n_i$ pose estimates for instances of object $o_i$. Each
estimate is given by a 3x3 rotation matrix, $\mathbf{R}$, a 3x1
translation vector, $\mathbf{t}$, and a confidence score, $s$.
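
The input and output lists above can be pictured as plain data structures. The following is a minimal sketch, not part of the BOP Toolkit: the names `PoseEstimate` and `check_output` are hypothetical, and the check follows a strict reading of the definition (exactly $n_i$ estimates per object $o_i$).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PoseEstimate:
    """One pose estimate (hypothetical container, not a BOP Toolkit type)."""
    R: List[List[float]]  # 3x3 rotation matrix, given as three rows
    t: List[float]        # 3x1 translation vector
    score: float          # confidence score s


def check_output(L: List[Tuple[int, int]], E: List[List[PoseEstimate]]) -> bool:
    """Return True if E has one list per (o_i, n_i) entry of L and each
    list E_i holds exactly n_i estimates (strict reading of the text)."""
    if len(E) != len(L):
        return False
    return all(len(E_i) == n_i for (_, n_i), E_i in zip(L, E))
```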

The ViVo task is referred to as the *6D localization problem* in
[2]. In the BOP paper [1], methods were evaluated on a different task – 6D
localization of a single instance of a single object (the SiSo task),
which was chosen because it allowed all relevant methods to be evaluated out
of the box. Since then, the state of the art has advanced and we have
moved to the more challenging ViVo task.

Multiple datasets are used for the evaluation. Every dataset includes 3D object models and training and test RGB-D images annotated with ground-truth 6D object poses and intrinsic camera parameters. Some datasets also include validation images – in this case, the ground-truth 6D object poses are publicly available only for the validation images, not for the test images. The 3D object models were created manually or using KinectFusion-like systems for 3D surface reconstruction [6]. The training images show individual objects from different viewpoints and are either captured by an RGB-D/Gray-D sensor or obtained by rendering the 3D object models. The test images were captured in scenes with graded complexity, often with clutter and occlusion. The datasets are provided in the BOP format.

For training, method $M$ can use the provided object models and training images and can render extra training images using the object models. Not a single pixel of test images may be used in training, nor the individual ground-truth poses or object masks provided for the test images. The range of all ground-truth poses in the test images, which is provided in file *dataset_params.py* in the BOP Toolkit, is the only information about the test set that can be used during training.

Core Datasets:
LM-O,
T-LESS,
TUD-L,
IC-BIN,
ITODD,
HB,
YCB-V.
Method $M$ needs to be evaluated on all 7 of these datasets to be
considered for the main awards.

Other Datasets:
LM,
RU-APC,
IC-MI,
TYO-L.

Only subsets of the original datasets are used to speed up the evaluation.
The subsets are defined in files *test_targets_bop19.json* provided
with the datasets.

The following awards will be presented at the 5th International Workshop on Recovering 6D Object Pose at ICCV 2019:

- *The Overall Best Method* – The top-performing method on the 7 core datasets.
- *The Best RGB-Only Method* – The top-performing RGB-only method on the 7 core datasets.
- *The Best Fast Method* – The top-performing method on the 7 core datasets with the average running time per image below 1s.
- *The Best Open Source Method* – The top-performing method on the 7 core datasets whose source code is publicly available.
- *The Best Method on Dataset D* – The top-performing method on each of the 11 available datasets.

To be considered for the awards, authors need to provide an implementation of the method (source code or a binary file with instructions) which will be validated.

The error of an estimated pose $\hat{\textbf{P}}$ w.r.t. the ground-truth pose $\bar{\textbf{P}}$ of an object model $O$ is measured by three pose error functions defined below. Their implementation is available in the BOP Toolkit.

**Visible Surface Discrepancy (VSD) [1,2]:**

$
e_\mathrm{VSD}(\hat{S}, \bar{S}, S_I, \hat{V}, \bar{V}, \tau) =
\mathrm{avg}_{p \in \hat{V} \cup \bar{V}}
\begin{cases}
0 & \text{if} \, p \in \hat{V} \cap \bar{V} \, \wedge \, |\hat{S}(p) -
\bar{S}(p)| < \tau \\
1 & \text{otherwise}
\end{cases}
$

$\hat{S}$ and $\bar{S}$ are distance maps obtained by rendering the object model $O$ in the estimated pose $\hat{\textbf{P}}$ and the ground-truth pose $\bar{\textbf{P}}$ respectively. As in [1,2], the distance maps are compared with the distance map $S_I$ of the test image $I$ to obtain the visibility masks $\hat{V}$ and $\bar{V}$, i.e. the sets of pixels where the model $O$ is visible in image $I$. Estimation of the visibility masks has been modified – at pixels with no depth measurements, an object is now considered visible (it was considered not visible in [1,2]). This modification allows evaluating poses of glossy objects from the ITODD dataset whose surface is not always captured by the depth sensors. $\tau$ is a misalignment tolerance. See Section 2.2 of [1] for details.
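
As an illustration of the formula above, here is a minimal NumPy sketch of $e_\mathrm{VSD}$. It assumes the distance maps and the visibility masks have already been computed (the comparison against $S_I$ is not shown), and the function name `vsd_error` is hypothetical. Returning the maximal error when neither mask has any pixel is also an assumption, since the text does not cover that case.

```python
import numpy as np


def vsd_error(S_est, S_gt, V_est, V_gt, tau):
    """Sketch of e_VSD given rendered distance maps and visibility masks."""
    S_est = np.asarray(S_est, dtype=float)
    S_gt = np.asarray(S_gt, dtype=float)
    V_est = np.asarray(V_est, dtype=bool)
    V_gt = np.asarray(V_gt, dtype=bool)

    union = V_est | V_gt
    n_union = int(np.count_nonzero(union))
    if n_union == 0:
        return 1.0  # assumption: nothing visible counts as maximal error

    # A pixel costs 0 only if it is visible in both masks and the two
    # rendered distances differ by less than tau; otherwise it costs 1.
    matched = (V_est & V_gt) & (np.abs(S_est - S_gt) < tau)
    return (n_union - int(np.count_nonzero(matched))) / n_union
```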

**Maximum Symmetry-Aware Surface Distance (MSSD) [3]:**

$
e_{\text{MSSD}} = \text{min}_{\textbf{S} \in S_O} \text{max}_{\textbf{x}
\in O}
\Vert \hat{\textbf{P}}\textbf{x} - \bar{\textbf{P}}\textbf{S}\textbf{x}
\Vert_2
$

$S_O$ is a set of symmetry transformations of object model $O$ (see Section 4.2). The maximum distance is relevant for robotic manipulation, where the maximum surface deviation strongly indicates the chance of a successful grasp. The maximum distance is also less dependent on the sampling strategy of the model vertices than the average distance used in ADD/ADI [2, 5], which tends to be dominated by higher-frequency surface parts.
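
The MSSD formula can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the BOP Toolkit implementation: poses and symmetries are passed as `(R, t)` pairs, the model is an `N x 3` array of vertices, and the list `syms` is assumed to contain the identity transformation.

```python
import numpy as np


def transform(R, t, pts):
    """Apply the rigid transformation (R, t) to an N x 3 array of points."""
    return pts @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float).reshape(1, 3)


def mssd(R_est, t_est, R_gt, t_gt, pts, syms):
    """Sketch of e_MSSD: minimum over symmetries of the maximum vertex distance."""
    pts = np.asarray(pts, dtype=float)
    pts_est = transform(R_est, t_est, pts)
    errors = []
    for R_s, t_s in syms:
        # Ground-truth pose applied to the symmetry-transformed model.
        pts_gt = transform(R_gt, t_gt, transform(R_s, t_s, pts))
        errors.append(np.linalg.norm(pts_est - pts_gt, axis=1).max())
    return float(min(errors))
```

With a model that is symmetric under a 180° rotation about the Z axis, an estimate that is rotated by 180° relative to the ground truth gets zero error once that symmetry is included in `syms`.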

**Maximum Symmetry-Aware Projection Distance (MSPD):**

$
e_{\text{MSPD}} = \text{min}_{\textbf{S} \in S_O} \text{max}_{\textbf{x}
\in O}
\Vert \text{proj}( \hat{\textbf{P}}\textbf{x} ) - \text{proj}(
\bar{\textbf{P}}\textbf{S}\textbf{x} ) \Vert_2
$

$\text{proj}$ is the 2D projection operation (the result is in pixels) and
the meaning of the other symbols is as in MSSD.
Compared to the *2D projection* [4],
MSPD considers object symmetries and replaces the average by the
maximum distance to increase robustness against the model sampling.
As MSPD does not evaluate the alignment along the optical axis (Z axis) and
measures only the perceivable discrepancy, it is relevant for augmented
reality applications and suitable for the evaluation of RGB-only methods.
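
A corresponding sketch for MSPD follows. It assumes a pinhole camera with intrinsic matrix `K` and no lens distortion, takes the model as an `N x 3` vertex array, and expects `syms` to contain the identity. The function names are hypothetical.

```python
import numpy as np


def project(K, R, t, pts):
    """Project N x 3 model points to pixel coordinates with pose (R, t)."""
    cam = pts @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float).reshape(1, 3)
    uvw = cam @ np.asarray(K, dtype=float).T
    return uvw[:, :2] / uvw[:, 2:3]


def mspd(K, R_est, t_est, R_gt, t_gt, pts, syms):
    """Sketch of e_MSPD: minimum over symmetries of the maximum 2D distance (px)."""
    pts = np.asarray(pts, dtype=float)
    proj_est = project(K, R_est, t_est, pts)
    errors = []
    for R_s, t_s in syms:
        pts_sym = pts @ np.asarray(R_s, dtype=float).T + np.asarray(t_s, dtype=float).reshape(1, 3)
        proj_gt = project(K, R_gt, t_gt, pts_sym)
        errors.append(np.linalg.norm(proj_est - proj_gt, axis=1).max())
    return float(min(errors))
```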

The set of potential symmetry transformations (used in MSSD and MSPD) is defined as $S'_O = \{\textbf{S}: h(O, \textbf{S}O) < \varepsilon \}$, where $h$ is the Hausdorff distance calculated between the vertices of object model $O$. The allowed deviation is bounded by $\varepsilon = \text{max}(15\,mm, 0.1d)$, where $d$ is the diameter of model $O$ (the largest distance between any pair of model vertices) and the truncation at $15\,mm$ avoids breaking the symmetries by too small details. The set of symmetry transformations $S_O$ is a subset of $S'_O$ and consists of those symmetry transformations which cannot be resolved by the model texture (decided subjectively).
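
The membership test for $S'_O$ can be sketched as below. This is a brute-force illustration that assumes the Hausdorff distance is taken between the model vertices and the same vertices after applying the candidate transformation, with all coordinates in mm. The function names are hypothetical, and the quadratic memory use makes it suitable only for small vertex sets.

```python
import numpy as np


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets (brute force)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


def is_potential_symmetry(vertices, R_s, t_s):
    """Check h(O, S O) < eps with eps = max(15 mm, 0.1 * diameter)."""
    vertices = np.asarray(vertices, dtype=float)
    pairwise = np.linalg.norm(vertices[:, None, :] - vertices[None, :, :], axis=2)
    diameter = pairwise.max()  # largest distance between any pair of vertices
    eps = max(15.0, 0.1 * diameter)
    transformed = vertices @ np.asarray(R_s, dtype=float).T + np.asarray(t_s, dtype=float).reshape(1, 3)
    return hausdorff(vertices, transformed) < eps
```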

Set $S_O$ covers both discrete and continuous rotational symmetries.
The continuous rotational symmetries are discretized such that the vertex
which is the furthest from the axis of symmetry travels no more than $1\%$
of the object diameter between two consecutive rotations.
The symmetry transformations are stored in files
*models_info.json* provided with the datasets.
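
To make the discretization rule concrete, the sketch below computes how many rotations are needed for a continuous symmetry. It assumes that "travels" refers to the arc length along the circle traced by the furthest vertex, which is an interpretation rather than something stated in the text, and the function name is hypothetical.

```python
import math


def num_discrete_rotations(max_axis_dist, diameter, max_travel_frac=0.01):
    """Smallest number of equal rotation steps over 360 degrees such that the
    vertex furthest from the axis moves at most max_travel_frac * diameter
    (measured as arc length) per step."""
    max_arc = max_travel_frac * diameter
    circumference = 2.0 * math.pi * max_axis_dist
    return math.ceil(circumference / max_arc)
```

For example, with the furthest vertex 50 mm from the axis and an object diameter of 200 mm, each step may move that vertex by at most 2 mm, which requires 158 rotations.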

The performance of method $M$ w.r.t. pose error function $e_{\text{VSD}}$ is measured by the average recall $\text{AR}_{\text{VSD}}$ defined as the average of the recall rates for $\tau$ ranging from $5\%$ to $50\%$ of the object diameter with a step of $5\%$, and for the threshold of correctness $\theta_{\text{VSD}}$ ranging from $0.05$ to $0.5$ with a step of $0.05$. The recall rate is the fraction of annotated object instances for which a correct object pose was estimated. A pose estimate is considered correct if $e_{\text{VSD}} < \theta_{\text{VSD}}$.

Similarly, the performance w.r.t. $e_{\text{MSSD}}$ is measured by the average recall $\text{AR}_{\text{MSSD}}$ defined as the average of the recall rates for the threshold of correctness $\theta_{\text{MSSD}}$ ranging from $5\%$ to $50\%$ of the object diameter with a step of $5\%$. The performance w.r.t. $e_{\text{MSPD}}$ is measured by $\text{AR}_{\text{MSPD}}$ defined as the average of the recall rates for $\theta_{\text{MSPD}}$ ranging from $5r\,px$ to $50r\,px$ with a step of $5r\,px$, where $r = w/640$ and $w$ is the width of the image.
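
The threshold sweeps described above can be sketched as follows. The helper names are hypothetical. For MSSD, the errors are assumed to be divided by the object diameter beforehand so that one threshold list applies to all objects. For VSD, the same averaging would additionally run over the ten values of $\tau$ (not shown).

```python
def relative_thresholds():
    """Thresholds 0.05, 0.10, ..., 0.50 (fractions of the object diameter)."""
    return [0.05 * k for k in range(1, 11)]


def mspd_thresholds(image_width):
    """Thresholds 5r, 10r, ..., 50r in pixels, with r = w / 640."""
    r = image_width / 640.0
    return [5.0 * r * k for k in range(1, 11)]


def average_recall(errors, thetas):
    """Average of the recall rates over all thresholds.

    errors holds one value per annotated object instance; None marks an
    instance for which no estimate was produced (counted as incorrect).
    """
    recalls = []
    for theta in thetas:
        correct = sum(1 for e in errors if e is not None and e < theta)
        recalls.append(correct / len(errors))
    return sum(recalls) / len(recalls)
```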

The performance on a dataset is measured by the average recall $\text{AR} = (\text{AR}_{\text{VSD}} + \text{AR}_{\text{MSSD}} + \text{AR}_{\text{MSPD}}) / 3$. The overall performance on the core datasets is measured by $\text{AR}_{\text{Core}}$ defined as the average of the per-dataset average recalls $\text{AR}$. In this way, each dataset is treated as a separate sub-challenge which avoids the overall score being dominated by larger datasets.
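
The two averaging steps are simple enough to write out directly. The sketch below uses hypothetical function names, and the numbers in the usage comment are made up for illustration only.

```python
def dataset_ar(ar_vsd, ar_mssd, ar_mspd):
    """AR on one dataset: mean of the three per-error-function recalls."""
    return (ar_vsd + ar_mssd + ar_mspd) / 3.0


def core_ar(per_dataset_ar):
    """AR_Core: unweighted mean of the per-dataset AR values, so each
    dataset counts equally regardless of its size."""
    return sum(per_dataset_ar.values()) / len(per_dataset_ar)


# Usage with made-up scores for two datasets:
# core_ar({"A": dataset_ar(0.3, 0.6, 0.9), "B": dataset_ar(0.2, 0.4, 0.6)})
```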

To have your method evaluated, run it on the ViVo task and submit the
results in the format described below to the
BOP evaluation system
(the evaluation script used is publicly available in the
BOP
Toolkit).
**Each method has to use an identical set of hyper-parameters across all objects and datasets.**

The list of object instances for which the pose is to be
estimated can be found in files *test_targets_bop19.json* provided
with the datasets. For each object instance in the list, at least $10\%$ of
the object surface is visible in the respective image [1].

Results for all test images from one dataset are saved in one CSV file, with one pose estimate per line in the following format:

`scene_id,im_id,obj_id,score,R,t,time`

`scene_id`, `im_id`, and `obj_id` are the IDs of the scene, image, and object respectively. `score` is the confidence of the estimate (the range of confidence values is not restricted). `R` is a 3x3 rotation matrix whose elements are saved row-wise and separated by white space (i.e. `r11 r12 r13 r21 r22 r23 r31 r32 r33`, where $r_{ij}$ is the element from the $i$-th row and the $j$-th column of the matrix). `t` is a 3x1 translation vector (in mm) whose elements are separated by white space (i.e. `t1 t2 t3`). `time` is the time method $M$ took to make estimates for all objects in image `im_id` from scene `scene_id`. All estimates with the same `scene_id` and `im_id` must have the same value of `time`. Report the wall time from the point right after the raw data (the image, 3D object models etc.) is loaded to the point when the final pose estimates are available (a single real number in seconds, -1 if not available).
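
A minimal sketch of writing results in this format with the Python standard library is shown below. The helper names are hypothetical, and whether the file should start with a header line is not stated in the text, so none is written here.

```python
import csv
import io


def format_row(scene_id, im_id, obj_id, score, R, t, time=-1):
    """Build one CSV row; R is a 3x3 nested list (row-wise), t has 3 values in mm."""
    R_str = " ".join(repr(float(v)) for row in R for v in row)
    t_str = " ".join(repr(float(v)) for v in t)
    return [scene_id, im_id, obj_id, score, R_str, t_str, time]


def write_results(rows, fh):
    """Write one pose estimate per line to an open text file handle."""
    writer = csv.writer(fh, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Usage: write a single estimate to an in-memory buffer and print it.
buf = io.StringIO()
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
write_results([format_row(1, 3, 5, 0.9, identity, [0, 0, 500], 0.25)], buf)
print(buf.getvalue(), end="")
```

Because the elements of `R` and `t` are separated by spaces rather than commas, each of them occupies a single CSV field and needs no quoting.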

$\mathbf{P} = \mathbf{K} [\mathbf{R} \, \mathbf{t}]$ is the camera matrix which transforms 3D point $\mathbf{x}_m = [x, y, z, 1]'$ in the model coordinates to 2D point $\mathbf{x}_i = [u, v, 1]'$ in the image coordinates: $s\mathbf{x}_i = \mathbf{P} \mathbf{x}_m$. The camera coordinate system is defined as in OpenCV with the camera looking along the $Z$ axis. Camera intrinsic matrix $\mathbf{K}$ is provided with the test images and might be different for each image.
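
The projection can be checked numerically with a short NumPy sketch. The function names are hypothetical and the intrinsics in the usage comment are made-up values, not taken from any of the datasets.

```python
import numpy as np


def camera_matrix(K, R, t):
    """Build the 3x4 matrix P = K [R t]."""
    Rt = np.hstack([np.asarray(R, dtype=float), np.asarray(t, dtype=float).reshape(3, 1)])
    return np.asarray(K, dtype=float) @ Rt


def project_point(P, x_model):
    """Map a 3D model point to pixel coordinates (u, v) via s * x_i = P * x_m."""
    x_h = np.append(np.asarray(x_model, dtype=float), 1.0)  # homogeneous [x, y, z, 1]
    s_x = P @ x_h
    return s_x[:2] / s_x[2]  # divide by the scale s


# Usage with made-up intrinsics (focal length 500 px, principal point 320, 240):
# K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]
# project_point(camera_matrix(K, np.eye(3), [0, 0, 1000]), [10, 20, 0])
```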

Example results can be found here.

After the results are evaluated, the authors can decide whether to make the evaluation scores visible to the public. To be considered for the awards, authors need to provide an implementation of the method (source code or a binary file with instructions) which will be validated. For the results to be included in a publication about the challenge, documentation of the method, including the specifications of the computer used, needs to be provided through the online submission form. Without the documentation, the scores will be listed on the website but will not be considered for inclusion in the publication.

**Tomáš Hodaň**, Czech Technical University in Prague

**Eric Brachmann**, Heidelberg University

**Bertram Drost**, MVTec

**Frank Michel**, Technical University Dresden

**Martin Sundermeyer**, DLR German Aerospace Center

**Jiří Matas**, Czech Technical University in Prague

**Carsten Rother**, Heidelberg University

[1] Hodaň, Michel et al.: BOP: Benchmark for 6D Object Pose Estimation, ECCV'18.

[2] Hodaň et al.: On Evaluation of 6D Object Pose Estimation, ECCVW'16.

[3] Drost et al.: Introducing MVTec ITODD - A Dataset for 3D Object Recognition in Industry, ICCVW'17.

[4] Brachmann et al.: Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image, CVPR'16.

[5] Hinterstoisser et al.: Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes, ACCV'12.

[6] Newcombe et al.: KinectFusion: Real-time dense surface mapping and tracking, ISMAR'11.