Tasks

Model-based tasks on seen objects:

  • Model-based 6D localization of seen objects (ModelBased-6DLoc-Seen)
  • Model-based 6D detection of seen objects (ModelBased-6DDet-Seen)
  • Model-based 2D detection of seen objects (ModelBased-2DDet-Seen)
  • Model-based 2D segmentation of seen objects (ModelBased-2DSeg-Seen)

Model-based tasks on unseen objects:

  • Model-based 6D localization of unseen objects (ModelBased-6DLoc-Unseen)
  • Model-based 6D detection of unseen objects (ModelBased-6DDet-Unseen)
  • Model-based 2D detection of unseen objects (ModelBased-2DDet-Unseen)
  • Model-based 2D segmentation of unseen objects (ModelBased-2DSeg-Unseen)

Model-free tasks on unseen objects:

  • Model-free 6D detection of unseen objects (ModelFree-6DDet-Unseen)
  • Model-free 2D detection of unseen objects (ModelFree-2DDet-Unseen)


Model-based 6D localization of seen objects (ModelBased-6DLoc-Seen)

Used in the 2019, 2020, 2022, and 2023 challenges. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing objects annotated with ground-truth 6D poses, and 3D mesh models of the objects (typically with a color texture). A 6D pose is defined by a matrix $\mathbf{P} = [\mathbf{R} \, | \, \mathbf{t}]$, where $\mathbf{R}$ is a 3D rotation matrix, and $\mathbf{t}$ is a 3D translation vector. The matrix $\mathbf{P}$ defines a rigid transformation from the 3D space of the object model to the 3D space of the camera.
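For illustration, the sketch below shows how such a pose maps points from the model frame into the camera frame (a minimal numpy example; the function name is ours, and the millimeter scale follows the convention of the BOP 3D models):

```python
import numpy as np

def transform_to_camera(model_points, R, t):
    """Map 3D points from the object-model frame to the camera frame via P = [R | t].

    model_points: (N, 3) array of points on the object model (e.g. mesh vertices).
    R:            (3, 3) rotation matrix.
    t:            (3,) or (3, 1) translation vector.
    Returns an (N, 3) array of points in the camera coordinate system.
    """
    t = np.asarray(t, dtype=float).reshape(1, 3)
    return model_points @ np.asarray(R, dtype=float).T + t

# Example: identity rotation and a translation of 1000 mm along the camera z-axis
# places the model origin 1 m in front of the camera (BOP models are in millimeters).
R = np.eye(3)
t = np.array([0.0, 0.0, 1000.0])
print(transform_to_camera(np.zeros((1, 3)), R, t))  # -> [[0. 0. 1000.]]
```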

Test input: At test time, the method is given an RGB-D image unseen during training and a list $L = [(o_1, n_1), \dots, (o_m, n_m)]$, where $n_i$ is the number of instances of an object $o_i$ that are visible in the image. The method can use default detections (results of GDRNPPDet_PBRReal, the best 2D detection method from 2022 for ModelBased-2DDet-Seen).

Test output: The method produces a list $E = [E_1, \dots, E_m]$, where $E_i$ is a list of $n_i$ pose estimates with confidences for instances of object $o_i$.
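Conceptually, the output is a nested list of (rotation, translation, confidence) triplets, one inner list per requested object. The sketch below only illustrates this structure (class and field names are ours); for submission, the estimates are serialized in the CSV format expected by the BOP evaluation system.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PoseEstimate:
    """One pose hypothesis for a single object instance (illustrative structure)."""
    R: np.ndarray      # (3, 3) rotation matrix
    t: np.ndarray      # (3,) translation vector
    confidence: float  # confidence/score of this hypothesis

# E = [E_1, ..., E_m]: for each requested object o_i, a list of n_i estimates.
# Here, two instances of o_1 and one instance of o_2 were requested.
E = [
    [PoseEstimate(np.eye(3), np.array([0.0, 0.0, 800.0]), 0.9),
     PoseEstimate(np.eye(3), np.array([50.0, 0.0, 900.0]), 0.7)],
    [PoseEstimate(np.eye(3), np.array([-30.0, 10.0, 700.0]), 0.8)],
]
```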

Evaluation methodology: The error of an estimated pose w.r.t. the ground-truth pose is calculated by three pose-error functions (see Section 2.2 in the BOP 2020 paper):

  • VSD (Visible Surface Discrepancy), which treats indistinguishable poses as equivalent by considering only the visible object part.
  • MSSD (Maximum Symmetry-Aware Surface Distance), which considers a set of pre-identified global object symmetries and measures the surface deviation in 3D.
  • MSPD (Maximum Symmetry-Aware Projection Distance), which considers the object symmetries and measures the perceivable deviation in the 2D projection.
An estimated pose is considered correct w.r.t. a pose-error function $e$, if $e < \theta_e$, where $e \in \{\text{VSD}, \text{MSSD}, \text{MSPD}\}$ and $\theta_e$ is the threshold of correctness. The fraction of annotated object instances for which a correct pose is estimated is referred to as Recall. The Average Recall w.r.t. a function $e$, denoted as $\text{AR}_e$, is defined as the average of the Recall rates calculated for multiple settings of the threshold $\theta_e$ and also for multiple settings of a misalignment tolerance $\tau$ in the case of $\text{VSD}$. The accuracy of a method on a dataset $D$ is measured by: $\text{AR}_D = (\text{AR}_\text{VSD} + \text{AR}_{\text{MSSD}} + \text{AR}_{\text{MSPD}}) \, / \, 3$, which is calculated over estimated poses of all objects from $D$. The overall accuracy on the core datasets is measured by $\text{AR}_C$ defined as the average of the per-dataset $\text{AR}_D$ scores. See Section 2.4 in the BOP 2020 paper for details.
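The aggregation itself is straightforward to reproduce. Below is a minimal sketch with made-up error values and, for MSSD/MSPD, purely illustrative thresholds (in the challenge, the errors and thresholds come from the bop_toolkit, which derives the MSSD thresholds from the object size and the MSPD thresholds from the image resolution):

```python
import numpy as np

def recall(errors, theta):
    """Fraction of annotated object instances whose pose error is below theta."""
    errors = np.asarray(errors, dtype=float)  # one error per annotated instance
    return float(np.mean(errors < theta)) if errors.size else 0.0

def average_recall(errors, thresholds):
    """AR_e: Recall averaged over several correctness thresholds theta_e.
    For VSD, the same averaging is additionally done over tolerances tau."""
    return float(np.mean([recall(errors, th) for th in thresholds]))

# Made-up per-instance errors for one dataset (normally computed by the bop_toolkit).
vsd_errors  = [0.12, 0.30, 0.55]
mssd_errors = [8.0, 25.0, 60.0]   # mm
mspd_errors = [4.0, 12.0, 35.0]   # px

ar_vsd  = average_recall(vsd_errors,  np.arange(0.05, 0.51, 0.05))
ar_mssd = average_recall(mssd_errors, np.linspace(5.0, 50.0, 10))  # illustrative thresholds
ar_mspd = average_recall(mspd_errors, np.linspace(5.0, 50.0, 10))  # illustrative thresholds

ar_d = (ar_vsd + ar_mssd + ar_mspd) / 3.0  # AR_D for this dataset
# AR_C is then the plain average of the AR_D scores over the core datasets.
```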

Model-based 6D detection of seen objects (ModelBased-6DDet-Seen)

Used in the 2024 challenge. Can be evaluated on BOP-Classic and BOP-H3 datasets.

Training input: As in ModelBased-6DLoc-Seen.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided. The method can use default detections (results of GDRNPPDet_PBRReal, the best 2D detection method from 2022 for ModelBased-2DDet-Seen).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: Similar to the evaluation methodology from the COCO 2020 Object Detection Challenge used for 2D detection, the 6D detection accuracy is measured by the Average Precision (AP). This metric is calculated using the following two pose-error functions (see Section 2.2 in the BOP 2020 paper):

  • MSSD (Maximum Symmetry-Aware Surface Distance), which considers a set of pre-identified global object symmetries and measures the surface deviation in 3D.
  • MSPD (Maximum Symmetry-Aware Projection Distance), which considers the object symmetries and measures the perceivable deviation in the 2D projection.
Specifically, for each pose-error function $e \in \{\text{MSSD}, \text{MSPD}\}$, a per-object score $\text{AP}_{e,o}$ is calculated by averaging the precision over multiple correctness thresholds $\theta_e$. The accuracy of a method on a dataset $D$ under the pose-error function $e$, denoted $\text{AP}_{e,D}$, is calculated by averaging the per-object $\text{AP}_{e,o}$ scores. The overall per-dataset accuracy is defined as $\text{AP}_{D} = (\text{AP}_{\text{MSSD},D} + \text{AP}_{\text{MSPD},D}) \, / \, 2$, and the overall accuracy on the core datasets is measured by $\text{AP}_C$, defined as the average of the per-dataset $\text{AP}_{D}$ scores.
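The score aggregation can be sketched as follows, with made-up precision values (matching 6D detections to ground-truth instances and the exact threshold sweep are handled by the bop_toolkit):

```python
import numpy as np

def ap_per_object(precisions_at_thresholds):
    """AP_{e,o}: precision averaged over the correctness thresholds theta_e."""
    return float(np.mean(precisions_at_thresholds))

def ap_for_dataset(per_object_precisions):
    """AP_{e,D}: average of the per-object AP_{e,o} scores on one dataset."""
    return float(np.mean([ap_per_object(p) for p in per_object_precisions]))

# Made-up precision values at a sweep of thresholds theta_e, keyed by object id.
mssd_prec = {1: [0.9, 0.8, 0.6], 2: [0.7, 0.6, 0.5]}
mspd_prec = {1: [0.95, 0.85, 0.7], 2: [0.8, 0.7, 0.6]}

ap_mssd_d = ap_for_dataset(list(mssd_prec.values()))
ap_mspd_d = ap_for_dataset(list(mspd_prec.values()))
ap_d = (ap_mssd_d + ap_mspd_d) / 2.0  # AP_D for this dataset
# AP_C is then the average of the AP_D scores over the core datasets.
```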

Model-based 2D detection of seen objects (ModelBased-2DDet-Seen)

Used in the 2022 and 2023 challenges. Can be evaluated on BOP-Classic and BOP-H3 datasets.

Training input: At training time, a method is provided a set of training images showing objects that are annotated with ground-truth 2D bounding boxes. The boxes are amodal, i.e., covering the whole object silhouette including the occluded parts. The method can also use 3D mesh models that are available for the objects.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided.

Test output: The method produces a list of object detections with confidences, with each detection defined by an amodal 2D bounding box.

Evaluation methodology: Following the evaluation methodology from the COCO 2020 Object Detection Challenge, the detection accuracy is measured by the Average Precision (AP). Specifically, a per-object $\text{AP}_O$ score is calculated by averaging the precision at multiple Intersection over Union (IoU) thresholds: $[0.5, 0.55, \dots, 0.95]$. The accuracy of a method on a dataset $D$ is measured by $\text{AP}_D$, calculated by averaging the per-object $\text{AP}_O$ scores, and the overall accuracy on the core datasets is measured by $\text{AP}_C$, defined as the average of the per-dataset $\text{AP}_D$ scores. Predictions that match annotated object instances which are less than 10% visible (and therefore not considered in the evaluation) are filtered out and not counted as false positives. Up to 100 predictions per image (those with the highest confidence scores) are considered.
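The sketch below illustrates the box IoU and the filtering rules mentioned above (a much-simplified view; the official scores are produced by the COCO-style evaluation in the bop_toolkit):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two amodal boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

IOU_THRESHOLDS = np.arange(0.50, 0.951, 0.05)  # 0.50, 0.55, ..., 0.95
MAX_DETS_PER_IMAGE = 100                       # highest-confidence predictions kept

def keep_top_detections(dets):
    """Keep at most 100 predictions per image, ranked by confidence score."""
    return sorted(dets, key=lambda d: -d["score"])[:MAX_DETS_PER_IMAGE]

# Ground-truth instances that are less than 10% visible are excluded from the
# evaluation; predictions matching such instances are ignored rather than
# counted as false positives.
```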

Model-based 2D segmentation of seen objects (ModelBased-2DSeg-Seen)

Used in the 2022 and 2023 challenges. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of training images showing objects that are annotated with ground-truth 2D binary masks. The masks are modal, i.e., covering only the visible object part. The method can also use 3D mesh models that are available for the objects.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided.

Test output: The method produces a list of object segmentations with confidences, with each segmentation defined by a modal 2D binary mask.

Evaluation methodology: As in ModelBased-2DDet-Seen, with the only difference being that IoU is calculated on binary masks instead of bounding boxes.
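For illustration, the mask-based IoU can be computed as below (a minimal numpy sketch over dense binary masks; the function name is ours):

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU of two modal binary masks of the same image size."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union > 0 else 0.0
```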

Model-based 6D localization of unseen objects (ModelBased-6DLoc-Unseen)

Used in the 2023 challenge. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing training objects annotated with ground-truth 6D poses, and 3D mesh models of these training objects (typically with a color texture). The 6D object pose is defined as in ModelBased-6DLoc-Seen.

Object-onboarding input: The method is provided 3D mesh models of test objects that were not seen during training. To onboard each object (e.g. to render images/templates or fine-tune a neural network), the method can spend up to 5 minutes of wall-clock time on a single computer with up to one GPU. The time is measured from the point right after the raw data (e.g. 3D mesh models) is loaded to the point when the object is onboarded. The method can use a subset of the BlenderProc images (links "PBR-BlenderProc4BOP training images") originally provided for the seen-object tasks – the method can use as many images from this set as could be rendered within the limited onboarding time (assume that rendering one image takes 2 seconds; rendering and any additional processing need to fit within the 5 minutes). The method can also render custom images/templates but cannot use any real images of the object during onboarding. The object representation (which may be given by a set of templates, an ML model, etc.) needs to be fixed after onboarding (it cannot be updated on test images).
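As a rough budget check: 5 minutes is 300 seconds, so at the assumed 2 seconds per image at most 150 BlenderProc images could be rendered, and any fine-tuning or template post-processing has to fit in the same budget. A hypothetical timing guard could look like the sketch below (the function names and the reserved margin are ours, not part of the rules):

```python
import time

ONBOARDING_BUDGET_S = 5 * 60  # 5 minutes of wall-clock time per object
RENDER_COST_S = 2.0           # assumed cost of rendering one image

start = time.monotonic()      # the clock starts right after the raw data is loaded

def time_left():
    return ONBOARDING_BUDGET_S - (time.monotonic() - start)

def render_next_template():
    """Placeholder for rendering one image/template of the onboarded object."""
    time.sleep(RENDER_COST_S)  # stand-in for the assumed ~2 s render cost
    return object()

# Render as many templates as the budget allows, keeping ~60 s for the remaining
# onboarding steps (e.g. feature extraction or fine-tuning).
templates = []
while time_left() > RENDER_COST_S + 60:
    templates.append(render_next_template())
```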

Test input: At test time, the method is given an RGB-D image unseen during training and a list $L = [(o_1, n_1), \dots, (o_m, n_m)]$, where $n_i$ is the number of instances of a test object $o_i$ that are visible in the image. The method can use default detections/segmentations (results of CNOS, the best 2D detection method for ModelBased-2DDet-Unseen in 2023).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: As in ModelBased-6DLoc-Seen.

Model-based 6D detection of unseen objects (ModelBased-6DDet-Unseen)

Used in the 2024 challenge. Can be evaluated on BOP-Classic and BOP-H3 datasets.

Training input: As in ModelBased-6DLoc-Unseen.

Object-onboarding input: As in ModelBased-6DLoc-Unseen.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided. The method can use default detections/segmentations (results of CNOS, the best 2D detection method for ModelBased-2DDet-Unseen in 2023).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: As in ModelBased-6DDet-Seen.

Model-based 2D detection of unseen objects (ModelBased-2DDet-Unseen)

Used in the 2023 challenge. Can be evaluated on BOP-Classic and BOP-H3 datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing training objects that are annotated with ground-truth 2D bounding boxes. The boxes are amodal, i.e., covering the whole object silhouette including the occluded parts. The method can also use 3D mesh models that are available for the training objects.

Object-onboarding input: As in ModelBased-6DLoc-Unseen.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of test objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided.

Test output: As in ModelBased-2DDet-Seen.

Evaluation methodology: As in ModelBased-2DDet-Seen.

Model-based 2D segmentation of unseen objects (ModelBased-2DSeg-Unseen)

Used in the 2023 challenge. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing training objects that are annotated with ground-truth 2D binary masks. The masks are modal, i.e., covering only the visible object part. The method can also use 3D mesh models that are available for the objects.

Object-onboarding input: As in ModelBased-6DLoc-Unseen.

Test input: As in ModelBased-2DDet-Unseen.

Test output: As in ModelBased-2DSeg-Seen.

Evaluation methodology: As in ModelBased-2DSeg-Seen.

Model-free 6D detection of unseen objects (ModelFree-6DDet-Unseen)

Used in the 2024 challenge. Can be evaluated on BOP-H3 datasets.

Training input: As in ModelBased-6DLoc-Unseen.

Object-onboarding input: The method is provided reference video(s) of test objects that were not seen during training. 3D models of the test objects are not available. The method can use only one of the following two types of reference videos:

  • Static onboarding: The object is static (standing on a desk) and the camera moves around the object, capturing all possible object views. Two videos are available, one with the object standing upright and one with the object standing upside-down. Object masks and 6D object poses are available for all video frames, which may be useful for 3D object reconstruction methods such as NeRF or Gaussian Splatting.
  • Dynamic onboarding: The object is manipulated by hands and the camera is either static (on a tripod) or dynamic (on a head-mounted device). Object masks for all video frames and the 6D object pose for the first frame are available, which may be useful for 3D object reconstruction using methods such as BundleSDF or Hampali et al.
To onboard each object (e.g. to reconstruct a 3D model, render novel views, or fine-tune a neural network), the method can spend up to 5 minutes of wall-clock time on a single computer with up to one GPU. The time is measured from the point right after the raw data (e.g. the reference videos) is loaded to the point when the object is onboarded. The object representation (which may be given by a set of templates, neural radiance fields, 3D Gaussians, etc.) needs to be fixed after onboarding (it cannot be updated on test images).
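A hypothetical container for these onboarding inputs could look like the sketch below (field names are illustrative, not part of the official data format; the key difference between the two regimes is which frames carry a ground-truth pose):

```python
import numpy as np
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OnboardingFrame:
    """One frame of a reference video (illustrative structure)."""
    rgb: np.ndarray              # (H, W, 3) color image
    depth: Optional[np.ndarray]  # (H, W) depth map, if provided
    mask: np.ndarray             # (H, W) binary object mask (available for all frames)
    pose: Optional[np.ndarray]   # (4, 4) object-to-camera pose: every frame for
                                 # static onboarding, only the first frame for
                                 # dynamic onboarding

@dataclass
class OnboardingSequence:
    kind: str                    # "static" or "dynamic"
    frames: List[OnboardingFrame]
```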

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of test objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided. The method can use default detections/segmentations (results of CNOS, the best 2D detection method for model-based unseen objects but adapted to model-free unseen objects by replacing templates rendered from CAD models with reference images from onboarding video(s)).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: As in ModelBased-6DDet-Seen.

Model-free 2D detection of unseen objects (ModelFree-2DDet-Unseen)

Used in the 2024 challenge. Can be evaluated on BOP-H3 datasets.

Training input: As in ModelBased-2DDet-Unseen.

Object-onboarding input: As in ModelFree-6DDet-Unseen.

Test input: As in ModelBased-2DDet-Unseen.

Test output: As in ModelBased-2DDet-Seen.

Evaluation methodology: As in ModelBased-2DDet-Seen.