Tasks

Model-based tasks on seen objects: ModelBased-6DLoc-Seen, ModelBased-6DDet-Seen, ModelBased-2DDet-Seen, ModelBased-2DSeg-Seen

Model-based tasks on unseen objects: ModelBased-6DLoc-Unseen, ModelBased-6DDet-Unseen, ModelBased-2DDet-Unseen, ModelBased-2DSeg-Unseen

Model-free tasks on unseen objects: ModelFree-6DDet-Unseen, ModelFree-2DDet-Unseen


Model-based 6D localization of seen objects (ModelBased-6DLoc-Seen)

Used in the 2019, 2020, 2022, and 2023 challenges. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing objects annotated with ground-truth 6D poses, and 3D mesh models of the objects (typically with a color texture). A 6D pose is defined by a matrix $\textbf{P} = [\mathbf{R} \, | \, \mathbf{t}]$, where $\mathbf{R}$ is a 3D rotation matrix, and $\mathbf{t}$ is a 3D translation vector. The matrix $\textbf{P}$ defines a rigid transformation from the 3D space of the object model to the 3D space of the camera.
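
As an illustration of this definition, the following minimal sketch (assuming NumPy; the rotation, translation, and point coordinates are made up) applies a pose $\textbf{P} = [\mathbf{R} \, | \, \mathbf{t}]$ to points of an object model to express them in the camera frame:

```python
import numpy as np

# Hypothetical example: map 3D points from the object-model frame to the
# camera frame via X_camera = R @ X_model + t (units follow the object model).
R = np.eye(3)                             # 3x3 rotation matrix (placeholder)
t = np.array([[10.0], [0.0], [500.0]])    # 3x1 translation vector (placeholder)

model_points = np.array([                 # Nx3 points sampled on the model
    [0.0, 0.0, 0.0],
    [20.0, 0.0, 0.0],
    [0.0, 20.0, 0.0],
])

camera_points = (R @ model_points.T + t).T  # Nx3 points in the camera frame
print(camera_points)
```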

Test input: At test time, the method is given one or multiple (for multi-view setup) RGB-D image(s) unseen during training and a list $L = [(o_1, n_1),$ $\dots,$ $(o_m, n_m)]$, where $n_i$ is the number of instances of an object $o_i$ that are visible in the image. The method can use default detections (results of GDRNPPDet_PBRReal, the best 2D detection method from 2022 for ModelBased-2DDet-Seen).

Test output: The method produces a list $E=[E_1,$$\dots,$$E_m]$, where $E_i$ is a list of $n_i$ pose estimates with confidences for instances of object $o_i$.
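
Purely for illustration, the input list $L$ and the output list $E$ could be represented as follows (the object IDs, poses, and confidences are made up; this is not a prescribed data format):

```python
import numpy as np

# L: for each target object o_i, the number n_i of its instances visible in the image.
L = [(1, 2),   # object 1 is visible twice
     (5, 1)]   # object 5 is visible once

# E: for each object o_i, a list of n_i pose estimates with confidences.
# Each estimate is a (R, t, confidence) triple; R is 3x3, t is 3x1.
E = [
    [(np.eye(3), np.array([[0.0], [0.0], [400.0]]), 0.92),
     (np.eye(3), np.array([[50.0], [0.0], [450.0]]), 0.81)],   # object 1
    [(np.eye(3), np.array([[-30.0], [10.0], [600.0]]), 0.88)],  # object 5
]
```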

Evaluation methodology: The error of an estimated pose w.r.t. the ground-truth pose is calculated by three pose-error functions (see Section 2.2 in the BOP 2020 paper):

  • VSD (Visible Surface Discrepancy), which treats indistinguishable poses as equivalent by considering only the visible object part.
  • MSSD (Maximum Symmetry-Aware Surface Distance), which considers a set of pre-identified global object symmetries and measures the surface deviation in 3D.
  • MSPD (Maximum Symmetry-Aware Projection Distance), which considers the object symmetries and measures the perceivable deviation.
An estimated pose is considered correct w.r.t. a pose-error function $e$, if $e < \theta_e$, where $e \in \{\text{VSD}, \text{MSSD}, \text{MSPD}\}$ and $\theta_e$ is the threshold of correctness. The fraction of annotated object instances for which a correct pose is estimated is referred to as Recall. The Average Recall w.r.t. a function $e$, denoted as $\text{AR}_e$, is defined as the average of the Recall rates calculated for multiple settings of the threshold $\theta_e$ and also for multiple settings of a misalignment tolerance $\tau$ in the case of $\text{VSD}$. The accuracy of a method on a dataset $D$ is measured by: $\text{AR}_D = (\text{AR}_\text{VSD} + \text{AR}_{\text{MSSD}} + \text{AR}_{\text{MSPD}}) \, / \, 3$, which is calculated over estimated poses of all objects from $D$. The overall accuracy on the core datasets is measured by $\text{AR}_C$ defined as the average of the per-dataset $\text{AR}_D$ scores. See Section 2.4 in the BOP 2020 paper for details.
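
A minimal sketch of this aggregation, assuming the per-instance pose errors have already been computed (the helper names are illustrative, and the extra averaging over the VSD tolerance $\tau$ is omitted for brevity):

```python
import numpy as np

def recall(errors, theta):
    # Fraction of annotated object instances with a correct pose (error < theta).
    return float(np.mean(np.asarray(errors) < theta))

def average_recall(errors, thetas):
    # AR_e: average of the Recall rates over multiple correctness thresholds theta_e.
    return float(np.mean([recall(errors, theta) for theta in thetas]))

def ar_dataset(ar_vsd, ar_mssd, ar_mspd):
    # AR_D = (AR_VSD + AR_MSSD + AR_MSPD) / 3
    return (ar_vsd + ar_mssd + ar_mspd) / 3.0

def ar_core(ar_per_dataset):
    # AR_C: average of the per-dataset AR_D scores over the core datasets.
    return float(np.mean(ar_per_dataset))
```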

Model-based 6D detection of seen objects (ModelBased-6DDet-Seen)

Used in the 2025 challenge. Can be evaluated on BOP-Classic, BOP-H3, and BOP-Industrial datasets.

Training input: As in ModelBased-6DLoc-Seen.

Test input: At test time, the method is given one or multiple (for multi-view setup) RGB-D image(s) unseen during training that show an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided. The method can use default detections (results of GDRNPPDet_PBRReal, the best 2D detection method from 2022 for ModelBased-2DDet-Seen).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: Similarly to the evaluation methodology from the COCO 2020 Object Detection Challenge that is used for 2D detection, the 6D detection accuracy is measured by the Average Precision (AP). The metric is calculated using the following two pose-error functions (see Section 2.2 in the BOP 2020 paper):

  • MSSD (Maximum Symmetry-Aware Surface Distance), which considers a set of pre-identified global object symmetries and measures the surface deviation in 3D.
  • MSPD (Maximum Symmetry-Aware Projection Distance), which considers the object symmetries and measures the perceivable deviation.
Specifically, for each pose-error function $e \in \{\text{MSSD}, \text{MSPD}\}$, the per-object $\text{AP}_{e,o}$ score is calculated by averaging the precision at multiple thresholds of correctness $\theta_e$. The accuracy of a method on a dataset $D$ under the pose-error function $e$, denoted $\text{AP}_{e,D}$, is calculated by averaging the per-object $\text{AP}_{e,o}$ scores. The overall per-dataset accuracy is defined as $\text{AP}_{D}=(\text{AP}_{\text{MSSD},D}+\text{AP}_{\text{MSPD},D}) \, / \, 2$. The overall accuracy on the core datasets is measured by $\text{AP}_C$, defined as the average of the per-dataset $\text{AP}_{D}$ scores.
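
A hypothetical sketch of how these averages compose, assuming the precision at each correctness threshold has already been computed (the names and data layout are illustrative):

```python
import numpy as np

# prec[e][o] is assumed to hold the precision of a method for object o at each
# correctness threshold theta_e of the pose-error function e ("MSSD" or "MSPD").

def ap_per_object(precisions_at_thresholds):
    # AP_{e,o}: average precision over the correctness thresholds theta_e.
    return float(np.mean(precisions_at_thresholds))

def ap_per_function(prec_e):
    # AP_{e,D}: average of the per-object AP_{e,o} scores over objects of dataset D.
    return float(np.mean([ap_per_object(p) for p in prec_e.values()]))

def ap_dataset(prec):
    # AP_D = (AP_{MSSD,D} + AP_{MSPD,D}) / 2
    return (ap_per_function(prec["MSSD"]) + ap_per_function(prec["MSPD"])) / 2.0

def ap_core(ap_per_dataset):
    # AP_C: average of the per-dataset AP_D scores over the core datasets.
    return float(np.mean(ap_per_dataset))
```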

Model-based 2D detection of seen objects (ModelBased-2DDet-Seen)

Used in the 2022, 2023, and 2025 challenges. Can be evaluated on BOP-Classic, BOP-H3, and BOP-Industrial datasets.

Training input: At training time, a method is provided a set of training images showing objects that are annotated with ground-truth 2D bounding boxes. The boxes are amodal, i.e., covering the whole object silhouette including the occluded parts. The method can also use 3D mesh models that are available for the objects.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided.

Test output: The method produces a list of object detections with confidences, with each detection defined by an amodal 2D bounding box.

Evaluation methodology: Following the evaluation methodology from the COCO 2020 Object Detection Challenge, the detection accuracy is measured by the Average Precision (AP). Specifically, a per-object $\text{AP}_O$ score is calculated by averaging the precision at multiple Intersection over Union (IoU) thresholds: $[0.5, 0.55, \dots , 0.95]$. The accuracy of a method on a dataset $D$ is measured by $\text{AP}_D$, calculated by averaging the per-object $\text{AP}_O$ scores, and the overall accuracy on the core datasets is measured by $\text{AP}_C$, defined as the average of the per-dataset $\text{AP}_D$ scores. Predictions of annotated object instances that are less than 10% visible (such instances are not considered in the evaluation) are filtered out and not counted as false positives. Up to 100 predictions per image (with the highest confidence scores) are considered.
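
For reference, a minimal sketch of the box IoU used by this protocol (boxes are assumed to be given as [x_min, y_min, x_max, y_max]; the function name is illustrative):

```python
def box_iou(a, b):
    # a, b: amodal 2D bounding boxes as [x_min, y_min, x_max, y_max].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A detection is matched to a ground-truth box at threshold t if IoU >= t,
# for t in [0.5, 0.55, ..., 0.95].
thresholds = [0.5 + 0.05 * i for i in range(10)]
```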

Model-based 2D segmentation of seen objects (ModelBased-2DSeg-Seen)

Used in the 2022 and 2023 challenges. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of training images showing objects that are annotated with ground-truth 2D binary masks. The masks are modal, i.e., covering only the visible object part. The method can also use 3D mesh models that are available for the objects.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided.

Test output: The method produces a list of object segmentations with confidences, with each segmentation defined by a modal 2D binary mask.

Evaluation methodology: As in ModelBased-2DDet-Seen, with the only difference being that IoU is calculated on binary masks instead of bounding boxes.
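
A minimal sketch of the mask-based IoU, assuming modal masks are given as boolean NumPy arrays of the same image size:

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    # pred_mask, gt_mask: modal binary masks (bool arrays of identical shape).
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0
```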

Model-based 6D localization of unseen objects (ModelBased-6DLoc-Unseen)

Used in the 2023, 2024, and 2025 challenges. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing training objects annotated with ground-truth 6D poses, and 3D mesh models of the objects (typically with a color texture). The 6D object pose is defined as in ModelBased-6DLoc-Seen. The method can use 3D mesh models that are available for the training objects.

Object-onboarding input: The method is provided 3D mesh models of test objects that were not seen during training. To onboard each object (e.g. to render images/templates or fine-tune a neural network), the method can spend up to 5 minutes of wall-clock time on a single computer with up to one GPU. The time is measured from the point right after the raw data (e.g. 3D mesh models) is loaded to the point when the object is onboarded. The method can use a subset of the BlenderProc images (links "PBR-BlenderProc4BOP training images") originally provided for Tasks 1–3 – the method can use as many images from this set as could be rendered within the limited onboarding time (consider that rendering of one image takes 2 seconds; rendering and any additional processing need to fit within 5 minutes). The method can also render custom images/templates but cannot use any real images of the object in the onboarding stage. The object representation (which may be given by a set of templates, an ML model, etc.) needs to be fixed after onboarding (it cannot be updated on test images).
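
A hypothetical sketch of keeping onboarding within this budget (the 2-second per-image figure comes from the rule above; `render_template` is a made-up stand-in for whatever rendering or fine-tuning a method performs):

```python
import time

ONBOARDING_BUDGET_S = 5 * 60      # 5 minutes of wall-clock time per object
RENDER_TIME_PER_IMAGE_S = 2       # assumed cost of rendering one image

def onboard_object(mesh_path, render_template):
    # Timer starts after the raw data (e.g. the 3D mesh) has been loaded.
    start = time.monotonic()
    templates = []
    while time.monotonic() - start < ONBOARDING_BUDGET_S - RENDER_TIME_PER_IMAGE_S:
        templates.append(render_template(mesh_path))
        if len(templates) >= 150:  # 150 images * 2 s ~= the full 300 s budget
            break
    # The returned representation is fixed and cannot be updated on test images.
    return templates
```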

Test input: At test time, the method is given one or multiple (for multi-view setup) RGB-D image(s) unseen during training and a list $L = [(o_1, n_1),$ $\dots,$ $(o_m, n_m)]$, where $n_i$ is the number of instances of a test object $o_i$ that are visible in the image. The method can use default detections/segmentations (results of CNOS, the best 2D detection method for ModelBased-2DDet-Unseen in 2023).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: As in ModelBased-6DLoc-Seen.

Model-based 6D detection of unseen objects (ModelBased-6DDet-Unseen)

Used in the 2024 and 2025 challenges. Can be evaluated on BOP-Classic, BOP-H3, and BOP-Industrial datasets.

Training input: As in ModelBased-6DLoc-Unseen.

Object-onboarding input: As in ModelBased-6DLoc-Unseen.

Test input: At test time, the method is given one or multiple RGB-D image(s) unseen during training that show an arbitrary number of instances of an arbitrary number of objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided. The method can use default detections/segmentations (results of CNOS, the best 2D detection method for ModelBased-2DDet-Unseen in 2023).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: As in ModelBased-6DDet-Seen.

Model-based 2D detection of unseen objects (ModelBased-2DDet-Unseen)

Used in the 2023, 2024, and 2025 challenges. Can be evaluated on BOP-Classic, BOP-H3, and BOP-Industrial datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing training objects that are annotated with ground-truth 2D bounding boxes. The boxes are amodal, i.e., covering the whole object silhouette including the occluded parts. The method can also use 3D mesh models that are available for the training objects.

Object-onboarding input: As in ModelBased-6DLoc-Unseen.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of test objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided.

Test output: As in ModelBased-2DDet-Seen.

Evaluation methodology: As in ModelBased-2DDet-Seen.

Model-based 2D segmentation of unseen objects (ModelBased-2DSeg-Unseen)

Used in the 2023 challenge. Can be evaluated on BOP-Classic datasets.

Training input: At training time, a method is provided a set of RGB-D training images showing training objects that are annotated with ground-truth 2D binary masks. The masks are modal, i.e., covering only the visible object part. The method can also use 3D mesh models that are available for the objects.

Object-onboarding input: As in ModelBased-6DLoc-Unseen.

Test input: As in ModelBased-2DDet-Unseen.

Test output: As in ModelBased-2DSeg-Seen.

Evaluation methodology: As in ModelBased-2DSeg-Seen.

Model-free 6D detection of unseen objects (ModelFree-6DDet-Unseen)

Used in the 2024 and 2025 challenges. Can be evaluated on BOP-H3 datasets.

Training input: As in ModelBased-6DLoc-Unseen.

Object-onboarding input: The method is provided reference video(s) of test objects that were not seen during training. 3D models of test objects are not available. A new object needs to be onboarded in at most 5 minutes on a single GPU from one of the following two types of reference videos:

  • Static onboarding: The object is static (standing on a desk) and the camera is moving around the object and capturing all possible object views. Two videos are available, one with the object standing upright and one with the object standing upside-down. This type of onboarding video is useful for 3D object reconstruction by methods such as Gaussian Splatting. Ground-truth object poses are available for all video frames (as they could be relatively easily obtained with tools like COLMAP anyway).
  • Dynamic onboarding: The object is manipulated by hands and the camera is either static (on a tripod) or dynamic (on a head-mounted device). This type of onboarding video is useful for 3D object reconstruction by methods such as BundleSDF or Hampali et al. Ground-truth object poses are available only for the first frame to simulate a real-world setup (at least one ground-truth pose needs to be provided to define the object coordinate system, which is necessary for evaluation of object pose estimates). Compared to the static onboarding setup, the dynamic onboarding setup is more challenging but more natural for AR/VR applications.

Test input: At test time, the method is given an RGB-D image unseen during training that shows an arbitrary number of instances of an arbitrary number of test objects, with all objects being from one specified dataset (e.g. YCB-V). No prior information about the visible object instances is provided. The method can use default detections/segmentations (results of CNOS, the best 2D detection method for ModelBased-2DDet-Unseen in 2023, adapted to the model-free setup by replacing templates rendered from CAD models with reference images from onboarding videos).

Test output: As in ModelBased-6DLoc-Seen.

Evaluation methodology: As in ModelBased-6DDet-Seen.

Model-free 2D detection of unseen objects (ModelFree-2DDet-Unseen)

Used in the 2024 and 2025 challenges. Can be evaluated on BOP-H3 datasets.

Training input: As in ModelBased-2DDet-Unseen.

Object-onboarding input: As in ModelFree-6DDet-Unseen.

Test input: As in ModelBased-2DDet-Unseen.

Test output: As in ModelBased-2DDet-Seen.

Evaluation methodology: As in ModelBased-2DDet-Seen.