Use Custom Datasets

This document explains how the dataset APIs (DatasetCatalog, MetadataCatalog) work, and how to use them to add custom datasets.

Datasets that have builtin support in detectron2 are listed in builtin datasets. If you want to use a custom dataset while also reusing detectron2’s data loaders, you will need to:

  1. Register your dataset (i.e., tell detectron2 how to obtain your dataset).

  2. Optionally, register metadata for your dataset.

Next, we explain the above two concepts in detail.

The Colab tutorial has a live example of how to register and train on a dataset of custom formats.

Register a Dataset

To let detectron2 know how to obtain a dataset named “my_dataset”, implement a function that returns the items in your dataset, then tell detectron2 about this function:

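from typing import Dict, List

from detectron2.data import DatasetCatalog

def my_dataset_function() -> List[Dict]:
    dataset_dicts: List[Dict] = []  # fill with one dict per image, in the format described below
    return dataset_dicts

DatasetCatalog.register("my_dataset", my_dataset_function)
# later, to access the data:
data: List[Dict] = DatasetCatalog.get("my_dataset")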

Here, the snippet associates a dataset named “my_dataset” with a function that returns the data. The function must return the same data (in the same order) each time it is called. The registration stays in effect until the process exits.

The function can do arbitrary things, but it must return the data as list[dict], with each dict in one of the following formats:

  1. Detectron2’s standard dataset dict, described below. This will make it work with many other builtin features in detectron2, so it’s recommended to use it when it’s sufficient.

  2. Any custom format. You can also return arbitrary dicts in your own format, such as adding extra keys for new tasks. Then you will need to handle them properly downstream as well. See below for more details.
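
As a minimal sketch of the registration flow, the following dataset function builds one dict per image from an in-memory listing. The image paths, sizes, and the names `_IMAGES` and `register_my_dataset` are hypothetical. The detectron2 import is deferred into the helper so that the dataset function itself can be checked without detectron2 installed.

```python
from typing import Dict, List

# Hypothetical image listing; a real function would typically parse annotation files.
_IMAGES = [
    ("images/0001.jpg", 480, 640),
    ("images/0002.jpg", 600, 800),
]


def my_dataset_function() -> List[Dict]:
    """Return one dict per image, always in the same order."""
    return [
        {"file_name": path, "height": h, "width": w, "image_id": idx}
        for idx, (path, h, w) in enumerate(_IMAGES)
    ]


def register_my_dataset(name: str = "my_dataset") -> None:
    """Register the function under `name` (requires detectron2)."""
    from detectron2.data import DatasetCatalog

    DatasetCatalog.register(name, my_dataset_function)
```

After calling register_my_dataset() once, DatasetCatalog.get("my_dataset") would call the function and return its list.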

Standard Dataset Dicts

For standard tasks (instance detection, instance/semantic/panoptic segmentation, keypoint detection), we load the original dataset into list[dict] with a specification similar to COCO’s annotations. This is our standard representation for a dataset.

Each dict contains information about one image. The dict may have the following fields, and the required fields vary based on what the dataloader or the task needs (see more below).




Task                              Fields

Common                            file_name, height, width, image_id

Instance detection/segmentation   annotations

Semantic segmentation             sem_seg_file_name

Panoptic segmentation             pan_seg_file_name, segments_info

  • file_name: the full path to the image file.

  • height, width: integer. The shape of the image.

  • image_id (str or int): a unique id that identifies this image. Required by many evaluators to identify the images, but a dataset may use it for different purposes.

  • annotations (list[dict]): Required by instance detection/segmentation or keypoint detection tasks. Each dict corresponds to annotations of one instance in this image, and may contain the following keys:

    • bbox (list[float], required): list of 4 numbers representing the bounding box of the instance.

    • bbox_mode (int, required): the format of bbox. It must be a member of structures.BoxMode. Currently supports: BoxMode.XYXY_ABS, BoxMode.XYWH_ABS.

    • category_id (int, required): an integer in the range [0, num_categories-1] representing the category label. The value num_categories is reserved to represent the “background” category, if applicable.

    • segmentation (list[list[float]] or dict): the segmentation mask of the instance.

      • If list[list[float]], it represents a list of polygons, one for each connected component of the object. Each list[float] is one simple polygon in the format of [x1, y1, ..., xn, yn] (n≥3). The Xs and Ys are absolute coordinates in unit of pixels.

      • If dict, it represents the per-pixel segmentation mask in COCO’s compressed RLE format. The dict should have keys “size” and “counts”. You can convert a uint8 segmentation mask of 0s and 1s into such dict by pycocotools.mask.encode(np.asarray(mask, order="F")). cfg.INPUT.MASK_FORMAT must be set to bitmask if using the default data loader with such format.

    • keypoints (list[float]): in the format of [x1, y1, v1,…, xn, yn, vn]. v[i] means the visibility of this keypoint. n must be equal to the number of keypoint categories. The Xs and Ys are absolute real-value coordinates in range [0, W or H].

      (Note that the keypoint coordinates in COCO format are integers in range [0, W-1 or H-1], which is different from our standard format. Detectron2 adds 0.5 to COCO keypoint coordinates to convert them from discrete pixel indices to floating point coordinates.)

    • iscrowd: 0 (default) or 1. Whether this instance is labeled as COCO’s “crowd region”. Don’t include this field if you don’t know what it means.

    If annotations is an empty list, it means the image is labeled to have no objects. Such images will by default be removed from training, but can be included using DATALOADER.FILTER_EMPTY_ANNOTATIONS.

  • sem_seg_file_name (str): The full path to the semantic segmentation ground truth file. It should be a grayscale image whose pixel values are integer labels.

  • pan_seg_file_name (str): The full path to panoptic segmentation ground truth file. It should be an RGB image whose pixel values are integer ids encoded using the panopticapi.utils.id2rgb function. The ids are defined by segments_info. If an id does not appear in segments_info, the pixel is considered unlabeled and is usually ignored in training & evaluation.

  • segments_info (list[dict]): defines the meaning of each id in panoptic segmentation ground truth. Each dict has the following keys:

    • id (int): integer that appears in the ground truth image.

    • category_id (int): an integer in the range [0, num_categories-1] representing the category label.

    • iscrowd: 0 (default) or 1. Whether this instance is labeled as COCO’s “crowd region”.
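
To make the field list above concrete, here is a sketch that assembles one record for instance detection with a polygon mask and keypoints. The file name, coordinates, and helper names are invented for illustration. The caller passes in `bbox_mode` (for example BoxMode.XYWH_ABS from detectron2.structures), so the sketch itself does not depend on detectron2. The example assumes a dataset with two keypoint categories.

```python
from typing import Any, Dict, List


def coco_keypoints_to_detectron2(keypoints: List[float]) -> List[float]:
    """Shift COCO integer pixel indices by 0.5; visibility flags are unchanged."""
    out = list(keypoints)
    for i in range(0, len(out), 3):
        out[i] += 0.5      # x
        out[i + 1] += 0.5  # y
    return out


def make_record(bbox_mode: Any) -> Dict[str, Any]:
    """Build one image record with a single annotated instance."""
    annotation = {
        "bbox": [100.0, 120.0, 50.0, 80.0],  # x, y, width, height
        "bbox_mode": bbox_mode,              # must describe the layout of "bbox"
        "category_id": 0,                    # in [0, num_categories - 1]
        # One polygon (here a rectangle) as [x1, y1, ..., xn, yn], n >= 3.
        "segmentation": [[100.0, 120.0, 150.0, 120.0, 150.0, 200.0, 100.0, 200.0]],
        # Two keypoints as [x, y, v] triplets, converted from COCO pixel indices.
        "keypoints": coco_keypoints_to_detectron2([110, 130, 2, 140, 130, 1]),
        "iscrowd": 0,
    }
    return {
        "file_name": "images/0001.jpg",
        "height": 480,
        "width": 640,
        "image_id": 1,
        "annotations": [annotation],
    }
```

With detectron2 available, the record would be built as make_record(BoxMode.XYWH_ABS), since the box above is given as x, y, width, height.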


The PanopticFPN model does not use the panoptic segmentation format defined here, but a combination of both instance segmentation and semantic segmentation data format. See Use Builtin Datasets for instructions on COCO.

Fast R-CNN (with pre-computed proposals) models are rarely used today. To train a Fast R-CNN, the following extra keys are needed:

  • proposal_boxes (array): 2D numpy array with shape (K, 4) representing K precomputed proposal boxes for this image.

  • proposal_objectness_logits (array): numpy array with shape (K, ), which corresponds to the objectness logits of proposals in ‘proposal_boxes’.

  • proposal_bbox_mode (int): the format of the precomputed proposal bbox. It must be a member of structures.BoxMode. Default is BoxMode.XYXY_ABS.

Custom Dataset Dicts for New Tasks

In the list[dict] that your dataset function returns, the dictionary can also have arbitrary custom data. This will be useful for a new task that needs extra information not covered by the standard dataset dicts. In this case, you need to make sure the downstream code can handle your data correctly. Usually this requires writing a new mapper for the dataloader (see Use Custom Dataloaders).

When designing a custom format, note that all dicts are stored in memory (sometimes serialized and with multiple copies). To save memory, each dict is meant to contain small but sufficient information about each sample, such as file names and annotations. Loading full samples typically happens in the data loader.

For attributes shared among the entire dataset, use Metadata (see below). To avoid extra memory, do not save such information inside each sample.
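
One way to follow this advice, sketched with hypothetical names: each sample stores only a file path and a small task-specific field (here an invented `caption` key), while dataset-wide information such as the vocabulary is kept once, outside the per-sample dicts, where it could be attached as metadata.

```python
from typing import Any, Dict, List

# Dataset-wide attribute, stored once (a candidate for Metadata, not for each sample).
VOCABULARY = ["a", "cat", "dog", "on", "grass"]

# Hypothetical raw samples: (image path, caption as indices into VOCABULARY).
_RAW = [
    ("images/0001.jpg", [0, 1, 3, 4]),
    ("images/0002.jpg", [0, 2]),
]


def captioning_dataset() -> List[Dict[str, Any]]:
    """Return small per-sample dicts; images are loaded later by the data loader."""
    return [
        {"file_name": path, "image_id": idx, "caption": tokens}
        for idx, (path, tokens) in enumerate(_RAW)
    ]


def decode(tokens: List[int]) -> str:
    """Interpret a sample using the shared vocabulary."""
    return " ".join(VOCABULARY[t] for t in tokens)
```

A custom mapper would then read the `caption` key when building model inputs.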

“Metadata” for Datasets

Each dataset is associated with some metadata, accessible through MetadataCatalog.get(dataset_name).some_metadata. Metadata is a key-value mapping that contains information shared among the entire dataset. It is usually used to interpret what’s in the dataset, e.g., names of classes, colors of classes, root of files, etc. This information is useful for augmentation, evaluation, visualization, logging, etc. The structure of the metadata depends on what the corresponding downstream code needs.

If you register a new dataset through DatasetCatalog.register, you may also want to add its corresponding metadata through MetadataCatalog.get(dataset_name).some_key = some_value, to enable any features that need the metadata. You can do it like this (using the metadata key “thing_classes” as an example):

from detectron2.data import MetadataCatalog
MetadataCatalog.get("my_dataset").thing_classes = ["person", "dog"]

Here is a list of metadata keys that are used by builtin features in detectron2. If you add your own dataset without these metadata, some features may be unavailable to you:

  • thing_classes (list[str]): Used by all instance detection/segmentation tasks. A list of names for each instance/thing category. If you load a COCO format dataset, it will be automatically set by the function load_coco_json.

  • thing_colors (list[tuple(r, g, b)]): Pre-defined color (in [0, 255]) for each thing category. Used for visualization. If not given, random colors will be used.

  • stuff_classes (list[str]): Used by semantic and panoptic segmentation tasks. A list of names for each stuff category.

  • stuff_colors (list[tuple(r, g, b)]): Pre-defined color (in [0, 255]) for each stuff category. Used for visualization. If not given, random colors are used.

  • ignore_label (int): Used by semantic and panoptic segmentation tasks. Pixels in ground-truth annotations with this category label should be ignored in evaluation. Typically these are “unlabeled” pixels.

  • keypoint_names (list[str]): Used by keypoint detection. A list of names for each keypoint.

  • keypoint_flip_map (list[tuple[str]]): Used by keypoint detection. A list of pairs of names, where each pair is the two keypoints that should be swapped if the image is flipped horizontally during augmentation.

  • keypoint_connection_rules: list[tuple(str, str, (r, g, b))]. Each tuple specifies a pair of keypoints that are connected and the color (in [0, 255]) to use for the line between them when visualized.
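
The keys above can be set together. The sketch below uses made-up class names, keypoint names, and colors, plus a hypothetical `check_metadata` helper that verifies the keys are consistent with each other before they are attached. It assumes that Metadata objects accept several keys at once through their `set(**kwargs)` method; the detectron2 import is deferred so the check can run on its own.

```python
from typing import Any, Dict

# Hypothetical metadata for a two-class dataset with three keypoints.
MY_METADATA: Dict[str, Any] = {
    "thing_classes": ["person", "dog"],
    "thing_colors": [(220, 20, 60), (0, 128, 255)],
    "keypoint_names": ["nose", "left_eye", "right_eye"],
    "keypoint_flip_map": [("left_eye", "right_eye")],
    "keypoint_connection_rules": [
        ("nose", "left_eye", (255, 0, 0)),
        ("nose", "right_eye", (0, 0, 255)),
    ],
}


def check_metadata(meta: Dict[str, Any]) -> None:
    """Sanity-check that the keys agree with each other."""
    assert len(meta["thing_colors"]) == len(meta["thing_classes"])
    names = set(meta["keypoint_names"])
    for a, b in meta["keypoint_flip_map"]:
        assert a in names and b in names
    for a, b, color in meta["keypoint_connection_rules"]:
        assert a in names and b in names and len(color) == 3


def attach_metadata(dataset_name: str, meta: Dict[str, Any]) -> None:
    """Set every key on the dataset's metadata (requires detectron2)."""
    from detectron2.data import MetadataCatalog

    check_metadata(meta)
    MetadataCatalog.get(dataset_name).set(**meta)
```

For example, attach_metadata("my_dataset", MY_METADATA) would make these keys available to visualization and keypoint augmentation code.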

Some additional metadata that are specific to the evaluation of certain datasets (e.g. COCO):

  • thing_dataset_id_to_contiguous_id (dict[int->int]): Used by all instance detection/segmentation tasks in the COCO format. A mapping from instance class ids in the dataset to contiguous ids in range [0, #class). Will be automatically set by the function load_coco_json.

  • stuff_dataset_id_to_contiguous_id (dict[int->int]): Used when generating prediction json files for semantic/panoptic segmentation. A mapping from semantic segmentation class ids in the dataset to contiguous ids in [0, num_categories). It is useful for evaluation only.

  • json_file: The COCO annotation json file. Used by COCO evaluation for COCO-format datasets.

  • panoptic_root, panoptic_json: Used by COCO-format panoptic evaluation.

  • evaluator_type: Used by the builtin main training script to select evaluator. Don’t use it in a new training script. You can just provide the DatasetEvaluator for your dataset directly in your main script.
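
The id mappings above can be illustrated with plain Python. This sketch assumes a COCO-style category list with non-contiguous ids (the ids and names are made up) and builds the mapping by sorting the dataset ids. It is an illustration of the idea, not the code that detectron2 runs.

```python
from typing import Dict, List

# Hypothetical COCO-style categories with gaps in their ids.
categories = [
    {"id": 7, "name": "train"},
    {"id": 1, "name": "person"},
    {"id": 3, "name": "car"},
]


def contiguous_id_map(cats: List[Dict]) -> Dict[int, int]:
    """Map dataset category ids to contiguous ids in [0, #class), by sorted id."""
    return {cat_id: i for i, cat_id in enumerate(sorted(c["id"] for c in cats))}


id_map = contiguous_id_map(categories)
# Class names ordered by contiguous id, as thing_classes would list them.
thing_classes = [c["name"] for c in sorted(categories, key=lambda c: c["id"])]
# Inverse mapping, useful when writing predictions back in dataset ids.
reverse_map = {v: k for k, v in id_map.items()}
```

The inverse mapping is what an evaluator needs when predictions with contiguous ids have to be written out using the original dataset ids.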


In recognition, sometimes we use the term “thing” for instance-level tasks, and “stuff” for semantic segmentation tasks. Both are used in panoptic segmentation tasks. For background on the concept of “thing” and “stuff”, see On Seeing Stuff: The Perception of Materials by Humans and Machines.

Register a COCO Format Dataset

If your instance-level (detection, segmentation, keypoint) dataset is already a json file in the COCO format, the dataset and its associated metadata can be registered easily with:

from detectron2.data.datasets import register_coco_instances
register_coco_instances("my_dataset", {}, "json_annotation.json", "path/to/image/dir")

If your dataset is in COCO format but needs to be further processed, or has extra custom per-instance annotations, the load_coco_json function might be useful.
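
As a sketch of the further-processing case, a dataset function could wrap load_coco_json and post-process its output. The filter, the `min_area` threshold, and the helper names below are hypothetical. The sketch assumes load_coco_json can be imported from detectron2.data.datasets and called as load_coco_json(json_file, image_root, dataset_name), and that each loaded annotation carries its box as x, y, width, height, as in COCO json files.

```python
from typing import Dict, List


def drop_small_instances(dataset_dicts: List[Dict], min_area: float) -> List[Dict]:
    """Return copies of the records without instances whose box area is below min_area."""
    result = []
    for record in dataset_dicts:
        kept = [
            ann for ann in record.get("annotations", [])
            if ann["bbox"][2] * ann["bbox"][3] >= min_area  # width * height
        ]
        result.append({**record, "annotations": kept})
    return result


def load_my_coco(json_file: str, image_root: str, name: str) -> List[Dict]:
    """Load a COCO-format json, then post-process it (requires detectron2)."""
    from detectron2.data.datasets import load_coco_json

    return drop_small_instances(load_coco_json(json_file, image_root, name), min_area=16.0)


def register_my_coco(name: str, json_file: str, image_root: str) -> None:
    """Register the post-processed dataset under `name` (requires detectron2)."""
    from detectron2.data import DatasetCatalog

    DatasetCatalog.register(name, lambda: load_my_coco(json_file, image_root, name))
```

With this approach, metadata such as json_file would still need to be set separately if COCO evaluation is wanted.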

Update the Config for New Datasets

Once you’ve registered the dataset, you can use the name of the dataset (e.g., “my_dataset” in the example above) in cfg.DATASETS.{TRAIN,TEST}. There are other configs you might want to change to train or evaluate on new datasets:

  • MODEL.ROI_HEADS.NUM_CLASSES and MODEL.RETINANET.NUM_CLASSES are the number of thing classes for R-CNN and RetinaNet models, respectively.

  • MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS sets the number of keypoints for Keypoint R-CNN. You’ll also need to set Keypoint OKS with TEST.KEYPOINT_OKS_SIGMAS for evaluation.

  • MODEL.SEM_SEG_HEAD.NUM_CLASSES sets the number of stuff classes for Semantic FPN & Panoptic FPN.

  • TEST.DETECTIONS_PER_IMAGE controls the maximum number of objects to be detected. Set it to a larger number if test images may contain >100 objects.

  • If you’re training Fast R-CNN (with precomputed proposals), DATASETS.PROPOSAL_FILES_{TRAIN,TEST} need to match the datasets. The format of proposal files is documented here.

New models (e.g. TensorMask, PointRend) often have similar configs of their own that need to be changed as well.
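
The config changes listed above might look like the following sketch. The helper name, the validation split name “my_dataset_val”, and the value 200 are hypothetical. The function works on any object with detectron2's config attribute layout, so it is exercised here with a small stand-in rather than a real config.

```python
from types import SimpleNamespace


def configure_for_dataset(cfg, train_name: str, test_name: str, num_classes: int):
    """Point a detectron2-style config at a registered dataset."""
    cfg.DATASETS.TRAIN = (train_name,)
    cfg.DATASETS.TEST = (test_name,)
    # Number of thing classes for R-CNN and RetinaNet models, respectively.
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes
    cfg.MODEL.RETINANET.NUM_CLASSES = num_classes
    # Raise the cap if test images may contain more than 100 objects.
    cfg.TEST.DETECTIONS_PER_IMAGE = 200
    return cfg


def make_stub_cfg():
    """Stand-in with the same attribute layout, used only to exercise the sketch."""
    return SimpleNamespace(
        DATASETS=SimpleNamespace(TRAIN=(), TEST=()),
        MODEL=SimpleNamespace(
            ROI_HEADS=SimpleNamespace(NUM_CLASSES=80),
            RETINANET=SimpleNamespace(NUM_CLASSES=80),
        ),
        TEST=SimpleNamespace(DETECTIONS_PER_IMAGE=100),
    )


cfg = configure_for_dataset(make_stub_cfg(), "my_dataset", "my_dataset_val", num_classes=2)
```

In a real training script, the object passed in would be the config returned by get_cfg() from detectron2.config instead of the stand-in.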


After changing the number of classes, certain layers in a pre-trained model will become incompatible and therefore cannot be loaded into the new model. This is expected, and loading such pre-trained models will produce warnings about those layers.