Models (and their sub-models) in detectron2 are built by functions such as build_model:

```python
from detectron2.modeling import build_model
model = build_model(cfg)  # returns a torch.nn.Module
```
To load an existing checkpoint into the model, use detectron2's checkpointing utilities.
Detectron2 recognizes models in pytorch’s .pth format, as well as the .pkl files in our model zoo.
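For example, loading and saving can be done with detectron2's DetectionCheckpointer. A minimal sketch (it assumes a built `model`; `file_path_or_url` and the `"output"` directory are placeholders you would substitute):

```python
from detectron2.checkpoint import DetectionCheckpointer

# Load weights (a local .pth/.pkl file, or a model zoo URL) into an already-built model.
DetectionCheckpointer(model).load(file_path_or_url)

# Save the current weights; this writes output/model_final.pth.
checkpointer = DetectionCheckpointer(model, save_dir="output")
checkpointer.save("model_final")
```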
You can use a model by just calling outputs = model(inputs).
Next, we explain the inputs/outputs format used by the builtin models in detectron2.
Model Input Format
All builtin models take a list[dict] as the inputs. Each dict corresponds to information about one image. The dict may contain the following keys:
- “image”: a Tensor in (C, H, W) format. The meaning of the channels is defined by cfg.INPUT.FORMAT.
- “instances”: an Instances object, with the following fields:
  - “gt_boxes”: a Boxes object storing N boxes, one for each instance.
  - “gt_classes”: a Tensor of long type, a vector of N labels, in range [0, num_categories).
  - “gt_masks”: a PolygonMasks object storing N masks, one for each instance.
  - “gt_keypoints”: a Keypoints object storing N keypoint sets, one for each instance.
- “proposals”: an Instances object used in Fast R-CNN style models, with the following fields:
  - “proposal_boxes”: a Boxes object storing P proposal boxes.
  - “objectness_logits”: a Tensor, a vector of P scores, one for each proposal.
- “height”, “width”: the desired output height and width of the image, not necessarily the same as the height or width of the image when input into the model, which might be after resizing. For example, it can be the original image height and width before resizing. If provided, the model will produce output in this resolution, rather than in the resolution of the image as input into the model. This is more efficient and accurate.
- “sem_seg”: a Tensor[int] in (H, W) format. The semantic segmentation ground truth. Values represent category labels starting from 0.
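As a concrete sketch of the input format above (it assumes torch is installed; the image contents and sizes are made up), an inference input for a single image can be assembled like this:

```python
import torch

# A dummy 3-channel image; in practice this comes from your data loader,
# already in the channel order given by cfg.INPUT.FORMAT (e.g. BGR).
image = torch.zeros(3, 480, 640)  # (C, H, W)

inputs = [
    {
        "image": image,
        "height": 480,  # optional: desired output resolution
        "width": 640,
    }
]
# outputs = model(inputs)  # run on a built model in eval mode
```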
Model Output Format
When in training mode, the builtin models output a dict[str->ScalarTensor] with all the losses.

When in inference mode, the builtin models output a list[dict], one dict for each image. Each dict may contain:
- “instances”: an Instances object with the following fields:
  - “pred_boxes”: a Boxes object storing N boxes, one for each detected instance.
  - “scores”: a Tensor, a vector of N confidence scores.
  - “pred_classes”: a Tensor, a vector of N labels in range [0, num_categories).
  - “pred_masks”: a Tensor of shape (N, H, W), masks for each detected instance.
  - “pred_keypoints”: a Tensor of shape (N, num_keypoint, 3). Each row in the last dimension is (x, y, score).
- “sem_seg”: a Tensor of (num_categories, H, W), the semantic segmentation prediction.
- “proposals”: an Instances object with the following fields:
  - “proposal_boxes”: a Boxes object storing N boxes.
  - “objectness_logits”: a torch vector of N scores.
- “panoptic_seg”: a tuple of (Tensor, list[dict]). The tensor has shape (H, W), where each element represents the segment id of the pixel. Each dict describes one segment id and has the following fields:
  - “id”: the segment id
  - “isthing”: whether the segment is a thing or stuff
  - “category_id”: the category id of this segment. It represents the thing class id when isthing==True, and the stuff class id otherwise.
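To illustrate how such per-image output dicts are typically consumed, here is a small hedged helper (`keep_confident` is a hypothetical name, not part of detectron2; the mock below uses plain dicts of lists, whereas the real “instances” entry is an Instances object whose fields are read as attributes, e.g. `out["instances"].scores`):

```python
def keep_confident(instances, thresh=0.5):
    """Keep only detections whose score is >= thresh.

    `instances` is mocked here as a dict of equal-length lists;
    real code would index an Instances object instead.
    """
    keep = [i for i, s in enumerate(instances["scores"]) if s >= thresh]
    return {k: [v[i] for i in keep] for k, v in instances.items()}

# A mocked per-image output following the field names above.
mock = {
    "pred_boxes": [[0, 0, 10, 10], [5, 5, 20, 20]],
    "scores": [0.9, 0.3],
    "pred_classes": [1, 7],
}
filtered = keep_confident(mock)
# filtered keeps only the first detection (score 0.9)
```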
How to use a model in your code:

Construct your own list[dict] with the necessary keys. For example, for inference, provide dicts with “image”, and optionally “height” and “width”.
Note that when in training mode, all models are required to be used under an EventStorage. The training statistics will be put into the storage:

```python
from detectron2.utils.events import EventStorage
with EventStorage() as storage:
    losses = model(inputs)
```
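In a training loop, the returned loss dict is then typically summed into one scalar and backpropagated. A self-contained sketch with a stand-in model (`fake_model` and its loss names are made up for illustration; a real detectron2 model in training mode returns a dict of scalar loss tensors the same way, and assumes torch is installed):

```python
import torch

# A single parameter standing in for the model's weights.
w = torch.nn.Parameter(torch.tensor(2.0))

def fake_model(inputs):
    # Mimics a detectron2 model in training mode: returns a dict of scalar losses.
    return {"loss_cls": (w - 1.0) ** 2, "loss_box_reg": 0.5 * w ** 2}

losses = fake_model(None)
total_loss = sum(losses.values())  # combine all losses into one scalar
total_loss.backward()              # gradients are now populated on the parameters
```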