Models (and their sub-models) in detectron2 are built by
functions such as build_model:
    from detectron2.modeling import build_model
    model = build_model(cfg)  # returns a torch.nn.Module
build_model only builds the model structure and fills it with random parameters.
See below for how to load an existing checkpoint to the model,
and how to use the model object.
Load/Save a Checkpoint
    from detectron2.checkpoint import DetectionCheckpointer
    DetectionCheckpointer(model).load(file_path)  # load a file to model
    checkpointer = DetectionCheckpointer(model, save_dir="output")
    checkpointer.save("model_999")  # save to output/model_999.pth
Detectron2’s checkpointer recognizes models in PyTorch’s
.pth format, as well as the .pkl files
in our model zoo.
See API doc
for more details about its usage.
The model files can be arbitrarily manipulated using
torch.{load,save} for .pth files or pickle.{dump,load} for .pkl files.
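As a minimal sketch of such manipulation, the following uses only Python's pickle module to load a checkpoint-like dictionary, rename one parameter, and write it back. The file name, the "model" key layout, and the parameter names are hypothetical stand-ins chosen for illustration; inspect a real checkpoint before relying on any particular structure.

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint-like dict; plain lists stand in for weight arrays.
checkpoint = {"model": {"old_name.weight": [0.1, 0.2], "head.bias": [0.0]}}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")  # illustrative path
with open(path, "wb") as f:
    pickle.dump(checkpoint, f)

# Load the file, rename one parameter, and save it back.
with open(path, "rb") as f:
    data = pickle.load(f)
data["model"]["new_name.weight"] = data["model"].pop("old_name.weight")
with open(path, "wb") as f:
    pickle.dump(data, f)

# Re-read the file to confirm the rename was persisted.
with open(path, "rb") as f:
    renamed_keys = sorted(pickle.load(f)["model"])
print(renamed_keys)
```

The same load, edit, save pattern should apply to .pth files with the corresponding torch functions, though the contents of a given file may differ from this toy layout.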
Use a Model
A model can be called by
outputs = model(inputs), where
inputs is a list[dict].
Each dict corresponds to one image and the required keys
depend on the type of model, and whether the model is in training or evaluation mode.
For example, in order to do inference,
all existing models expect the “image” key, and optionally “height” and “width”.
The detailed format of the inputs and outputs of existing models is explained below.
When in training mode, all models are required to be used under an EventStorage.
The training statistics will be put into the storage:
    from detectron2.utils.events import EventStorage
    with EventStorage() as storage:
        losses = model(inputs)
If you only want to do simple inference using an existing model, DefaultPredictor is a wrapper around the model that provides such basic functionality. It provides default behavior, including model loading and preprocessing, and operates on a single image rather than on batches.
Model Input Format
Users can implement custom models that support any arbitrary input format.
Here we describe the standard input format that all builtin models support in detectron2.
They all take a
list[dict] as the inputs. Each dict
corresponds to information about one image.
The dict may contain the following keys:
“image”: Tensor in (C, H, W) format. The meaning of the channels is defined by
cfg.INPUT.FORMAT. Image normalization, if any, will be performed inside the model using cfg.MODEL.PIXEL_{MEAN,STD}.
“instances”: an Instances object, with the following fields:
“gt_boxes”: a Boxes object storing N boxes, one for each instance.
“gt_classes”: Tensor of long type, a vector of N labels, in range [0, num_categories).
“gt_keypoints”: a Keypoints object storing N keypoint sets, one for each instance.
“proposals”: an Instances object used only in Fast R-CNN style models, with the following fields:
“proposal_boxes”: a Boxes object storing P proposal boxes.
“objectness_logits”: Tensor, a vector of P scores, one for each proposal.
“height”, “width”: the desired output height and width, which are not necessarily the same as the height or width of the
image input field. For example, the
image input field might be a resized image, but you may want the outputs to be in the original resolution.
If provided, the model will produce output in this resolution, rather than in the resolution of the
image as input into the model. This is more efficient and accurate.
“sem_seg”: Tensor[int] in (H, W) format. The semantic segmentation ground truth. Values represent category labels starting from 0.
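To illustrate what “height” and “width” imply for box outputs, here is a rough sketch of the coordinate rescaling involved when the model input was resized. The function name rescale_box and the sample sizes are hypothetical, and this is not detectron2's own postprocessing code, only the underlying arithmetic.

```python
def rescale_box(box, input_size, output_size):
    """Map an (x0, y0, x1, y1) box from input resolution to output resolution.

    input_size and output_size are (height, width) pairs.
    """
    in_h, in_w = input_size
    out_h, out_w = output_size
    scale_x = out_w / in_w
    scale_y = out_h / in_h
    x0, y0, x1, y1 = box
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


# Assume a 480x640 image was resized to 240x320 before entering the model.
box_in_model_space = (10.0, 20.0, 110.0, 120.0)
box_in_original = rescale_box(box_in_model_space, (240, 320), (480, 640))
print(box_in_original)  # (20.0, 40.0, 220.0, 240.0)
```

Passing the original “height” and “width” alongside the resized image lets the model report results in the coordinate frame of the original image without a separate conversion step like this one.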
Model Output Format
When in training mode, the builtin models output a
dict[str->ScalarTensor] with all the losses.
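As a small illustration of consuming that dictionary, the sketch below sums its entries into a single value, which is a common pattern before backpropagation. Plain floats stand in for scalar tensors, and the key names and values are illustrative assumptions rather than the output of any particular model.

```python
# Stand-in for the dict[str -> scalar] returned in training mode;
# plain floats replace scalar tensors, and the key names are illustrative.
loss_dict = {"loss_cls": 0.5, "loss_box_reg": 0.25, "loss_mask": 0.125}

# Sum all entries into one value to optimize.
total_loss = sum(loss_dict.values())
print(total_loss)  # 0.875
```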
When in inference mode, the builtin models output a
list[dict], one dict for each image.
Based on the tasks the model is doing, each dict may contain the following fields:
“instances”: Instances object with the following fields:
“pred_boxes”: Boxes object storing N boxes, one for each detected instance.
“scores”: Tensor, a vector of N scores.
“pred_classes”: Tensor, a vector of N labels in range [0, num_categories).
“pred_masks”: a Tensor of shape (N, H, W), masks for each detected instance.
“pred_keypoints”: a Tensor of shape (N, num_keypoint, 3). Each row in the last dimension is (x, y, score). Scores are larger than 0.
“sem_seg”: Tensor of shape (num_categories, H, W), the semantic segmentation prediction.
“proposals”: Instances object with the following fields:
“proposal_boxes”: Boxes object storing N boxes.
“objectness_logits”: a torch vector of N scores.
“panoptic_seg”: A tuple of
(Tensor, list[dict]). The tensor has shape (H, W), where each element represents the segment id of the pixel. Each dict describes one segment id and has the following fields:
“id”: the segment id
“isthing”: whether the segment is a thing or stuff
“category_id”: the category id of this segment. It represents the thing class id when
isthing==True, and the stuff class id otherwise.
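The structure of “panoptic_seg” can be easier to follow with a small example. The sketch below uses nested Python lists as a stand-in for the (H, W) tensor and made-up segment values; the helper name summarize_segments is hypothetical.

```python
from collections import Counter

# Stand-in for the (H, W) tensor of segment ids (made-up values).
segment_ids = [
    [1, 1, 2],
    [1, 3, 3],
]
# One dict per segment id, using the fields described above.
segments_info = [
    {"id": 1, "isthing": True, "category_id": 0},
    {"id": 2, "isthing": True, "category_id": 0},
    {"id": 3, "isthing": False, "category_id": 5},
]


def summarize_segments(segment_ids, segments_info):
    """Return {segment id: (kind, category_id, pixel count)}."""
    pixel_counts = Counter(v for row in segment_ids for v in row)
    summary = {}
    for info in segments_info:
        kind = "thing" if info["isthing"] else "stuff"
        summary[info["id"]] = (kind, info["category_id"], pixel_counts[info["id"]])
    return summary


summary = summarize_segments(segment_ids, segments_info)
print(summary)
```

Here segments 1 and 2 are two separate instances of the same thing class, while segment 3 is a stuff region whose “category_id” refers to a stuff class.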
Partially execute a model:
Sometimes you may want to obtain an intermediate tensor inside a model. Since there are typically hundreds of intermediate tensors, there isn’t an API that provides you the intermediate result you need. You have the following options:
Write a (sub)model. Following the tutorial, you can rewrite a model component (e.g. a head of a model), such that it does the same thing as the existing component, but returns the output you need.
Partially execute a model. You can create the model as usual, but use custom code to execute it instead of its
forward(). For example, the following code obtains mask features before mask head.
    from detectron2.modeling import build_model
    from detectron2.structures import ImageList

    images = ImageList.from_tensors(...)  # preprocessed input tensor
    model = build_model(cfg)
    features = model.backbone(images.tensor)
    proposals, _ = model.proposal_generator(images, features)
    instances = model.roi_heads._forward_box(features, proposals)
    # Select the feature maps used by the ROI heads, then pool them per predicted box.
    mask_features = [features[f] for f in model.roi_heads.in_features]
    mask_features = model.roi_heads.mask_pooler(mask_features, [x.pred_boxes for x in instances])
Note that both options require you to read the existing forward code to understand how to write code to obtain the outputs you need.