Data Augmentation

Augmentation is an important part of training. Detectron2’s data augmentation system aims at addressing the following goals:

  1. Allow augmenting multiple data types together (e.g., images together with their bounding boxes and masks)

  2. Allow applying a sequence of statically-declared augmentations

  3. Allow adding custom new data types to augment (rotated bounding boxes, video clips, etc.)

  4. Process and manipulate the operations that are applied by augmentations

The first two features cover most of the common use cases, and are also available in other libraries such as albumentations. Supporting the other two features adds some overhead to detectron2’s augmentation API, which we’ll explain in this tutorial.

This tutorial focuses on how to use augmentations when writing new data loaders, and how to write new augmentations. If you use the default data loader in detectron2, it already supports taking a user-provided list of custom augmentations, as explained in the Dataloader tutorial.

Basic Usage

The basic usage of features (1) and (2) looks like this:

from detectron2.data import transforms as T
# Define a sequence of augmentations:
augs = T.AugmentationList([
    T.RandomBrightness(0.9, 1.1),
    T.RandomFlip(prob=0.5),
    T.RandomCrop("absolute", (640, 640))
])  # type: T.Augmentation

# Define the augmentation input ("image" required, others optional):
input = T.AugInput(image, boxes=boxes, sem_seg=sem_seg)
# Apply the augmentation:
transform = augs(input)  # type: T.Transform
image_transformed = input.image  # new image
sem_seg_transformed = input.sem_seg  # new semantic segmentation

# For any extra data that needs to be augmented together, use transform, e.g.:
image2_transformed = transform.apply_image(image2)
polygons_transformed = transform.apply_polygons(polygons)

Three basic concepts are involved here. They are:

  • T.Augmentation defines the “policy” to modify inputs.

    • its __call__(AugInput) -> Transform method augments the inputs in-place, and returns the operation that is applied

  • T.Transform implements the actual operations to transform data

    • it has methods such as apply_image, apply_coords that define how to transform each data type

  • T.AugInput stores inputs needed by T.Augmentation and how they should be transformed. This concept is needed for some advanced usage. Using this class directly should be sufficient for all common use cases, since extra data not in T.AugInput can be augmented using the returned transform, as shown in the above example.
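To make the division of labor between the three concepts concrete, here is a minimal sketch in plain Python. It does not use detectron2: ToyHFlip, ToyNoOp, ToyAugInput and ToyRandomFlip are hypothetical stand-ins, images are lists of rows, and the real classes do considerably more.

```python
import random

class ToyHFlip:
    """Stand-in for a Transform: a fixed operation with one method per data type."""
    def __init__(self, width):
        self.width = width

    def apply_image(self, image):
        # image is a list of rows; mirror each row
        return [row[::-1] for row in image]

    def apply_coords(self, coords):
        return [(self.width - x, y) for x, y in coords]

class ToyNoOp:
    """Stand-in for a Transform that leaves data unchanged."""
    def apply_image(self, image):
        return image

    def apply_coords(self, coords):
        return coords

class ToyAugInput:
    """Stand-in for AugInput: holds the data and knows how to transform each field."""
    def __init__(self, image, coords=None):
        self.image = image
        self.coords = coords

    def transform(self, tfm):
        self.image = tfm.apply_image(self.image)
        if self.coords is not None:
            self.coords = tfm.apply_coords(self.coords)

class ToyRandomFlip:
    """Stand-in for an Augmentation: the policy that picks a transform."""
    def __init__(self, prob=0.5, seed=None):
        self.prob = prob
        self.rng = random.Random(seed)

    def get_transform(self, image):
        width = len(image[0])
        return ToyHFlip(width) if self.rng.random() < self.prob else ToyNoOp()

    def __call__(self, aug_input):
        tfm = self.get_transform(aug_input.image)
        aug_input.transform(tfm)   # modifies the input in place
        return tfm                 # the operation, reusable on extra data

toy_input = ToyAugInput([[1, 2, 3], [4, 5, 6]], coords=[(0, 0), (3, 2)])
tfm = ToyRandomFlip(prob=1.0)(toy_input)   # prob=1.0 always flips
extra = tfm.apply_coords([(1, 1)])         # extra data not stored in the input
```

The policy only decides which operation to apply; the returned operation can then be reused on any data that was not part of the input.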

Write New Augmentations

Most 2D augmentations only need to know about the input image. Such augmentations can be implemented easily like this:

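import numpy as np
from detectron2.data import transforms as T

class MyColorAugmentation(T.Augmentation):
    def get_transform(self, image):
        # Draw a random scale (r[0]) and offset (r[1] * 10), applied to every pixel
        r = np.random.rand(2)
        return T.ColorTransform(lambda x: x * r[0] + r[1] * 10)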

class MyCustomResize(T.Augmentation):
    def get_transform(self, image):
        old_h, old_w = image.shape[:2]
        new_h, new_w = int(old_h * np.random.rand()), int(old_w * 1.5)
        return T.ResizeTransform(old_h, old_w, new_h, new_w)

augs = MyCustomResize()
transform = augs(input)

In addition to image, any attributes of the given AugInput can be used as long as they are part of the function signature, e.g.:

class MyCustomCrop(T.Augmentation):
    def get_transform(self, image, sem_seg):
        # decide where to crop using both image and sem_seg
        return T.CropTransform(...)

augs = MyCustomCrop()
assert hasattr(input, "image") and hasattr(input, "sem_seg")
transform = augs(input)
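One way to picture the signature-based lookup is the sketch below, which reads the parameter names of get_transform and fetches attributes with the same names from the input. This is an illustration only: call_with_input_attrs and ToyCrop are hypothetical names, and detectron2's internal implementation may differ.

```python
import inspect
from types import SimpleNamespace

def call_with_input_attrs(func, aug_input):
    """Call func with the attributes of aug_input that are named in its signature."""
    names = list(inspect.signature(func).parameters)
    missing = [n for n in names if not hasattr(aug_input, n)]
    if missing:
        raise AttributeError(f"input is missing attributes: {missing}")
    return func(*(getattr(aug_input, n) for n in names))

class ToyCrop:
    def get_transform(self, image, sem_seg):
        # A real augmentation would return a transform; a tuple is enough here
        return ("crop", len(image), len(sem_seg))

toy_input = SimpleNamespace(image=[0, 1, 2], sem_seg=[0, 1], boxes=None)
result = call_with_input_attrs(ToyCrop().get_transform, toy_input)

# An input without "sem_seg" cannot satisfy the signature
try:
    call_with_input_attrs(ToyCrop().get_transform, SimpleNamespace(image=[0]))
    lookup_failed = False
except AttributeError:
    lookup_failed = True
```

Because the lookup is driven by parameter names, an augmentation declares what it needs simply by naming it in get_transform.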

New transform operations can also be added by subclassing T.Transform.

Advanced Usage

We give a few examples of advanced usages that are enabled by our system. These options can be interesting to new research, although changing them is often not needed for standard use cases.

Custom transform strategy

Instead of only returning the augmented data, detectron2’s Augmentation returns the operations as T.Transform. This allows users to apply a custom transform strategy to their data. We use keypoint data as an example.

Keypoints are (x, y) coordinates, but they are not so trivial to augment due to the semantic meaning they carry. Such meaning is only known to the users, therefore users may want to augment them manually by looking at the returned transform. For example, when an image is horizontally flipped, we’d like to swap the keypoint annotations for “left eye” and “right eye”. This can be done like this (included by default in detectron2’s default data loader):

# augs, input are defined as in previous examples
transform = augs(input)  # type: T.Transform
keypoints_xy = transform.apply_coords(keypoints_xy)   # transform the coordinates

# get a list of all transforms that were applied
transforms = T.TransformList([transform]).transforms
# check whether the image was flipped an odd number of times
do_hflip = sum(isinstance(t, T.HFlipTransform) for t in transforms) % 2 == 1
if do_hflip:
    keypoints_xy = keypoints_xy[flip_indices_mapping]
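The example above assumes a flip_indices_mapping already exists. A small sketch of how such a mapping could be built from keypoint names is shown here, using plain Python lists rather than NumPy arrays. The keypoint names, flip_map, build_flip_indices and hflip_keypoints are all hypothetical and chosen for illustration.

```python
keypoint_names = ["nose", "left_eye", "right_eye"]
flip_map = {"left_eye": "right_eye", "right_eye": "left_eye"}

def build_flip_indices(names, flip_map):
    """For each keypoint slot, the index of the slot to read from after a flip."""
    return [names.index(flip_map.get(n, n)) for n in names]

def hflip_keypoints(keypoints_xy, width, flip_indices):
    # Step 1: mirror the x coordinate, as a horizontal flip does
    mirrored = [(width - x, y) for x, y in keypoints_xy]
    # Step 2: swap left/right slots so the labels stay semantically correct
    return [mirrored[i] for i in flip_indices]

flip_indices = build_flip_indices(keypoint_names, flip_map)
keypoints_xy = [(30, 20), (40, 10), (20, 10)]   # nose, left_eye, right_eye
flipped = hflip_keypoints(keypoints_xy, width=100, flip_indices=flip_indices)
```

In this example the left eye has the larger x coordinate both before and after the flip, which is what the swap is meant to preserve.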

As another example, keypoint annotations often have a “visibility” field. A sequence of augmentations might move a visible keypoint outside the image boundary (e.g. with cropping), but then bring it back within the boundary afterwards (e.g. with image padding). If users decide to label such keypoints “invisible”, then the visibility check has to happen after every transform step. This can be achieved by:

transform = augs(input)  # type: T.TransformList
assert isinstance(transform, T.TransformList)
for t in transform.transforms:
    keypoints_xy = t.apply_coords(keypoints_xy)
    # parenthesize each comparison: "&" binds more tightly than ">=" and "<="
    visibility &= ((keypoints_xy >= [0, 0]) & (keypoints_xy <= [W, H])).all(axis=1)

# btw, detectron2's `transform_keypoint_annotations` function chooses to label such keypoints "visible":
# keypoints_xy = transform.apply_coords(keypoints_xy)
# visibility &= ((keypoints_xy >= [0, 0]) & (keypoints_xy <= [W, H])).all(axis=1)
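The difference between checking after every step and checking only at the end can be seen in a toy example. The sketch below uses plain Python and a hypothetical ToyShift class that shifts coordinates and carries its own output image size in an out_size attribute; that attribute is an assumption made for this illustration and is not something detectron2 transforms are claimed to provide.

```python
class ToyShift:
    """Toy transform: shift coordinates and record the resulting image size."""
    def __init__(self, dx, dy, out_w, out_h):
        self.dx, self.dy = dx, dy
        self.out_size = (out_w, out_h)   # hypothetical attribute

    def apply_coords(self, coords):
        return [(x + self.dx, y + self.dy) for x, y in coords]

def inside(pt, size):
    return 0 <= pt[0] <= size[0] and 0 <= pt[1] <= size[1]

def track_visibility(transforms, coords, visibility):
    """Apply transforms one by one; a keypoint that ever leaves the image stays invisible."""
    visibility = list(visibility)
    for t in transforms:
        coords = t.apply_coords(coords)
        visibility = [v and inside(p, t.out_size) for v, p in zip(visibility, coords)]
    return coords, visibility

crop = ToyShift(-10, 0, 80, 100)   # crop an 80x100 window starting at x=10
pad = ToyShift(10, 0, 100, 100)    # pad 10 columns on each side, back to 100x100
keypoints = [(95, 50), (20, 10)]
coords, vis = track_visibility([crop, pad], keypoints, [True, True])

# Checking only once at the end would call both keypoints visible
end_only = [inside(p, pad.out_size) for p in coords]
```

The first keypoint falls outside the cropped window and is marked invisible by the per-step check, even though padding brings it back inside the final image.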

Geometrically invert the transform

If images are pre-processed by augmentations before inference, the predicted results such as segmentation masks are localized on the augmented image. We’d like to invert the applied augmentation with the inverse() API, to obtain results on the original image:

transform = augs(input)
pred_mask = make_prediction(input.image)
inv_transform = transform.inverse()
pred_mask_orig = inv_transform.apply_segmentation(pred_mask)
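For a sequence of transforms, inversion means inverting each step and applying the inverses in reverse order. The sketch below illustrates this on coordinates rather than masks, with hypothetical toy classes in plain Python; it is not detectron2's implementation.

```python
class ToyHFlip:
    def __init__(self, width):
        self.width = width

    def apply_coords(self, coords):
        return [(self.width - x, y) for x, y in coords]

    def inverse(self):
        return self   # flipping twice restores the original

class ToyCropOffset:
    """Coordinate part of a crop: subtract the crop origin."""
    def __init__(self, x0, y0):
        self.x0, self.y0 = x0, y0

    def apply_coords(self, coords):
        return [(x - self.x0, y - self.y0) for x, y in coords]

    def inverse(self):
        return ToyCropOffset(-self.x0, -self.y0)   # shifts back, like padding

class ToyTransformList:
    def __init__(self, transforms):
        self.transforms = list(transforms)

    def apply_coords(self, coords):
        for t in self.transforms:
            coords = t.apply_coords(coords)
        return coords

    def inverse(self):
        # Undo the last step first
        return ToyTransformList([t.inverse() for t in reversed(self.transforms)])

forward = ToyTransformList([ToyCropOffset(10, 5), ToyHFlip(width=50)])
augmented = forward.apply_coords([(30, 20)])
restored = forward.inverse().apply_coords(augmented)
```

Applying the inverses in the original order instead would not recover the starting point, because the flip and the offset do not commute.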

Add new data types

T.Transform supports a few common data types to transform, including images, coordinates, masks, boxes, and polygons. It allows registering new data types, e.g.:

from typing import Any

@T.HFlipTransform.register_type("rotated_boxes")
def func(flip_transform: T.HFlipTransform, rotated_boxes: Any):
    # do the work
    return flipped_rotated_boxes

t = T.HFlipTransform(width=800)
transformed_rotated_boxes = t.apply_rotated_boxes(rotated_boxes)  # func will be called
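A registration decorator of this kind can be pictured as attaching an apply_<type> method to the transform class. The following plain-Python sketch shows the idea; ToyTransform, ToyHFlip and the (cx, cy, w, h, angle) box layout are assumptions made for illustration, and the real register_type may perform additional checks.

```python
class ToyTransform:
    @classmethod
    def register_type(cls, data_type, func=None):
        """Attach func to cls as the method apply_<data_type>."""
        if func is None:
            # Used as a decorator: @SomeTransform.register_type("name")
            def decorator(f):
                cls.register_type(data_type, f)
                return f
            return decorator
        setattr(cls, "apply_" + data_type, func)

class ToyHFlip(ToyTransform):
    def __init__(self, width):
        self.width = width

@ToyHFlip.register_type("rotated_boxes")
def flip_rotated_boxes(flip_transform, rotated_boxes):
    # Boxes are (cx, cy, w, h, angle): mirror the center and negate the angle
    return [(flip_transform.width - cx, cy, w, h, -angle)
            for cx, cy, w, h, angle in rotated_boxes]

t = ToyHFlip(width=800)
out = t.apply_rotated_boxes([(100, 50, 20, 10, 30)])
```

Registering on the subclass keeps the new method local to that transform; the base class is left untouched.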

Extend T.AugInput

An augmentation can only access attributes available in the given input. T.AugInput defines “image”, “boxes”, “sem_seg”, which are sufficient for common augmentation strategies to decide how to augment. If not, a custom implementation is needed.

By re-implementing the “transform()” method in AugInput, it is also possible to augment different fields in ways that depend on each other. Such use cases are uncommon (e.g. post-processing bounding boxes based on augmented masks), but are allowed by the system.
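As a rough sketch of such a dependent update, the toy input below transforms a polygon mask first and then derives the bounding box from the already-transformed polygon instead of transforming the box on its own. MaskBoxInput, ToyHFlip and tight_box are hypothetical names, written in plain Python without detectron2; a real custom input would follow whatever interface T.AugInput documents.

```python
def tight_box(points):
    """Smallest axis-aligned box (x0, y0, x1, y1) containing all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

class ToyHFlip:
    def __init__(self, width):
        self.width = width

    def apply_image(self, image):
        return [row[::-1] for row in image]

    def apply_coords(self, coords):
        return [(self.width - x, y) for x, y in coords]

class MaskBoxInput:
    """Toy input whose box is always recomputed from its polygon mask."""
    def __init__(self, image, polygon):
        self.image = image
        self.polygon = polygon
        self.box = tight_box(polygon)

    def transform(self, tfm):
        self.image = tfm.apply_image(self.image)
        self.polygon = tfm.apply_coords(self.polygon)
        # Dependent field: the box comes from the transformed mask
        self.box = tight_box(self.polygon)

toy_input = MaskBoxInput([[0, 1, 2, 3]], polygon=[(0, 0), (1, 0), (1, 1)])
toy_input.transform(ToyHFlip(width=4))
```

After the flip, the box is guaranteed to stay tight around the mask, because it is never transformed independently of it.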