Instance segmentation using Mask R-CNN

This article discusses the instance segmentation technique Mask R-CNN, with a brief introduction to the object detection techniques that led up to it: R-CNN, Fast R-CNN, and Faster R-CNN.

Aditi Mittal
5 min read · Jun 17, 2019


Instance Segmentation

Computer vision covers several related tasks, including classification, semantic segmentation, object detection, and instance segmentation. Classification assigns a single class label to the whole image; it makes one prediction for the entire input and ignores the pixel-level structure of the image. Semantic segmentation makes dense predictions, inferring a label for each pixel so that every pixel is labeled with the class of its enclosing object. Object detection provides not only the classes but also the spatial location of each object, usually as a bounding box, and it handles overlapping objects. Instance segmentation goes further and identifies the boundary of each individual object at the pixel level. For example, in the image below there are 7 balloons at certain locations, and instance segmentation also tells us which pixels belong to each one of the balloons.

Figure: Image segmentation
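To make the difference between these outputs concrete, here is a minimal sketch on a toy 4x6 grid with two "balloons". The grid, the values, and the helper names `union` and `mask_to_box` are made up for illustration and use plain Python lists; they are not taken from any library.

```python
# Semantic segmentation: one label per pixel (0 = background, 1 = balloon).
# Both balloons share the same label, so they cannot be told apart.
semantic_mask = [
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Instance segmentation: one binary mask per object.
instance_masks = [
    [[0, 1, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0], [0] * 6, [0] * 6],
    [[0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 1], [0] * 6, [0] * 6],
]


def union(masks):
    """Collapse per-object masks back into a single semantic mask."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[int(any(m[r][c] for m in masks)) for c in range(cols)]
            for r in range(rows)]


def mask_to_box(mask):
    """Tightest (x1, y1, x2, y2) box around a mask: the object detection view."""
    ys = [r for r, row in enumerate(mask) if any(row)]
    xs = [c for row in mask for c, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs), max(ys))


boxes = [mask_to_box(m) for m in instance_masks]
```

Collapsing the instance masks with `union` recovers the semantic mask, and `mask_to_box` recovers a detection-style box, which shows that the instance masks carry strictly more information than either of the other two outputs.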

Region-Based Convolutional Neural Network (R-CNN)

R-CNN is widely used for object detection. It draws a bounding box around every object present in the given image. It works in two steps: a region proposal step and a classification step. The classification step consists of extracting a feature vector for each region and passing it to a set of linear SVMs.
Classifying every possible region of an image would be far too expensive, so selective search is used to extract only about 2000 regions from the image; these are known as region proposals. Instead of trying to classify a huge number of regions, we work with just these 2000. The selective search algorithm proceeds in the following steps:

1. Generate initial sub-segmentation (many candidate regions)

2. Use a greedy algorithm to recursively combine similar regions

3. Use generated regions to produce the final region proposals
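The greedy merging in step 2 can be illustrated with a toy version. This is only a sketch under strong assumptions: regions live on a 1D strip, similarity is just the difference in mean intensity, and the name `greedy_merge` is hypothetical. The real selective search works on 2D segments and combines several similarity measures.

```python
def greedy_merge(regions):
    """regions: list of (start, end, mean) tuples, left to right, end exclusive.

    Repeatedly merges the most similar pair of neighbouring regions and
    returns every region seen along the way as a (start, end) proposal.
    """
    regions = list(regions)
    proposals = [(s, e) for s, e, _ in regions]
    while len(regions) > 1:
        # Pick the adjacent pair whose mean intensities are closest.
        i = min(range(len(regions) - 1),
                key=lambda k: abs(regions[k][2] - regions[k + 1][2]))
        (s1, e1, m1), (s2, e2, m2) = regions[i], regions[i + 1]
        n1, n2 = e1 - s1, e2 - s2
        # The merged mean is the size-weighted average of the two means.
        merged = (s1, e2, (m1 * n1 + m2 * n2) / (n1 + n2))
        regions[i:i + 2] = [merged]
        proposals.append((merged[0], merged[1]))
    return proposals


# Two dark segments followed by two bright ones.
initial = [(0, 2, 10.0), (2, 4, 12.0), (4, 7, 200.0), (7, 8, 190.0)]
proposals = greedy_merge(initial)
```

Here the two dark segments merge first, then the two bright ones, and finally everything merges into one region. Every intermediate region is kept, which is why the procedure yields proposals at multiple scales.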

Each proposed region is then warped to a fixed size and fed into a convolutional neural network, which produces a 4096-dimensional feature vector as output. This feature vector is used as the input to the set of SVMs, which output a class label. The algorithm also predicts four offset values to increase the precision of the bounding box.
The main problem with R-CNN is that the CNN has to be run separately on each of the roughly 2000 proposals per image, so it takes a large amount of time to train and to run, and thus cannot be used for real-time problems.
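The four offset values mentioned above are commonly parameterized as a shift of the box centre plus a log-scale change of its width and height. Below is a minimal sketch assuming that parameterization; the function name `apply_offsets` and the (cx, cy, w, h) box format are choices made here for illustration.

```python
import math


def apply_offsets(box, offsets):
    """Refine a proposal with predicted offsets.

    box:     (cx, cy, w, h) centre, width, and height of the proposal
    offsets: (dx, dy, dw, dh) predicted by the regressor
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = offsets
    return (
        cx + w * dx,       # centre shift is relative to the box width
        cy + h * dy,       # and to the box height
        w * math.exp(dw),  # width and height are scaled in log space
        h * math.exp(dh),
    )


# Shift the box slightly right and up, and double its height.
refined = apply_offsets((50.0, 40.0, 20.0, 10.0), (0.1, -0.2, 0.0, math.log(2)))
```

Because the shifts are relative to the box size and the scaling is done in log space, the same offsets mean the same thing for small and large proposals.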

Fast R-CNN

Fast R-CNN also uses the selective search algorithm, but it addresses the slowness of R-CNN by sharing the computation of the convolutional layers between the region proposals. In this technique, the whole image is given as input to the CNN, which generates a convolutional feature map as output. The region proposals are then mapped onto this feature map and passed through an RoI pooling layer, which reshapes each of them to a fixed size. Fully connected (FC) layers map each fixed-size feature map to a feature vector. Finally, a softmax layer predicts the class of the proposed region, and a parallel regression layer predicts the offset values for the bounding box.

Figure: Fast R-CNN

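It is faster than R-CNN because the 2000 region proposals are not each fed through the convolutional neural network; the convolution is computed only once per image, and the resulting feature map is shared by all proposals.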
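The RoI pooling step described above can be sketched in a few lines. This is a simplified illustration on a single-channel feature map stored as plain Python lists; the function name `roi_max_pool` and the integer bin edges are assumptions made here, not the exact layer from any framework.

```python
def roi_max_pool(feature_map, roi, out_size):
    """Max-pool one region of a 2D feature map to a fixed output size.

    feature_map: 2D list of numbers
    roi:         (x1, y1, x2, y2) in feature-map cells, end-exclusive
    out_size:    (rows, cols) of the pooled output
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = out_size
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Integer bin edges; the max(...) keeps every bin at least one cell tall.
        ys = y1 + (i * roi_h) // out_h
        ye = max(y1 + ((i + 1) * roi_h) // out_h, ys + 1)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = max(x1 + ((j + 1) * roi_w) // out_w, xs + 1)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 6x6 feature map whose value at (row, col) is 6 * row + col.
feature_map = [[6 * r + c for c in range(6)] for r in range(6)]

small = roi_max_pool(feature_map, (0, 0, 5, 3), (2, 2))  # 5x3 region
large = roi_max_pool(feature_map, (1, 1, 6, 6), (2, 2))  # 5x5 region
```

Both regions come out as 2x2 grids even though they have different sizes on the feature map, which is what lets the fully connected layers that follow accept any proposal.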

Faster R-CNN

Faster R-CNN has an additional component after the last convolutional layer, known as the region proposal network (RPN).

Figure: Faster R-CNN

In this technique, the image is given as input to a CNN, which produces a convolutional feature map. Instead of running a selective search algorithm, an RPN operates on the feature map to predict the region proposals. The region proposals are then reshaped using an RoI pooling layer, and the pooled features are used to predict the class of each proposed region and the offset values for its bounding box.

Mask R-CNN

Mask R-CNN is an instance segmentation technique that identifies the pixels belonging to each object in the image, in addition to predicting a bounding box and class for it. It has two stages: generating region proposals, and then classifying the proposals and generating bounding boxes and masks. The masks come from an additional fully convolutional network on top of the CNN-based feature map: it takes the feature map as input and outputs a matrix with 1 at every location where the pixel belongs to the object and 0 elsewhere.

Figure: Mask R-CNN

Mask R-CNN consists of a backbone network, which is a standard CNN such as ResNet50 or ResNet101. The early layers of the network detect low-level features, and the later layers detect higher-level features. The image is converted from 1024x1024x3 (RGB) to a feature map of shape 32x32x2048. The Feature Pyramid Network (FPN) is an extension of the backbone network that better represents objects at multiple scales. It consists of two pyramids, where the second pyramid receives the high-level features from the first pyramid and passes them down to the lower layers. This gives every level access to both lower-level and higher-level features.
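The way the second pyramid passes high-level features down to the lower layers can be sketched as "upsample and add". This is a heavily simplified illustration: it uses single-channel maps stored as Python lists and leaves out the convolutions a real FPN applies at each level. The names `upsample2x` and `top_down` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2D list by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def top_down(bottom_up):
    """Merge feature maps from coarse to fine.

    bottom_up: maps ordered fine -> coarse, each half the size of the previous.
    Returns the merged maps in the same order.
    """
    merged = [bottom_up[-1]]  # the coarsest level is passed through unchanged
    for lateral in reversed(bottom_up[:-1]):
        up = upsample2x(merged[0])
        merged.insert(0, [[a + b for a, b in zip(r1, r2)]
                          for r1, r2 in zip(lateral, up)])
    return merged


# Three levels of the first pyramid, from fine (4x4) to coarse (1x1).
c3 = [[1] * 4 for _ in range(4)]
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]
p3, p4, p5 = top_down([c3, c4, c5])
```

In this toy example every cell of the finest output `p3` ends up containing a contribution from all three levels, which is the sense in which each level has access to both lower-level and higher-level features.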
Mask R-CNN also uses the Region Proposal Network (RPN), which scans all levels of the FPN from top to bottom and proposes regions that may contain objects. It uses anchors, a set of boxes with predefined locations and scales relative to the input image. During training, individual anchors are assigned to ground-truth classes and bounding boxes. The RPN generates two outputs for each anchor: an anchor class and a bounding box refinement. The anchor class is either foreground or background.
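Here is a rough sketch of how anchors can be laid out over a feature map and labelled as foreground or background by their overlap with a ground-truth box. The IoU thresholds of 0.7 and 0.3, the stride, and the function names are assumptions chosen for illustration; actual implementations differ in the details.

```python
def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return (x1, y1, x2, y2) anchors centred on every feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = width / height
                    w = scale * ratio ** 0.5
                    h = scale / ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def label_anchor(anchor, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """1 = foreground, 0 = background, -1 = ignored during training."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    return -1


# A 2x2 feature map with stride 16, one scale, and three aspect ratios.
anchors = make_anchors(2, 2, 16, scales=[16], ratios=[0.5, 1.0, 2.0])
gt_boxes = [(0.0, 0.0, 16.0, 16.0)]
labels = [label_anchor(a, gt_boxes) for a in anchors]
```

Anchors that overlap a ground-truth box only partially get the label -1 in this sketch, so they contribute neither as foreground nor as background.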
Another module that differs in Mask R-CNN is RoI pooling. The authors of Mask R-CNN found that the regions of the feature map selected by RoIPool were slightly misaligned with the corresponding regions of the original image, because RoIPool rounds fractional coordinates to whole feature-map cells. Since image segmentation requires precision at the pixel level, this leads to inaccuracies. The problem was solved with RoIAlign, in which the feature map is sampled at several points and bilinear interpolation is applied to estimate the value at, say, pixel 2.93 (which RoIPool would have treated as pixel 2).
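The pixel 2.93 example can be worked through directly. The sketch below shows only the bilinear interpolation step on a toy single-channel feature map; a full RoIAlign layer also chooses several sampling points per bin and aggregates them. The function name `bilinear_sample` and the feature-map values are made up for illustration.

```python
import math


def bilinear_sample(fmap, x, y):
    """Interpolate a 2D list at a fractional (x, y) location."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the right/bottom neighbours so samples on the edge stay in range.
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    y1 = min(y0 + 1, len(fmap) - 1)
    ax, ay = x - x0, y - y0  # fractional distances to the top-left neighbour
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom


# The value grows by 10 per column, so the "true" value at x = 2.93 is 29.3.
fmap = [[0.0, 10.0, 20.0, 30.0],
        [0.0, 10.0, 20.0, 30.0]]

aligned = bilinear_sample(fmap, 2.93, 0.0)  # RoIAlign-style lookup
quantized = fmap[0][int(2.93)]              # RoIPool-style lookup at pixel 2
```

Rounding down to pixel 2 reads the value 20.0, while interpolation recovers a value very close to 29.3. That gap is the misalignment that RoIAlign removes.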
A convolutional network then takes the regions selected by the RoI classifier and generates masks for them. The generated masks are low resolution: 28x28 pixels. During training, the ground-truth masks are scaled down to 28x28 to compute the loss, and during inference, the predicted masks are scaled up to the size of the RoI bounding box. This gives us the final mask for every object.
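The inference-time step of scaling a small predicted mask up to its bounding box can be sketched as follows. This is an illustration under assumptions: it uses nearest-neighbour resizing for brevity (smoother interpolation is common in practice), a 0.5 threshold to binarize the mask, and a hypothetical function name `paste_mask`.

```python
def paste_mask(mask_probs, box, image_h, image_w, threshold=0.5):
    """Resize a low-resolution mask to its box and paste it into the image.

    mask_probs: small 2D list of per-pixel probabilities (e.g. 28x28)
    box:        (x1, y1, x2, y2) in image pixels, end-exclusive
    Returns a full-size binary mask as a 2D list of 0s and 1s.
    """
    x1, y1, x2, y2 = box
    box_h, box_w = y2 - y1, x2 - x1
    m_h, m_w = len(mask_probs), len(mask_probs[0])
    full = [[0] * image_w for _ in range(image_h)]
    for y in range(box_h):
        for x in range(box_w):
            # Nearest-neighbour lookup into the low-resolution mask.
            src_y = min(y * m_h // box_h, m_h - 1)
            src_x = min(x * m_w // box_w, m_w - 1)
            if mask_probs[src_y][src_x] >= threshold:
                full[y1 + y][x1 + x] = 1
    return full


# A 2x2 stand-in for the 28x28 mask, pasted into a 4x4 box of a 6x6 image.
mask_probs = [[0.9, 0.2],
              [0.8, 0.6]]
full_mask = paste_mask(mask_probs, (1, 1, 5, 5), image_h=6, image_w=6)
```

The result is a binary matrix the size of the image, with 1 where the pixel belongs to the object and 0 elsewhere, matching the mask output described earlier.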



Hopefully, this article was a useful introduction to Mask R-CNN. Thanks for reading! If there are any other points or concepts that I should have covered, please add them in the comments below!