Introduction to U-Net and Res-Net for Image Segmentation

Aditi Mittal
4 min read · Jun 3, 2019


Human beings see images by capturing reflected light rays, which is a remarkably complex task. So how can machines be programmed to do something similar? A computer sees an image as a matrix of numbers, which must be processed to extract meaning from it.

Image segmentation partitions an image into segments, with each segment corresponding to a different entity. Convolutional Neural Networks perform well on simpler images but haven't given good results on complex ones. This is where architectures like U-Net and Res-Net come into play.

Background — Convolutional Neural Network (CNN)

CNNs are neural networks composed of neurons with learnable weights and biases. Each neuron receives a number of inputs, computes a weighted sum, applies an activation function, and produces an output. The network has a loss function that is used to minimize the error in the weights. A machine sees an image as a matrix of pixels with resolution h x w x d, where h is the height, w is the width, and d is the depth. d depends on the color scale: 3 for RGB and 1 for grayscale. In a CNN, the image is converted into a vector, which suits classification problems. But in U-Net, the image is converted into a vector and then the same mapping is used to convert it back into an image. This reduces distortion by preserving the original structure of the image.
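To make the "image as a matrix" idea concrete, here is a minimal sketch of a single 2-D convolution on a grayscale image (d = 1), using plain NumPy rather than any deep-learning framework; the image and kernel values are illustrative only:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (unpadded) 2-D convolution of a single-channel image with a square kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of the patch under the kernel, as each CNN "neuron" computes
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a toy 5x5 grayscale "image" (d = 1)
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (3, 3): a 5x5 input shrinks to 3x3 under a 3x3 unpadded kernel
```

Note how the unpadded convolution shrinks the spatial size; real CNN layers often pad the input so the output keeps the same height and width.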
A CNN is largely used when the whole image needs to be classified with a single class label. But many tasks require classifying each pixel of the image. This is what U-Net and Res-Net solve.

U-Net Architecture

U-Net is built from convolution, max pooling, ReLU activation, concatenation, and up-sampling layers, arranged in three sections: contraction, bottleneck, and expansion. The contraction section has 4 contraction blocks. Each block takes an input, applies two 3x3 convolution + ReLU layers, and then a 2x2 max pooling. The number of feature maps doubles after each pooling layer. The bottleneck uses two 3x3 convolution layers followed by a 2x2 up-convolution layer. The expansion section consists of several expansion blocks, each passing the input through two 3x3 convolution layers and a 2x2 up-sampling layer that halves the number of feature channels. Each block also concatenates the correspondingly cropped feature map from the contracting path. At the end, a 1x1 convolution layer makes the number of feature maps equal to the number of segments desired in the output. U-Net computes a loss for each pixel of the image, which helps in identifying individual cells within the segmentation map. Softmax is applied to each pixel, followed by the loss function. This turns the segmentation problem into a classification problem, where each pixel must be classified into one of the classes.
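The contraction/bottleneck/expansion shapes described above can be traced without any learned weights. This sketch only does shape bookkeeping, assuming "same"-padded 3x3 convolutions (so only pooling and up-convolution change the spatial size); the 256x256 input and 64 base channels are illustrative assumptions, not fixed by the architecture:

```python
def unet_feature_shapes(h, w, base_channels=64, depth=4):
    """Trace (channels, height, width) through a U-Net, assuming 'same'-padded
    3x3 convolutions so only pooling/up-sampling changes the spatial size."""
    down = []
    c = base_channels
    for _ in range(depth):                # contraction: conv-conv, then 2x2 pool
        down.append((c, h, w))            # feature map saved for the skip connection
        h, w, c = h // 2, w // 2, c * 2   # pooling halves H, W; channels double
    path = list(down) + [(c, h, w)]       # bottleneck
    for skip in reversed(down):           # expansion: up-conv, concat, conv-conv
        h, w, c = h * 2, w * 2, c // 2    # up-conv doubles H, W and halves channels
        path.append((c, h, w))            # after concat + convs, shape matches the skip
    return path

for shape in unet_feature_shapes(256, 256):
    print(shape)  # (64, 256, 256) down to (1024, 16, 16) and back up to (64, 256, 256)
```

The symmetric U shape is visible in the output: the expansion path mirrors the contraction path exactly, which is what makes the channel-wise concatenation possible.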

Residual Networks (Res-Net)

In traditional neural networks, more layers should mean a better network, but because of the vanishing gradient problem, the weights of the first layers won't be updated correctly through back-propagation. As the error gradient is back-propagated to earlier layers, repeated multiplication makes the gradient smaller and smaller. Thus, as more layers are added, performance saturates and then degrades rapidly. Res-Net solves this problem by using the identity function. When back-propagation passes through an identity function, the gradient is multiplied only by 1. This preserves the input and avoids any loss of information.

Res-Net Architecture

The network is built from 3x3 filters, down-sampling convolutional layers with stride 2, a global average pooling layer, and a 1000-way fully-connected layer with softmax at the end.
ResNet uses skip connections, in which the original input is added to the output of the convolution block. This helps solve the vanishing gradient problem by providing an alternative path for the gradient to flow through. The identity function also ensures that a higher layer performs at least as well as a lower layer, and not worse.
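Here is a minimal sketch of a residual block with fully-connected layers standing in for the convolutions (an illustrative simplification, the skip-connection logic is the same):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x): two weight layers plus the identity skip connection."""
    out = relu(x @ w1)    # first weight layer + activation
    out = out @ w2        # second weight layer (pre-activation)
    return relu(out + x)  # add the original input back, then activate

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))

# With zero weights, F(x) = 0 and the block collapses to ReLU(x): the skip
# connection alone carries the input through, so the block can never do
# worse than the identity mapping.
w_zero = np.zeros((8, 8))
print(np.allclose(residual_block(x, w_zero, w_zero), relu(x)))  # True
```

The zero-weight case illustrates the "not worse" guarantee: even an untrained or useless block still passes its input along.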

Residual Block

In traditional neural networks, each layer feeds into the next layer. But in a network with residual blocks, each layer feeds into the next layer and also directly into a layer some hops away.

Consider a neural network block whose input is x, and suppose we want to learn the true underlying mapping H(x). The residual between the output and the input can be denoted as:

R(x) = Output - Input = H(x) - x

The layers in a traditional network learn the true output H(x), whereas the layers in a residual network learn the residual R(x).
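A small numeric example of why the residual is easier to learn. Assuming a hypothetical target mapping H that is close to the identity (the 1.05 scale factor is purely illustrative), the residual the block must learn is near zero:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
H = lambda v: 1.05 * v  # hypothetical target mapping, close to the identity

R = H(x) - x            # residual the block must learn: R(x) = H(x) - x
print(R)                # small corrections near zero, not the full mapping

# The block's output recovers H(x) by adding the input back: H(x) = R(x) + x
print(np.allclose(R + x, H(x)))  # True
```

When H is near the identity, pushing R toward zero (e.g. by shrinking weights) is far easier for an optimizer than learning H from scratch.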

Hopefully, this article was a useful introduction to U-Nets and Res-Nets. Thanks for reading! If there are other points or concepts I should have covered, please add them in the comments below!