Image Segmentation: Attention-guided Chained Context Aggregation (CANet), an Overview

Solomon
5 min read · Jul 6, 2022


This article is a continuation of my previous article on Image Segmentation. Here I will go through the CANet paper, where the authors first describe existing techniques and their disadvantages prior to CANet. This article is a good starting point before reading the actual CANet paper: https://arxiv.org/abs/2002.12041.

(Figure source: https://arxiv.org/abs/2002.12041)

Issues in Fully Convolutional Networks

A typical convolutional network contains convolution and pooling layers; it captures large receptive fields and high-level context, but the repeated pooling layers downsample the feature maps, so the predicted object positions end up with vague boundaries.

For example, if an apartment complex contains 10 houses separated by walls, segmentation may return the entire complex as one block, with no individual house boundaries visible. This is called losing the “spatial detail”.

It also causes poor object delineation, meaning the object’s exact position is vague. It is also observed that such networks create fake regions (called spurious regions) that appear to be objects.

(Figure source: https://arxiv.org/abs/2002.12041)

Dilated convolution

The idea of dilated convolution is, instead of convolving the unmodified kernel with the input, to insert zeros between the kernel weights and then convolve. It is comparatively better than the unmodified kernel because the enlarged kernel can still capture the input data when the input matrix is sparse.

For instance, consider the below matrix:

If we had used the original kernel, the convolved output for the first 3 would be zero; but if the dilated kernel is used, since the kernel size has increased and its boundary now overlaps input data, it captures the data of the input matrix. In other words, the receptive field is now greater than with the original kernel.

The obvious issue here is that, since we are introducing zeros in between, wherever the input has data at positions aligned with the zeros of the dilated kernel, the convolved output suppresses the effect of those regions.

We can observe that the data marked in red is zeroed out, nullifying its effect on those regions, meaning the kernel does not capture neighborhood information: it covers only the non-zero positions, which produces a grid-like pattern as we stride over the input. This issue is called “gridding”.
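The zero-insertion step above can be sketched in a few lines of NumPy. This is a toy illustration (the helper name `dilate_kernel` is my own, not from the paper): it shows how a 3×3 kernel at dilation rate 2 becomes an effective 5×5 kernel that still has only 9 non-zero taps, which is exactly the sparse “grid” that causes gridding.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between adjacent kernel weights (dilation)."""
    k = kernel.shape[0]
    size = rate * (k - 1) + 1          # effective kernel size grows
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    dilated[::rate, ::rate] = kernel   # original weights land on a sparse grid
    return dilated

kernel = np.ones((3, 3))
dilated = dilate_kernel(kernel, rate=2)
print(dilated.shape)       # (5, 5): receptive field grows from 3x3 to 5x5
print(int(dilated.sum()))  # 9: still only 9 non-zero taps -> the 'grid' pattern
```

The positions between the non-zero taps (e.g. `dilated[0, 1]`) are exactly where neighboring input pixels get ignored.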

In segmentation, we need both high-level and low-level information. Capturing the high-level information is called the global context, and the low-level information the local context.

In this scenario (dilated convolution), we are losing the local context, as we are unable to capture the neighboring pixels accurately. One idea to remediate this issue is context modules.

Context Modules

A context module is a design that captures both global and local context.

There are many variations of context modules, e.g. the Adaptive Context Module, separate local and global module implementations, etc., but the generic idea is to use a combination of the following:

1) Conv Layers in Series and Parallel

2) Residual connections

3) Max-pooling layers

Typically these variations use the context modules in parallel with different stride values. For example, below is one such implementation, from “Adaptive Pyramid Context Network for Semantic Segmentation”.

(Figure source: https://ieeexplore.ieee.org/document/8954288)
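The generic recipe above (parallel branches at different scales, fused with a residual connection) can be sketched in NumPy. This is my own simplification, not the APCNet implementation: average pooling at several scales stands in for the learned convolution branches, and the function names are illustrative.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a square map with window/stride k, then nearest-upsample back.
    Assumes the map size is divisible by k."""
    h, w = x.shape
    pooled = x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.kron(pooled, np.ones((k, k)))

def context_module(x, scales=(1, 2, 4)):
    """Generic context module: parallel branches at different scales,
    fused together, plus a residual connection."""
    branches = [avg_pool(x, s) for s in scales]  # parallel branches
    fused = np.mean(branches, axis=0)            # fuse the parallel outputs
    return x + fused                             # residual connection keeps local detail
```

Note the drawback the article describes next: the branches never interact with each other; each one pools at its fixed scale regardless of the input content.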

It is observed that, because each parallel block works on its own stride value with no interaction between the blocks, the extracted features follow the same fixed pattern regardless of the input. Since each image possesses different features, extracting only specific regions in this fixed pattern deteriorates performance.

Stacked Encoder-Decoder

A stacked encoder-decoder is one method used to capture the exact location of an object, i.e. it captures localisation accurately.

The design is such that the convolutional networks form an encoder-decoder-like structure.

(Figure source: https://arxiv.org/pdf/1603.06937.pdf)

The above structure is repeated and stacked together, as shown below:

(Figure source: https://arxiv.org/pdf/1603.06937.pdf)

This can be thought of as context modules in series; obviously, each layer depends on the previous one. Since this form is very deep, training becomes slower and more complex.

It is also noted that such a series of blocks reduces the ability to learn general features (referred to as lacking feature diversity).
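A minimal NumPy sketch of one hourglass block and its stacking, assuming average-pool/nearest-upsample as toy stand-ins for the learned encoder and decoder convolutions (the names are illustrative, not from the Stacked Hourglass paper):

```python
import numpy as np

def hourglass(x):
    """One encoder-decoder block: 2x downsample, 2x upsample, skip connection.
    Assumes an even-sized square map."""
    h, w = x.shape
    down = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # encoder: downsample
    up = np.kron(down, np.ones((2, 2)))                       # decoder: upsample
    return x + up                                             # skip connection

def stacked_hourglass(x, n=3):
    """Stack hourglass blocks in series: each refines the previous output,
    so the chain gets deep quickly (the slow-training drawback above)."""
    for _ in range(n):
        x = hourglass(x)
    return x
```

Because each block consumes the previous block's output, none of the stages can run in parallel, which is exactly the in-series dependency the article points out.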

Chained Context Aggregation Module (CAM)

To resolve all the above issues, the CAM module is proposed in this paper. It contains both series and parallel connections. Two components make up the module:

a) Global Flow (GF) — To capture high level info

b) Context Flow (CF) — To capture low level info

Serial GF and CF

  • Helps to increase the receptive field so as to get localized information

Parallel GF and CFs

  • Helps to capture context at different spatial scales to get accurate feature maps
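A toy NumPy sketch of the serial-plus-parallel idea, where pooling stands in for the learned GF/CF blocks; the real CAM uses learned convolutional flows and attention, so treat this only as a shape-level illustration of the chaining:

```python
import numpy as np

def global_flow(x):
    """Global Flow (GF): global average pooling, broadcast back over the map."""
    return np.full_like(x, x.mean())

def context_flow(x, k):
    """Context Flow (CF): pool at rate k, then nearest-upsample back
    (captures context at one spatial scale). Assumes size divisible by k."""
    h, w = x.shape
    pooled = x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.kron(pooled, np.ones((k, k)))

def cam(x, rates=(2, 4)):
    """Chained aggregation: each CF receives the input plus the previous
    flow's output (the serial chain), and all flow outputs are fused
    in parallel at the end."""
    flows = [global_flow(x)]
    for k in rates:
        flows.append(context_flow(x + flows[-1], k))  # chained: CF sees prior flow
    return np.mean(flows, axis=0)                     # parallel fusion
```

Unlike the purely parallel context modules criticized earlier, here the flows interact: each Context Flow is conditioned on the previous flow's output, which is the “chained” part of the name.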

Chained Context Aggregation Network (CANet)

Based on all the ideas above, CANet is proposed for semantic image segmentation.

The architecture, advantages, etc. of CANet will be discussed in the next post. Any corrections/suggestions are welcome!

References:

Attention-guided Chained Context Aggregation for Semantic Segmentation

Understanding Convolution for Semantic Segmentation

Adaptive Pyramid Context Network for Semantic Segmentation

Stacked Hourglass Networks for Human Pose Estimation
