Spatial Path & Context Path
The recipe behind the success of BiSeNet (Bilateral Segmentation Network).
Bilateral Segmentation Network (BiSeNet) is a novel approach to real-time semantic segmentation that relies on two main components:
- Spatial Path
- Context Path
Semantic segmentation is a fundamental task in computer vision. Its main goal is to assign a semantic label to each pixel in an image.
The applications of this technology are very broad. It can be applied to Augmented Reality (AR), autonomous driving, robotics, and video surveillance. These applications, however, demand fast inference so that the system can interact or respond quickly.
Note: I assume you are familiar with Convolutional Neural Networks (CNNs).
So what are the Spatial Path and the Context Path, and how do they affect the performance of a real-time semantic segmentation system?
These two components are designed to tackle two problems. The first is the loss of spatial information (the high-resolution features in an image or video), which is crucial for predicting detailed output. The second is the shrinkage of the receptive field (the region in the input space that a particular CNN feature is looking at), which needs to be large enough to cover large objects and gives the network rich discriminative ability.
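To make the idea of a receptive field concrete, here is a minimal sketch that computes it for a plain stack of convolutions. The function name and the two example stacks of 3x3 convolutions are my own assumptions for illustration, not something taken from BiSeNet itself.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel only sees itself
    jump = 1  # distance, in input pixels, between neighbouring feature positions
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view by (k - 1) jumps
        jump *= stride                  # striding spreads feature positions apart
    return rf


# Hypothetical stacks, chosen only for illustration.
three_convs_stride_1 = [(3, 1)] * 3
three_convs_stride_2 = [(3, 2)] * 3

print(receptive_field(three_convs_stride_1))  # 7
print(receptive_field(three_convs_stride_2))  # 15
```

Even with downsampling, a shallow stack only sees a small window of the input, which is why a network needs some other mechanism to cover large objects.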
With that said, let's get into it. Shall we?!
What is Spatial Path?
Spatial Path is a method proposed to preserve the spatial detail of the original input image and encode rich spatial information (features). The Spatial Path is made of three convolution layers. Each convolution has a stride of 2 and is followed by batch normalization and ReLU (Rectified Linear Unit).
This path outputs feature maps that are 1/8 (12.5%) of the original image size along each dimension. It encodes rich spatial information because of the large spatial size of the feature maps.
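The 1/8 figure follows directly from stacking three stride-2 convolutions. The sketch below works through the arithmetic with the standard convolution output-size formula. The 3x3 kernel, padding of 1, and the 1024x2048 input are assumptions I picked for illustration; only the stride of 2 and the three layers come from the description above.

```python
def conv_output_size(size, kernel_size=3, stride=2, padding=1):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def spatial_path_output_size(height, width, num_layers=3):
    """Apply `num_layers` stride-2 convolutions and return the final (height, width)."""
    for _ in range(num_layers):
        height = conv_output_size(height)
        width = conv_output_size(width)
    return height, width


# Hypothetical input resolution, used only as an example.
print(spatial_path_output_size(1024, 2048))  # (128, 256)
```

Each layer halves the height and the width, so three layers give 1/2 x 1/2 x 1/2 = 1/8 per side. Note that this means the feature map has 1/64 as many spatial positions as the input has pixels.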
What is Context Path?
Context Path is designed to work hand in hand with the Spatial Path by providing a sufficient receptive field. In semantic segmentation, the receptive field matters a great deal for performance. To enlarge the receptive field, there are several approaches, such as:
- Pyramid pooling module
- Atrous spatial pyramid pooling, “large kernel”, etc.
But all of the above are computationally demanding and memory consuming, which results in low speed.
Given the need for both a large receptive field and efficient computation, the Context Path is a good fit for the task.
It uses a lightweight model and global average pooling to provide a large receptive field. A good example of a lightweight model is Xception, a deep convolutional neural network architecture inspired by Inception, in which the Inception modules are replaced with depthwise separable convolutions. Xception can downsample the feature map quickly to obtain a large receptive field, which encodes high-level semantic context information. A global average pooling layer is then added at the tail of the lightweight model. It provides the maximum receptive field with global context information, so the network can capture information from the whole input image or video frame regardless of the input size.
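Global average pooling is simple enough to sketch in a few lines of plain Python. The example below is only an illustration: the nested-list layout (channels, then rows, then values) and the toy feature maps are my own assumptions, and a real model would use a deep learning framework's pooling layer instead.

```python
def global_average_pool(feature_map):
    """Average every channel over all of its spatial positions.

    `feature_map` is a list of channels; each channel is a list of rows of floats.
    Returns one value per channel.
    """
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


# Two toy feature maps with 2 channels each but different spatial sizes.
small = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]           # 2 x 2
large = [[[1.0] * 6 for _ in range(4)], [[2.0] * 6 for _ in range(4)]]  # 4 x 6

print(global_average_pool(small))  # [2.5, 2.0]
print(global_average_pool(large))  # [1.0, 2.0]
```

Both inputs produce one value per channel, and each value depends on every spatial position of that channel. That is why the pooled output sees the whole input regardless of its size.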
Thank you very much for reading this post. If you like it, give it a clap.
Please comment below with what you think. If you have any suggestions for improvement, doubts, or spot an error, I am open to them and glad to answer you.
Sources: