(PersonLab) Single-shot fully-convolutional architecture Part II

Picking up where we left off, we are going to discuss startup ideas, look at the major benefits this architecture brings, and understand why it may be worth using.

A Single Shot

We are curious and creative beings right out of the box, and we keep moving forward, opening new doors to a whole new world of possibilities. As I grow older I choose curiosity over fear, not because I don’t feel fear, but because despite its presence I let my curiosity subdue it.

“Science means constantly walking a tightrope between blind faith and curiosity; between expertise and creativity; between bias and openness; between experience and epiphany; between ambition and passion; and between arrogance and conviction — in short, between an old today and a new tomorrow.” — Heinrich Rohrer

Pose Estimation

Very interesting, look at you gorgeous !!!

Human pose estimation has recently become an area of interest in AI, and significant progress has been made on both single-person and multi-person pose estimation. This progress has been facilitated by deep-learning-based architectures.

“The important thing is not to stop questioning. Curiosity has its own reason for existing.” — Albert Einstein

Why use graphical models? Graphs are an intuitive way of representing and visualizing the relationships between many variables (in our case, body parts).

Following this work, many methods have been proposed to develop tractable inference algorithms for solving the problem of capturing rich dependencies among body parts.

In this paper, the authors use a bottom-up approach for grouping part detections into person instances.

Instance Segmentation

In brief, instance segmentation is the problem of detecting and delineating each distinct object of interest appearing in an image. Current instance segmentation approaches consist of ensembles of modules that are trained independently of each other, thus missing opportunities for joint learning.

“Research is formalized curiosity. It is poking and prying with a purpose.” — Zora Neale Hurston

Why merge Pose Estimation and Instance segmentation?

This is a novel approach, and it is exciting because of its wide range of computer vision applications. Used well, this combination could take the following areas to the next level:

  • Photo editing (i.e. the famous Bokeh effect)
  • Person and activity recognition
  • Virtual or augmented reality
  • Robotics
  • Autonomous driving
  • Image captioning and visual question answering.

We are all aware that computer vision is currently one of the hottest topics in AI.

If you haven’t seen Part I, it will take you just 5 minutes to get up to speed and come back. Please stop what you are doing and go check it out.

Going Back…

Picking up where we left off

The computational cost of this fully convolutional system is largely reduced: it is essentially independent of the number of people present in the scene and depends mainly on the cost of the CNN feature extraction backbone.

Great news: fewer Nvidia GPUs are needed, because this is a cheaper model to train and maintain.

Extracting the features for key-points and instance segmentation from an image in a single fully convolutional pass contributes a lot to the lower computational cost. In other words, one CNN feature extraction backbone gives us all the features needed for both tasks, pose estimation and instance segmentation.
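To make the "one backbone, several outputs" idea concrete, here is a minimal NumPy sketch. It is not the PersonLab network: the `backbone` function is a toy stand-in (average pooling plus a random projection), and the head names, the channel counts and `NUM_EDGES` are assumptions chosen only for illustration. The point is that the features are computed once and every output is a dense map whose size depends on the image, not on how many people are in it.

```python
import numpy as np

NUM_KEYPOINTS = 17  # COCO person key-points
NUM_EDGES = 16      # hypothetical number of skeleton links

def backbone(image, channels=32, stride=8, seed=0):
    """Toy stand-in for a CNN backbone: (H, W, 3) -> (H/stride, W/stride, channels)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[0] // stride, image.shape[1] // stride
    # Average-pool each stride x stride patch, then project to `channels`.
    patches = image[:h * stride, :w * stride].reshape(h, stride, w, stride, -1)
    pooled = patches.mean(axis=(1, 3))
    return pooled @ rng.standard_normal((pooled.shape[-1], channels))

def head(features, out_channels, seed):
    """A 1x1 convolution is just a per-pixel matrix multiply."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((features.shape[-1], out_channels))
    return features @ weights

def predict(image):
    features = backbone(image)  # computed once and shared by every head
    return {
        "keypoint_heatmaps": head(features, NUM_KEYPOINTS, seed=1),
        "pairwise_displacements": head(features, 2 * NUM_EDGES, seed=2),  # (dx, dy) per link
        "person_mask": head(features, 1, seed=3),
    }

outputs = predict(np.zeros((64, 96, 3)))
for name, tensor in outputs.items():
    print(name, tensor.shape)
```

Whether the image contains one person or twenty, the same three maps come out with the same shapes, which is why the cost is dominated by the backbone.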

First, the model predicts all key-points for every person in the image; it also learns to predict the relative displacement between each pair of key-points. Once the key-points are localized, a greedy decoding process groups them into person instances.

Note: more on the greedy decoding process, and other details, in the upcoming article.

The approach starts from the most confident detection, as opposed to always starting from a distinguished landmark such as the nose, so it works well even in clutter.
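The grouping idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the `displacement` callable stands in for the network's predicted pairwise displacements, the `edges` layout and the `radius` threshold are hypothetical, and snapping to the nearest candidate is an assumption made to keep the sketch short.

```python
import heapq
import math

def greedy_decode(candidates, edges, displacement, radius=10.0):
    """Group key-point candidates into person instances.

    candidates:   list of (score, part, (x, y)) detections.
    edges:        list of (part_a, part_b) skeleton links (hypothetical layout).
    displacement: callable (src_part, dst_part, (x, y)) -> (dx, dy).
    """
    # Undirected adjacency list of the skeleton.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    # heapq is a min-heap, so negate scores to pop the most confident first.
    queue = [(-score, part, pos) for score, part, pos in candidates]
    heapq.heapify(queue)

    instances = []
    while queue:
        _, part, pos = heapq.heappop(queue)
        # Skip detections already explained by an existing instance.
        if any(part in inst and math.dist(inst[part], pos) <= radius
               for inst in instances):
            continue
        # Seed a new person here and walk the skeleton outward.
        inst = {part: pos}
        stack = [part]
        while stack:
            src = stack.pop()
            for dst in neighbors.get(src, []):
                if dst in inst:
                    continue
                dx, dy = displacement(src, dst, inst[src])
                guess = (inst[src][0] + dx, inst[src][1] + dy)
                # Simplification: snap to the nearest detected candidate, if any.
                near = [p for _, c_part, p in candidates
                        if c_part == dst and math.dist(p, guess) <= radius]
                inst[dst] = min(near, key=lambda p: math.dist(p, guess)) if near else guess
                stack.append(dst)
        instances.append(inst)
    return instances
```

Because the seed is whichever detection has the highest score, a person whose nose is hidden can still be started from, say, a clearly visible shoulder.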

The model is trained using the standard COCO keypoint dataset, which annotates multiple people with 12 body and 5 facial key-points. This architecture outperforms the previous state-of-the-art bottom-up approach on keypoint localization, Associative Embedding: End-to-End Learning for Joint Detection and Grouping. In addition, this is the first bottom-up method to report competitive results on the person class for COCO instance segmentation.
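For reference, the 17 COCO key-points split into the 5 facial and 12 body points mentioned above. The snippet below lists them in COCO order and adds a small helper, `parse_keypoints` (a hypothetical name, not part of any library), that reads the flat `[x, y, v, ...]` list stored in a COCO person annotation, where `v = 0` means the point is not labeled.

```python
FACIAL_KEYPOINTS = ["nose", "left_eye", "right_eye", "left_ear", "right_ear"]
BODY_KEYPOINTS = [
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
COCO_KEYPOINTS = FACIAL_KEYPOINTS + BODY_KEYPOINTS  # 5 + 12 = 17

def parse_keypoints(flat):
    """Turn a flat COCO list [x1, y1, v1, x2, y2, v2, ...] into {name: (x, y)}.

    Key-points with visibility flag v == 0 (not labeled) are dropped.
    """
    labeled = {}
    for i, name in enumerate(COCO_KEYPOINTS):
        x, y, v = flat[3 * i: 3 * i + 3]
        if v > 0:
            labeled[name] = (x, y)
    return labeled
```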

Furthermore, this method is much simpler and hence faster, since it does not require any second-stage box-based refinement or clustering algorithm. This makes it suitable for deployment on mobile phones.

With such fantastic research and tools at your disposal, your imagination is the limit.

“Don’t reinvent the wheel, just realign it.” — Anthony J. D’Angelo

Name one problem affecting humanity that you personally could solve using technology. How would you do it?

Please leave the answer in the comments section below.


Stay tuned: Part III is going to be amazing, because we are going to go even deeper and understand the math and numerical tricks used to outperform the previous state-of-the-art model while keeping the computational cost low.

Thank you for reading. If you have any thoughts, comments or critiques, please leave them in the comments below.

If this article was useful or insightful in some way for you, please give me a round of applause 👏👏 👏(+50) and share it with your friends.

Follow me if you want to join me on this adventure in the AI jungle. :D

Computer Engineering Student, Web Dev. & AI/ML dev