Beyond Neural Networks
In this article, we dive deeper into deep learning architectures for computer vision, natural language processing and audio processing.
Neural networks (NN) have given machine learning products and services superpowers that look like they come straight out of a sci-fi movie, don’t you think?
They are a vital component of machine learning, used to this very day by most state-of-the-art computer vision architectures that win competitions such as the ImageNet challenge, known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC for short), and many others.
Computer vision is a scientific field that aims to enable a computer to see and understand the world from a visual perspective: identifying, localizing and tracking objects, among other tasks.
ImageNet is an image database of over 14 million images. An NN alone cannot learn from such a ginormous amount of data, specifically because images are not 2D (width x height) as we perceive them on our gadgets: there is a 3rd dimension where the colour channels reside, namely Red, Green and Blue (RGB). These channels are basically matrices of values that range from 0–255 and, when overlapped, represent a coloured image (width x height x channels). There are cases where we have only 1 channel, which represents a black-and-white image.
When we train our algorithm, we normally add one more dimension to the mix, called the batch dimension, which holds the number of samples (images) from the training set you are training with, thus making it a 4D tensor (samples, width, height, channels).
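To make this concrete, here is a quick NumPy sketch. The sizes are purely illustrative (a 224x224 RGB image and a batch of 32 are example values, not anything the article prescribes):

```python
import numpy as np

# A single 224x224 RGB image: (width, height, channels)
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(image.shape)   # (224, 224, 3)

# Stacking 32 such images adds the batch dimension:
# (samples, width, height, channels)
batch = np.stack([image] * 32)
print(batch.shape)   # (32, 224, 224, 3)
```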
When getting started in machine learning (ML), you might have heard, or will hear, a lot about tensors. Tensors are at the beating heart of every deep learning library out there; they allow us to do mathematical and numerical operations efficiently. We represent images as tensors when training, testing or inferring from an ML model.
When we speak of tensors, we are generally using the convention that an array with N ≥ 3 dimensions is considered a tensor.
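You can check the number of dimensions of any NumPy array via its `ndim` attribute, which makes the convention above easy to see:

```python
import numpy as np

scalar = np.array(5)              # 0-D: a single number
vector = np.array([1, 2, 3])      # 1-D
matrix = np.ones((2, 3))          # 2-D
tensor = np.ones((2, 3, 4))       # 3-D and up: a tensor by this convention
print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)  # 0 1 2 3
```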
Note: From this point forward I assume you read my previous article, if not please go check it out.
Going beyond Neural Networks
Now that we understand the makeup of an image and what a tensor is, we can dig deeper and understand why we can’t use simple fully connected neural networks alone for computer vision.
First, computer vision and many other fields have problems that require what I call ‘algorithmic teamwork’, where we merge two or more algorithms to solve a given problem.
NNs work best with feature vectors (patterns in an image, like shapes) for the task of classification, and the best way to get these feature vectors is to use an algorithm that can extract the features for us automatically instead of engineering them by hand, which would be a very tedious job for any human, especially for large images.
For this, there is a more complex variation of the NN called the Convolutional Neural Network (CNN). Given enough depth of stacked layers, this algorithm can learn representations in images, video and audio automatically. CNNs allow us to take groups of pixels from an image (normally a 3x3 matrix called a kernel) and compute a dot product to form another tensor, of either the same size or smaller, called a feature map. This is commonly called the sliding window algorithm because it repeats this dot product operation in groups across the entire image.
This kernel is a 3x3 matrix of values that can be adjusted to extract a particular type of feature, such as edges or the outline of a cat.
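Here is a minimal sketch of the sliding window idea in plain NumPy (the `convolve2d` helper and the vertical-edge kernel are illustrative, not part of any library):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a 2-D image and take the dot product
    at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise multiply the window by the kernel, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
result = convolve2d(image, edge_kernel)
print(result.shape)  # (3, 3): a 5x5 image shrinks under a 3x3 kernel
```

Note how the output is smaller than the input: with no padding, a 3x3 kernel on a 5x5 image leaves only 3x3 valid positions.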
Bear with me, all will become clear!
As you can see, the tensor in blue is our image and the green tensor represents the kernel.
With the power of CNNs we can extract the features from any image and feed them to our NN for classification. This is the secret sauce that AI researchers have been using for many years now, and it only truly became famous after 2012, when AlexNet won the ILSVRC 2012 challenge.
Inside an architecture such as AlexNet, we have convolution blocks and pooling blocks, and at the end we flatten the final feature-map tensor into a vector of shape (features, 1), which then feeds into our fully connected neural network for classification.
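As a rough sketch of that block structure in Keras (the layer counts and sizes here are illustrative choices for a small 28x28 grayscale input, not AlexNet’s actual configuration):

```python
# Convolution blocks + pooling blocks, then flatten into a
# fully connected classifier -- the pattern described above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(28, 28, 1)),     # convolution block
    layers.MaxPooling2D((2, 2)),                # pooling block
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                           # roll feature maps into a vector
    layers.Dense(64, activation="relu"),        # fully connected NN
    layers.Dense(10, activation="softmax"),     # e.g. 10 output classes
])
model.summary()
```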
A pooling block is basically a way of further reducing the size of the features (image) without losing much information. Max pooling extracts the maximum value of each window, per channel. It is conceptually similar to convolution, except that instead of transforming groups of values across all channels via a learned linear transformation (the convolutional kernel), they are transformed using a hard-coded max operation on each channel.
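A minimal NumPy sketch of 2x2 max pooling (the `max_pool2d` helper is illustrative, written for a single (height, width, channels) tensor):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling on a (h, w, c) tensor:
    keep the maximum of each size x size window, per channel."""
    h, w, c = x.shape
    x = x[:h - h % size, :w - w % size]          # trim so windows tile exactly
    h, w = x.shape[:2]
    windows = x.reshape(h // size, size, w // size, size, c)
    return windows.max(axis=(1, 3))              # max of each window, per channel

features = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
out = max_pool2d(features)
print(out.shape)  # (2, 2, 2): height and width halved, channels preserved
```

Halving height and width keeps the strongest activation in each neighbourhood, which is why pooling shrinks the tensor without losing much information.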
CNNs are very powerful: they have delivered superhuman-level accuracy to AI algorithms, catapulting the field into the top 3 trending technologies for many consecutive years, till today. Not only that, they also created a lot of interest from investors and entrepreneurs in experimenting with new businesses and investments, which grew the AI market, now expected to grow from USD 21.46 billion in 2018 to USD 190.61 billion by 2025, according to the MarketsAndMarkets website.
I will release the code for this article in a few days and will update this article.
There are newer and better techniques which I will be covering in the coming articles.
3 Key takeaways:
- NNs alone cannot solve computer vision problems.
- CNNs are a combination of convolutional blocks and pooling blocks, followed by a fully connected NN at the end for classification.
- CNNs extract the necessary features for any computer vision, audio processing or natural language processing task, which can then be used for classification, for example.
CHALLENGE
Your challenge of the week is to implement a CNN using the Keras deep learning library, use it on the MNIST dataset, and Tweet me your GitHub profile: https://twitter.com/CanumaGdt
Thank you for reading. If you have any thoughts, comments or criticism, please comment down below.
Follow me on Twitter at Prince Canuma, so you can always be up to date with the AI field.
If you like it and relate to it, please give me a round of applause 👏👏 👏(+50) and share it with your friends.