Goals of Data Analysis and Model Selection
In this article, we are going to dissect these two topics and get to the core of each.
First and foremost, Merry Christmas and a prosperous new year.
Since AlphaGo, an AI system developed by Google's DeepMind to play the board game Go, beat the world Go champion, and IBM Watson, a question-answering system built on IBM's DeepQA technology and capable of answering questions posed in natural language, won Jeopardy!, the world has changed forever.
We changed our view of data and of its mining, analysis and potential uses. I believe data has become the new oil, and as Andrew Ng says, 'AI has become the new electricity'; he is one of the godfathers of AI and an amazing academic.
Why did this happen?
In 2010, deep learning became feasible: we started to have wide availability of powerful, cheap GPUs capable of parallel processing, plus practically unlimited storage for every kind of data, including images and video. This led to a wave of experiments and research, one of which culminated in IBM Watson's victory in 2011; five years later, in 2016, AlphaGo beat Go champion Lee Sedol in a five-game match.
So what made these victories, and many other breakthroughs, possible?
Many factors contribute to such achievements in AI, yet beyond hardware evolution and research, I think it boils down to two main topics: Data Analysis and Model Selection techniques.
Bear with me: what fuels most, if not all, state-of-the-art AI systems is data, and we have created a colossal amount of it in the last 6–8 years. According to tech giant IBM, 2.5 exabytes (that is, 2.5 billion gigabytes) of data were generated every day in 2012. That's big by anyone's standards.
“About 75% of data is unstructured, coming from sources such as text, voice and video”. — Miles
Now, following that train of thought, we know it is nearly impossible for humans to learn from and structure that much data efficiently without tools; it would take us hundreds or even thousands of years to come up with useful insights. The same goes for many classical search and sorting algorithms, which would grind a typical machine to a halt if asked, for example, to sort a 10 GB word file in alphabetical order.
What do we do?
We create intelligent systems (models) that work tirelessly day and night, all year round, to do most of the hard work of learning from that data and bring us results for further analysis by a human.
Goals of Data Analysis and Model Selection
There are two main objectives in learning from data.
- One is scientific discovery: interpreting the nature of the data, how it groups, and the process that generated it. For example, a scientist may use data not only to support but also to develop a physical model, to discover a new chemical compound, or to identify genes that promote the onset of a disease.
- Another objective of learning from data is predictive power. For example, predicting a future observation from past data with high accuracy.
In tune with the two different objectives above, model selection can also have two directions.
- Model selection for inference
Intended to identify the best model for the data.
- Model selection for prediction
Intended to identify the model that offers the best predictive performance.
In many applications, prediction accuracy is the dominating consideration. The major difference between the two directions is that, in statistical learning, prediction-oriented selection carries the (stronger) requirement that the selected model exhibit a prediction loss comparable to the best offered by the candidates. In other words, the prediction loss of the model you select should stay close to that of the candidate with the lowest prediction loss (the best model) as the number of features grows to infinity.
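To make the prediction-oriented view concrete, here is a minimal sketch, assuming Python with scikit-learn and a synthetic dataset (nothing here comes from the article itself), that estimates each candidate's prediction loss with cross-validation and keeps the one with the lowest error:

```python
# Minimal sketch: model selection for prediction via cross-validated loss.
# Assumes scikit-learn; the dataset is synthetic and purely illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Estimate each candidate's prediction loss (mean squared error) with 5-fold CV.
losses = {}
for name, model in candidates.items():
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    losses[name] = mse
    print(f"{name}: cross-validated MSE = {mse:.2f}")

best = min(losses, key=losses.get)
print("Selected model (lowest estimated prediction loss):", best)
```

Model selection for inference would instead weigh interpretability and how faithfully the model reflects the data-generating process, not just the cross-validated loss.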
Feature: features are the attributes that describe the data our system is trying to learn from. If we want to predict house prices, some example features would be the number of bedrooms, the size of the house, the year it was built, and so on.
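To illustrate that definition, the hypothetical snippet below (assuming pandas; the column names and values are made up) lays those house features out as columns of a table, the shape most models expect:

```python
# Hypothetical feature table for house-price prediction (values are invented).
import pandas as pd

houses = pd.DataFrame({
    "bedrooms":   [2, 3, 4],          # number of bedrooms in the house
    "size_sqm":   [70, 120, 200],     # size of the house in square metres
    "year_built": [1995, 2008, 2016], # year the house was built
})
prices = pd.Series([150_000, 260_000, 410_000], name="price")  # target values

print(houses)  # each column is a feature, each row is one house
```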
Further Research
If you want to learn more and dive into the specifics (metrics, techniques and the maths behind them), check out this awesome post at Neptune.ai (https://bit.ly/3lSfQM2).
Points to remember…
To sum up, the goal of data analysis and model selection is to efficiently build a system for:
- Processing data, visualization and feature extraction (manual or automated)
- Choosing the best model based on accuracy or performance, and sometimes a good balance between both, as sketched below.
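As one way to picture that workflow end to end, here is a hedged sketch, again assuming scikit-learn and synthetic data (the steps and parameters are illustrative, not a prescription), that chains data processing and a model into a single pipeline which can then be scored like any other candidate:

```python
# Illustrative sketch: data processing and a model bundled into one pipeline.
# Assumes scikit-learn; the data is synthetic and the steps are examples only.
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=1)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # handle missing values
    ("scale", StandardScaler()),                 # put features on a common scale
    ("model", Ridge(alpha=1.0)),                 # the candidate model itself
])

# The whole pipeline is evaluated as one unit, so preprocessing choices are
# judged together with the model, balancing accuracy against complexity.
score = cross_val_score(pipeline, X, y, scoring="r2", cv=5).mean()
print(f"Pipeline mean R^2 across folds: {score:.3f}")
```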
Thank you for reading. If you have any thoughts, comments or criticism, please comment down below.
If you like it and relate to it, please give me a round of applause 👏👏 👏 and share it with your friends.
Follow me if you want to be at the forefront of this revolution, witness outstanding growth and change how you view AI.