Classifying Breast Cancer Samples

Prince Canuma
Feb 3, 2019

In this article, we are going to explore a real-world problem database and see whether we can use AI to help automate and solve the problem.

Life, as we know, is about pushing the boundaries of what is considered possible towards what the masses deem impossible. Without a keen imagination, a burning desire and the willpower to improve, mankind would not have dominion over every other species on earth, because our God-given gifts of intelligence and emotion would have gone to waste.

“There are no great limits to growth because there are no limits of human intelligence, imagination, and wonder.” — Ronald Reagan

Why did I start with such a paragraph?

It is for the sole purpose of driving innovation, and perhaps kindling the fire burning inside you, so you can create the next wave of improvements by solving problems that affect mankind. AI is currently one of the technologies helping us solve real-life problems, most of which are out in the physical world and have a direct impact on the way we live and on our quality of life.

Disclaimer: If you are a total beginner, please read this article (How to develop your AI intuition) before you read this one. This is a technical article.

Acknowledgements: I would like to thank Mrs. Poonam (my Statistics Faculty) and Mrs. Neeta (my Python Faculty), who gave me great ideas and helped me improve my project.

In collaboration with them, I managed to beat the state of the art and improve my code on the Wisconsin Breast Cancer Database, reaching 97.14% (training set) and 95.7% (test set) using the entire database of 699 points, thus beating the previous highest accuracy of 95.9% (test set) obtained using only 369 points. More on this in the later sections.

Wisconsin Breast Cancer Database

This year I set out to try and tackle real-life problems using AI. In that quest I searched for a dataset in key areas and found the Wisconsin Breast Cancer Database.

“Intelligence without ambition is a bird without wings.” — Salvador Dali

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg on January 8, 1991.[1]

Although quite old, it is still a good place to start, put my skills into practice and see what results I could yield.

Sources:
-- Dr. William H. Wolberg (physician)
University of Wisconsin Hospitals
Madison, Wisconsin
USA
-- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
Received by David W. Aha (aha@cs.jhu.edu)

This database has 10 attributes plus a class attribute with 2 possible classes:

  • Malignant
  • Benign

Dr. William H. Wolberg analyzed samples periodically; the database therefore reflects a chronological grouping of the data.

This grouping is as follows:

Group 1: 367 instances (January 1989)
Group 2: 70 instances (October 1989)
Group 3: 31 instances (February 1990)
Group 4: 17 instances (April 1990)
Group 5: 48 instances (August 1990)
Group 6: 49 instances (Updated January 1991)
Group 7: 31 instances (June 1991)
Group 8: 86 instances (November 1991)
Total: 699 points (as of the donated database on 15 July 1992)

Attribute Information:
1. Sample code number
2. Clump Thickness 1–10
3. Uniformity of Cell Size 1–10
4. Uniformity of Cell Shape 1–10
5. Marginal Adhesion 1–10
6. Single Epithelial Cell Size 1–10
7. Bare Nuclei 1–10
8. Bland Chromatin 1–10
9. Normal Nucleoli 1–10
10. Mitoses 1–10
11. Class: (2 for benign, 4 for malignant)

The database has 16 instances in Groups 1 to 6 that each contain a single missing value, denoted by "?".

Class distribution:

  • Benign: 458 (65.5%)
  • Malignant: 241 (34.5%)

ML tools I used

For a while I had wanted to learn and test out a powerful Python machine learning library called Scikit-learn. Everyone in the AI field talks about it and I have read hundreds of articles about it, so with a spirit of adventure I took the challenge and tried it out.

“Setting goals is the first step in turning the invisible into the visible.” — Tony Robbins

From my experience I have to say Scikit-learn is a must-have machine learning library on every data scientist's belt. It is one of those tools that are there to make your life easier and help you experiment quickly without breaking a sweat. It took me almost an hour to set up everything and make my first prediction, and that is counting data pre-processing and post-processing.

If you are a beginner trying to get into machine learning, or a professional looking for a scalable rapid prototyping tool, then this is for you.

It is one of the coolest machine learning libraries out there, if not the coolest, and it is built on NumPy, SciPy and Matplotlib.

  • NumPy: a Python library for mathematical and numerical operations on n-dimensional arrays and matrices.[2]
  • SciPy: a Python-based ecosystem of open-source software for mathematics, science, and engineering.[3]
  • Matplotlib: a data visualization library for the Python programming language and its numerical mathematics extension NumPy.[4]

Scikit-learn, also known as sklearn, contains many efficient tools for machine learning (ML) and statistical modelling, including classification, regression, clustering and dimensionality reduction.

  • Classification: Identifying which category an object belongs to. Applications: Spam detection, Image recognition.[5]
  • Regression: Predicting a continuous-valued attribute associated with an object. Applications: Drug response, Stock prices.[5]
  • Clustering: Automatic grouping of similar objects into sets. Applications: Customer segmentation, Grouping experiment outcomes.[5]
  • Dimensionality reduction: Reducing the number of random variables to consider. Applications: Visualization, Increased efficiency. [5]

Classifying Breast Cancer

(Protected under the MIT license)

Whenever you want to work on an AI problem, there is a set of steps you should follow to understand the problem better and to gain more insight from the data. Yes, the data can speak to you and tell you a story!

Note: The steps described here are the ones I personally used and they worked wonders, yet you are free to create your own if these don't suit your needs.

Steps

  1. Download the dataset (link) and read it.

Of course, this one is very obvious: you have to first identify the dataset you want to work with and then download it. Datasets come in different formats (e.g. CSV (comma-separated values), text, JPG, etc.).

The one I used came in text file format and I converted it to a CSV spreadsheet; if you want to download the CSV file, click here.

To illustrate, the breast cancer database looks like this: each row is a concatenation of 11 comma-separated values, the sample code number, the 9 features that describe the sample, and its class label.

1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2

There is one row per patient sample; the first value is the sample code number and the rest follow the order described in the previous Database section under Attribute Information.

The data can be extracted from the text file and read into a pandas DataFrame.

I created a file Data.py that contains a class designed and optimized for this dataset.

Make sure you use it if you wish to replicate the same results.
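
A minimal sketch of the loading step, in case you want to follow along without Data.py: the file name and column names below are my own assumptions based on the Attribute Information list above, not the author's exact code.

```python
# Hedged sketch: load the raw Wisconsin Breast Cancer data into a pandas DataFrame.
# The file name and column names are assumptions based on the attribute list above.
import pandas as pd

COLUMNS = [
    "sample_code_number", "clump_thickness", "uniformity_cell_size",
    "uniformity_cell_shape", "marginal_adhesion", "single_epithelial_cell_size",
    "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# The raw file has no header row; missing values are recorded as "?".
df = pd.read_csv("breast-cancer-wisconsin.data", names=COLUMNS, na_values="?")

print(df.shape)   # expected: (699, 11)
print(df.head())
```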

2. Plot Data

Picture from https://www.sydneycommunitycollege.edu.au/course/ARVA10

This is the most important step, because not only should you have data, you also need to visualize and understand it before feeding it to an algorithm. This step can make or break the success of your solution.

This is the image I created to inspect the correlations in the data

The image above shows the correlation between variables and their overall correlation with label Y (Malignant or Benign).

Correlation: a mutual relationship or connection between two or more things.

We can clearly see that all variables are fairly highly correlated, so this is a case of multiple correlation.
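
For illustration, here is one way such a correlation matrix could be computed and plotted with pandas and Matplotlib, assuming the DataFrame from the loading sketch above; this is not necessarily how the image was produced.

```python
# Rough sketch: pairwise correlations between the features and the class label,
# rendered as a heatmap. Assumes `df` from the loading sketch above.
import matplotlib.pyplot as plt

features = df.drop(columns=["sample_code_number"])
corr = features.corr()  # pairwise Pearson correlations (NaNs handled pairwise)

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="correlation")
plt.tight_layout()
plt.show()
```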

More images and info on how I did data visualization in my GitHub page soon.

3. Preprocessing

This is another key aspect when solving an AI problem: sometimes the data doesn't come clean and complete. The Wisconsin Breast Cancer Database, for example, comes with 16 missing values ("?") in one of its columns.

Missing values can be a pain to handle, and they can affect your accuracy. For this, I contacted Mrs. Poonam for her expertise in statistics. As per her advice, I replaced all missing values with the standard deviation of the entire column, which is approximately 4; that decreased the accuracy by 0.27%, so I tweaked it a bit and put 1 instead of 4, and it regained the lost accuracy.

I did not notice an increase in accuracy in this case, which was odd, but I guess in some other case it could have improved it.
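
As a reference, a small sketch of this imputation step, assuming the DataFrame and column names from the loading sketch above (the author's Data.py class may handle it differently):

```python
# Hedged sketch: the "?" entries (parsed as NaN by read_csv above) live in the
# Bare Nuclei column; replace them with the constant 1, as described in the text.
df["bare_nuclei"] = df["bare_nuclei"].fillna(1).astype(int)
```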

Labels

Our labels come as 2's (benign) and 4's (malignant), so I created an encoded vector of labels, basically replacing the (2 & 4) with (0 & 1).

One-hot encoding is a process by which categorical variables are converted into a form that can be provided to ML algorithms so they do a better job at prediction. For more information, check out this amazing article by Rakshith Vasudev.
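
With only two classes, this comes down to mapping the codes 2 and 4 onto 0 and 1. A small sketch, again assuming the DataFrame from the earlier sketches:

```python
# Hedged sketch: build the feature matrix X and the 0/1 label vector y.
X = df.drop(columns=["sample_code_number", "class"]).values
y = df["class"].map({2: 0, 4: 1}).values  # 0 = benign, 1 = malignant
```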

4. Training

The most exciting and frustrating part of developing AI systems is training.

Training an AI model is a live example of the famous quote:

“Appearances are often deceiving”. — Aesop

AI algorithms need a lot of experimenting, literally trial and error. The details are out of the scope of this article; I am looking forward to writing an article dedicated to the subject of training, named 'The curse of learning'.

For training, we use Scikit-learn’s Logistic Regression.

Screenshot from the Colab notebook I made; you can find it here.
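
Since the notebook itself is not reproduced here, the following is a hedged sketch of what the training step looks like with scikit-learn's LogisticRegression; the split ratio, random_state and max_iter are my own choices, not values taken from the notebook.

```python
# Hedged sketch: hold out part of the data for testing, then fit a logistic
# regression classifier on the training portion.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # assumed split, not the author's
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Training accuracy:", model.score(X_train, y_train))
```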

Testing

Once training is done we can test how accurate our model is given new features (X) without the labels (Y). It will try to predict the labels of these new features using the knowledge acquired during training.
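
A short sketch of that testing step, continuing from the training sketch above:

```python
# Predict labels for the held-out features and compare against the true labels.
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```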

Analysing results

Now that we have trained and tested, the next natural step is to analyse our results to gain further insight into how good the classifier we trained really is.

I personally am a fan of the confusion matrix. Although the name is a bit scary, I can assure you there is nothing confusing about it; this matrix really gives us a deep understanding of how well our model is predicting labels.
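
For completeness, a small sketch of how the confusion matrix can be computed with scikit-learn, continuing from the sketches above (the notebook may display it differently):

```python
# Rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[benign predicted as benign,    benign predicted as malignant],
#  [malignant predicted as benign, malignant predicted as malignant]]
```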

This is fantastic!!! We are only mislabeling 8 benign cancer samples out of 220 and 7 malignant cancer samples out of 219.

Sometimes accuracy can be misleading: on a heavily imbalanced dataset, a model can reach a very high accuracy while getting almost every sample of the minority class wrong.

That’s not the case here 😎.

You can do this too, start by running this Colab notebook, testing with a different dataset and practice. I believe in you!
