Breast Cancer Prediction using K-Nearest Neighbors Algorithm (KNN)

Introduction

Breast cancer is a cancer that develops in breast cells. According to 2019 U.S. statistics, about 1 in 8 U.S. women (about 12%) will develop invasive breast cancer over the course of her lifetime. For women in the U.S., breast cancer death rates are higher than those for any other cancer besides lung cancer, so early diagnosis is critical for survival. It is important to understand that most breast lumps are benign, not cancerous (malignant): non-cancerous breast tumors are abnormal growths, but they do not spread outside of the breast. So another big challenge is to identify whether a lump is malignant or benign. In this article, we will see how to classify breast cancer using the K-Nearest Neighbors algorithm.

K-Nearest Neighbors Algorithm

K-Nearest Neighbors is one of the simplest and most easily interpretable supervised machine learning algorithms. One peculiarity of KNN is that it has no separate training phase: the algorithm keeps the whole dataset as its training set. Despite this simplicity, it can handle surprisingly complex machine learning tasks. KNN can be used to solve both classification and regression problems, although it is generally used for classification.

KNN Classification

Let's take a simple case to understand the algorithm. Assume we have to determine the class of a new data point, "?". The K in KNN is the number of nearby neighbors we consider when voting on the class of "?". When K = 3, we look at the three closest data points; if 2 of the 3 are blue circles, the new point is classified as a blue circle. In the same way, when K = 7, if 4 of the 7 nearest points are green circles, the new point is classified as a green circle.
To summarize: KNN simply calculates the distance from an unknown/new data point to every training data point, using a metric such as Euclidean or Manhattan distance. It then selects the K nearest data points and assigns the new point to the class that the majority of those K points belong to.
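The voting procedure described above can be sketched in a few lines of plain Python. This is an illustrative toy implementation (the function name `knn_predict` and the toy data are mine, not from the article); in practice we will use scikit-learn's `KNeighborsClassifier` below.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    dists = [math.dist(query, x) for x in train_X]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters, labelled 0 and 1
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (2, 2), k=3))   # -> 0
print(knn_predict(train_X, train_y, (8, 7), k=3))   # -> 1
```

Note there is no "fit" step at all: the training data is simply stored, and all the work happens at prediction time.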

Dataset

In this classification task, we use the Breast cancer wisconsin (diagnostic) dataset to predict whether the cancer is benign or malignant. The dataset features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The data has only two labels Malignant(M) or Benign(B).
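If you don't have the CSV file used below, a copy of this dataset also ships with scikit-learn. One caveat: scikit-learn encodes malignant as 0 and benign as 1, the reverse of the mapping used later in this article (the variable name `df_sk` is mine, for illustration).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled Wisconsin diagnostic dataset
data = load_breast_cancer()
df_sk = pd.DataFrame(data.data, columns=data.feature_names)
df_sk['diagnosis'] = data.target   # 0 = malignant, 1 = benign in scikit-learn
print(df_sk.shape)                 # -> (569, 31)
```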

Importing the libraries and Reading Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

df=pd.read_csv("breast_cancer.csv")
df.head()
The top 5 rows of the dataframe are displayed below.
[Image: first five rows of the dataframe]

The diagnosis column is our target variable. Notice that there is one unwanted column, 'Unnamed: 32', containing only NaN values, and that the id column carries no predictive information, so we can remove both.

df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)
# check whether any of the columns contain null values
df.isnull().sum()

The data contains no missing values, which means it is already quite clean. The label values 'M' and 'B' correspond to the malignant and benign classes; we can convert them to 1 and 0 respectively.

ctypes = {'M': 1, 'B': 0}
df['diagnosis'] = df['diagnosis'].map(ctypes)

Visualizing the Data

We pick a few features, such as radius_mean, texture_mean and perimeter_mean, to visualize the data distribution.
sns.pairplot(df,vars=['radius_mean','texture_mean','perimeter_mean'],hue='diagnosis')
[Image: Seaborn pairplot of radius_mean, texture_mean and perimeter_mean, colored by diagnosis]

Creating the KNN Model

#loading libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Before feeding the data to the algorithm, we split it into features and labels.

X = np.array(df.iloc[:,1:])
y = np.array(df['diagnosis'])
Generally you could train KNN on the complete dataset, but to evaluate the model we hold out a separate test set.
# test train split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.33, random_state = 42)

Identifying the optimal value of k

Choosing the value of K at random does not guarantee a good result. One way to find a good value of K is to plot K against the corresponding error rate on the dataset. Here we use the cross-validation error: the K with the lowest CV error is the optimal value.

#Performing 10 fold cross validation
from sklearn.model_selection import cross_val_score
nbrs = []
cv_scores = []
for k in range(1, 40):
    nbrs.append(k)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
print(cv_scores)

The output is given as:

[0.8946711894080316, 0.910333005069847, 0.9023634971003393, 0.9077689025057445, 0.9076977787504104, 0.9053434000802423, 0.9076977787504104, 0.8972352919721341, 0.9025695736222051, 0.9025695736222051, 0.9079749790276106, 0.9050625524309733, 0.9103968340810444, 0.9049914286756392, 0.9049950760477076, 0.9049950760477076, 0.9023634971003391, 0.9049275996644418, 0.9023634971003391, 0.9049950760477076, 0.9049275996644418, 0.9049275996644418, 0.9049275996644418, 0.9023634971003393, 0.9023634971003391, 0.9023634971003391, 0.9023634971003391, 0.8971678155888683, 0.9023634971003393, 0.8997319181529708, 0.9023634971003391, 0.9048564759091076, 0.902224896961739, 0.8970292154502681, 0.902224896961739, 0.8996607943976365, 0.902224896961739, 0.8996607943976365, 0.8970966918335339]

The above are the scores for each K using 10-fold cross validation. You can see that K = 2 and K = 13 give the best scores (with K = 13 marginally ahead) and hence the lowest error. To be more specific, we can compute the misclassification error (1 − CV score) for each K and pick the K that minimizes it.

#Misclassification error
MSE = [1-x for x in cv_scores]

#Optimal value of k, with least MSE
optimal_k = nbrs[MSE.index(min(MSE))]
print('The optimal value of K (neighbors) is %d ' %optimal_k)

The output of the statement is:

The optimal value of K (neighbors) is 13 

We can visualize the error for each value of K with the plot below, using matplotlib.

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), MSE, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Misclassification Error')
[Image: misclassification error vs. K value]
#Creating the Model with selected optimal value
knn = KNeighborsClassifier(n_neighbors = 13)
knn.fit(X_train,y_train)
The model created will be:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=13, p=2,
                     weights='uniform')

Evaluating the Model

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# generate predictions for the held-out test set
y_predict = knn.predict(X_test)
acc = accuracy_score(y_test, y_predict)
print(f'Accuracy Score of the Model: {acc}')

Accuracy Score of the Model: 0.9627659574468085

print('Confusion Matrix :\n')
print(confusion_matrix(y_test,y_predict))
print(f'\nClassification Report \n\n {classification_report(y_test,y_predict)}')
[Image: classification report and confusion matrix for the KNN model]

So we have a pretty good model for classifying breast cancer with KNN.

Conclusion

Here we have successfully created a good model for breast cancer classification. The same experiment can be performed with SVM, logistic regression or any other classification algorithm, and you can keep improving the results by tuning the parameters and the data preprocessing. It is worth noting that the K-Nearest Neighbors algorithm does not always perform well with high-dimensional or categorical features. The main objective of this task was to introduce the simple yet powerful KNN algorithm through a real-world use case.
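As a rough sketch of that comparison (not part of the original experiment), the snippet below scores KNN, logistic regression and an RBF-kernel SVM with the same 10-fold cross validation used earlier. It uses scikit-learn's bundled copy of the dataset purely so it is self-contained, and adds feature scaling, which typically helps distance- and margin-based models.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    'KNN (k=13)': KNeighborsClassifier(n_neighbors=13),
    'Logistic Regression': LogisticRegression(max_iter=5000),
    'SVM (RBF)': SVC(),
}
for name, model in models.items():
    # Standardize features, then cross-validate each classifier
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=10, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f}')
```

Exact numbers will vary with preprocessing and parameters, which is precisely the point: the pipeline above is a starting template for such experiments.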
