Handling Data Imbalance for Data Science Projects

Introduction

In data science projects, it is quite common to encounter imbalanced data while working with classification problems. Imbalanced classes simply means that the classes of the target variable are unequally distributed. For example, consider a cancer dataset containing the details of patients screened for cancer. The target variable contains two classes: Yes (indicating presence of cancer) and No (indicating the patient is not affected by cancer). The No class is likely to be far larger than the number of cancerous patients (class Yes). So this is an example of an imbalanced dataset.

The main issue with imbalanced datasets is that if you build a model directly on such data, the model can be biased towards the dominant class. Furthermore, when you evaluate the model using metrics like accuracy_score(), you may get a good score simply because the model predicts the dominant class (class No) really well, even though it fails to predict the minority class. This can be dangerous in medical use cases. So handling data imbalance is critical for every machine learning project. This article is intended to help you understand how data imbalance can be handled in real-world data science use cases.
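To see why accuracy is misleading here, consider a quick sanity check: a baseline that always predicts the majority class. The sketch below uses scikit-learn's DummyClassifier on toy labels with a 1% minority class (an illustration, not the credit card data used later):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 990 negatives and 10 positives (1% minority class)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features do not matter for this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

print('Accuracy:', accuracy_score(y, y_pred))       # 0.99
print('Minority recall:', recall_score(y, y_pred))  # 0.0

An accuracy of 99% looks impressive, yet this baseline never detects a single positive case, which is exactly the danger described above.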

Dataset

In this article, we use a dataset of credit card transactions, some of which are fraudulent and the rest legitimate. So we have two classes, and it is obvious that the number of fraudulent transactions will be comparatively low. The dataset contains transactions made by credit card holders in September 2013: 492 fraudulent and around 2.8 lakh (284,315) legitimate transactions. So this is a genuine example of an imbalanced dataset. You can download the dataset from Kaggle.

Let us look at the methods for handling data imbalance.

Exploring the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

df = pd.read_csv("creditcard.csv")  # the credit card fraud dataset
df.head()

The top 5 rows of the dataframe are displayed below.

[Figure: first five rows of the credit card fraud dataframe]

Remember that the data contains values transformed by PCA, due to confidentiality issues in revealing the actual data. The variables ‘Time’ and ‘Amount’ have not been transformed, and the target variable is ‘Class’.

df.shape

The output is (284807, 31): we have 284807 rows and 31 columns. Let us see how the class values are distributed.

class_count = df.Class.value_counts()
print('Count of Class 0:', class_count[0])
print('Count of Class 1:', class_count[1])

The output is:

Count of Class 0: 284315
Count of Class 1: 492
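
It is also useful to express the minority class as a fraction of all transactions; here, frauds make up well under one percent of the data:

fraud_ratio = class_count[1] / len(df)
print(f'Fraudulent transactions: {fraud_ratio:.4%} of the data')
# Fraudulent transactions: 0.1727% of the data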

Now we can visualize the same using a bar plot. 

plt.figure(figsize=(8, 5))
sns.set_palette("Paired")
plt.title("Class Distribution")
ax = sns.countplot(x="Class", data=df)
ax.grid()
ax.set(xlabel='Class', ylabel='Count')

[Figure: class distribution bar chart]

You can see a clear class imbalance in the above plot. Let us look at some strategies for handling it.

Handling Data Imbalance using Python

Resampling

Resampling is a widely adopted method for handling imbalanced data. It can be either oversampling or undersampling.

  • Oversampling: the process of adding more samples to the minority class to increase its size. The simplest way is to make copies of existing minority-class samples, but this can cause overfitting.

[Figure: oversampling for handling data imbalance]

  • Undersampling: the process of removing samples from the majority class. The simplest way is to drop records from the majority class, but this can cause loss of information.

[Figure: undersampling for handling data imbalance]
Implementing Basic Sampling Methods Using Python

To implement basic random sampling, we use the DataFrame.sample() method, which draws random samples from a dataframe. Sampling each class separately makes handling the imbalance straightforward.

Let us first count the classes.

count_class_0, count_class_1 = df.Class.value_counts()
print(f'Number of rows belonging to class 0: {count_class_0}')
print(f'Number of rows belonging to class 1: {count_class_1}')

The output is shown as:

Number of rows belonging to class 0: 284315
Number of rows belonging to class 1: 492

Now we can divide the dataframe based on the target class.

df_class_0 = df[df['Class'] == 0]
df_class_1 = df[df['Class'] == 1]

Performing Undersampling

To perform undersampling, pass the number of samples you want retained in the majority class as an argument to the sample() method.

  • Here we want only 492 samples in class 0, so sample() randomly keeps 492 records and drops the rest.
  • Remember that count_class_1 is 492.

df_class_0_under = df_class_0.sample(count_class_1)
df_under = pd.concat([df_class_0_under, df_class_1], axis=0)

With the above code, we undersample class 0 and concatenate the two dataframes. Now both class 0 and class 1 have an equal number of samples (492). We can visualize this as a bar chart.

df_under.Class.value_counts().plot(kind='bar', title='Class Counts')

[Figure: bar plot showing the undersampled class counts]

The resultant dataframe is df_under, with an equal number of class 0 and class 1 samples.

df_under.shape

This outputs:

(984, 31)
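
One practical note: pd.concat stacks all the class 0 rows before the class 1 rows, so it is a good idea to shuffle the combined dataframe before training a model. A minimal sketch:

df_under = df_under.sample(frac=1, random_state=42).reset_index(drop=True)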
Performing Oversampling

In the same way, we can perform oversampling by passing the number of samples we want in the minority class as an argument to the sample() method. Here we have only 492 samples in class 1 and want to expand this to 284315 records. Since we need far more samples than exist, we sample with replacement (replace=True), which draws duplicates of existing rows.

df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_over = pd.concat([df_class_0, df_class_1_over], axis=0)
print('After Over Sampling:')
print(df_over.Class.value_counts())

This prints the following output.

After Over Sampling:
1    284315
0    284315
Name: Class, dtype: int64

The same can be visualized using a bar plot.

df_over.Class.value_counts().plot(kind='bar', title='Class Counts')

[Figure: bar plot showing the oversampled class counts]

Now both classes have an equal number of samples (284315). The resultant dataframe is df_over, with an equal number of class 0 and class 1 samples.

df_over.shape

This outputs the following:

(568630, 31)

Sampling using the Python imbalanced-learn Module

Instead of sampling the data directly, Python's imbalanced-learn module offers smarter approaches, such as clustering the data and selecting samples from the clusters, or introducing small variations of the data while oversampling. We will learn some of these resampling techniques using imblearn here.

The module can be installed using pip.

pip install imbalanced-learn

To work with our dataframe, we divide it into features and labels.

X = df.iloc[:, :-1]
Y = df['Class']
X.shape, Y.shape
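
A quick aside before applying the samplers: in a real project, it is standard practice to split the data first and resample only the training portion, so that the test set keeps the original class distribution. A minimal sketch, assuming scikit-learn is available:

from sklearn.model_selection import train_test_split

# stratify=Y keeps the original class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=42)
# any sampler below would then be fit on (X_train, y_train) only

For simplicity, the examples below resample the full X and Y.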


Random under-sampling and over-sampling with imbalanced-learn

We can perform both undersampling and oversampling with the help of imblearn. The code below performs random undersampling.

from imblearn.under_sampling import RandomUnderSampler
rand_us = RandomUnderSampler()
X_rus, y_rus = rand_us.fit_resample(X, Y)
print(X_rus.shape)
print(y_rus.shape)

This outputs the following:

(984, 30)
(984,)

We can count the class labels before and after the re-sampling.

from collections import Counter

print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_rus)))

Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 492, 1: 492})

In a similar way, we can perform random oversampling.

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X, Y)
print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_ros)))

The output of the above code will be:

Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 284315, 1: 284315})

Undersampling using the Cluster Centroids Method for Handling Data Imbalance

In this method, the majority class is undersampled by replacing clusters of similar records with their k-means cluster centroids. While implementing this in Python, you can pass a sampling_strategy dictionary that maps each class label to the number of samples to keep. For example, {0: 5} keeps 5 samples for the majority class 0 and all samples from the minority class 1. This is an effective approach for handling data imbalance.

from imblearn.under_sampling import ClusterCentroids

centroid = ClusterCentroids(sampling_strategy={0: 492})
X_cc, y_cc = centroid.fit_resample(X, Y)
print(f'Shape of X : {X_cc.shape} and Shape of Y : {y_cc.shape}')

The above code keeps only 492 records for class 0 and all records from class 1, since we gave the sampling strategy as {0: 492}. The output will be:

Shape of X : (984, 30) and Shape of Y : (984,)
The input data shape and the resampled data shape are compared by the code below.

print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_cc)))

The output will be:

Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 492, 1: 492})
Oversampling using SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is an effective method for oversampling with imblearn. It synthesizes new minority-class elements from the existing data. A point is picked at random from the minority class, and its k nearest neighbors within the same class are found. Synthetic points are then created between the selected point and its neighbors. Each new point is based on existing points but is never an exact duplicate of one. To make this concrete, consider a minority-class point and assume k=3: we compute its 3 nearest neighbors from the same minority class, and for each neighbor we take the line segment to the selected point and create a new point somewhere along that segment.
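
The interpolation step itself is simple. Here is a minimal numpy sketch of the core idea (an illustration of the formula, not imblearn's actual implementation):

import numpy as np

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])              # a minority-class point
neighbor = np.array([2.0, 3.0])       # one of its k nearest minority neighbors
gap = rng.random()                    # random factor in [0, 1)
synthetic = x + gap * (neighbor - x)  # a new point on the segment between them
print(synthetic)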

Let us see how we can implement this using imblearn.

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority')
X_smt, y_smt = smote.fit_resample(X, Y)

In the above code, sampling_strategy='minority' indicates that only the minority class is resampled.

print('Original dataset shape {}'.format(Counter(Y)))
print('Oversampled dataset shape {}'.format(Counter(y_smt)))

The output is shown as:

Original dataset shape Counter({0: 284315, 1: 492})
Oversampled dataset shape Counter({0: 284315, 1: 284315})

Now you can see that our minority class has been oversampled using SMOTE.

Undersampling Using Tomek Links

Tomek links are pairs of very close instances that belong to opposite classes. These points make classification difficult because they are hard to separate. The same concept is used here for undersampling: removing the majority-class member of each Tomek link cleans up the class boundary. Let us see how to implement undersampling using Tomek links.

from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, Y)
print(f'Shape of X : {X_tl.shape} and Shape of Y : {y_tl.shape}')

Shape of X : (284736, 30) and Shape of Y : (284736,)

Remember that this technique removes only the majority-class instances that form Tomek links, which is why you do not see a huge reduction in the number of samples.

print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_tl)))

Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 284244, 1: 492})
Hybrid Method Combining Both SMOTE and Tomek Links

Imblearn also allows combining SMOTE and Tomek links, i.e. performing oversampling followed by a cleaning undersampling step. SMOTETomek is the hybrid method that does this.

from imblearn.combine import SMOTETomek

smotem = SMOTETomek(sampling_strategy='auto')
X_stmk, y_smtmk = smotem.fit_resample(X, Y)
print(f'Shape of X : {X_stmk.shape} and Shape of Y : {y_smtmk.shape}')

SMOTE first oversamples the minority class to match the majority class; the Tomek links in the resulting data are then identified and removed.

Shape of X : (567530, 30) and Shape of Y : (567530,)

Now we can see what change the sampling made to the data.

print(f'Original dataset shape {Counter(Y)}')
print(f'Resampled dataset shape {Counter(y_smtmk)}')

Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 283765, 1: 283765})

Conclusion

Class imbalance is a major problem that data scientists encounter while dealing with real-world use cases. In this article we covered some practical approaches in Python that you can use to tackle it. Choosing the best method for handling class imbalance depends on the dataset; for oversampling, SMOTE is a particularly well-known technique. Overall, we hope this adds value to your knowledge.
