Feature Selection Using Tree-Based Methods and Recursive Feature Elimination (RFE)

Introduction

Decision Trees and Random Forests are two powerful tree-based machine learning algorithms that data scientists use extensively. One remarkable advantage of tree-based algorithms is that they are easy to interpret, which also makes it straightforward to derive how much each variable influences the decision-making process. In simple words, with tree-based methods it is easy to compute how much each variable contributes to a decision. In this article, we will see two approaches for feature selection using tree-based models.

Feature Selection

Generally in data science we deal with thousands of features, out of which sometimes only a few carry the information relevant to our decision making; maybe just 10-15 features are actually important. This process of identifying only the most relevant features is called feature selection. Feature selection reduces the computational cost, makes the model easier to interpret and, more importantly, reduces overfitting because it lowers the variance of the model.

Next, we will see how random forest helps to select the relevant features.

Random Forest Feature Importance

A random forest is a collection of simple decision trees. It creates a set of decision trees from randomly selected subsets of the training set and then aggregates the votes from the different decision trees to decide the final class of the test object.

Random Forests are very effective for feature selection because the underlying tree-based strategy ranks features by how well they improve the purity of a node, i.e. by the decrease in impurity aggregated over all trees. Remember that the nodes with the highest impurity are at the top of a tree, the impurity decreases as we move down the tree, and the nodes with the lowest impurity occur at the ends of the trees. In essence, features that are used for splits at the top of the trees are in general more important than features used at the end nodes of the trees. During training we measure to what extent each feature decreases the impurity, and using this we can identify the set of most important features.
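
To make this concrete, here is a minimal sketch (on a synthetic dataset, not the one used later in this article) that fits a random forest and reads the impurity-based importances that scikit-learn exposes as feature_importances_:

# Minimal illustration on synthetic data: one importance value per feature,
# computed as the mean decrease in impurity averaged over all trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)
for idx, imp in enumerate(rf.feature_importances_):
    print(f'feature_{idx}: {imp:.3f}')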

Now we will see how we can implement feature selection using a direct method (SelectFromModel) and using RFE (Recursive Feature Elimination).

1. Selecting Important Features using Random Forest and RFE

In this feature selection task we focus on selecting important features using Recursive Feature Elimination. Before starting the implementation, let us understand what RFE is.

Recursive Feature Elimination(RFE):

Recursive Feature Elimination (RFE) recursively eliminates features: it builds a model on the current set of features, ranks the features by importance, removes the least important ones and repeats. In this way RFE considers smaller and smaller sets of features recursively to judge their importance.
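
Conceptually, RFE behaves like the simplified sketch below. This is only an illustration of the idea, assuming a pandas DataFrame X and an estimator that exposes feature_importances_ (as tree-based models do); scikit-learn's actual RFE is more general, e.g. it can remove several features per iteration via its step parameter.

import numpy as np
from sklearn.base import clone

def simple_rfe(estimator, X, y, n_features_to_select):
    # Keep refitting the model and dropping the least important feature
    # until only n_features_to_select features remain.
    remaining = list(X.columns)
    while len(remaining) > n_features_to_select:
        model = clone(estimator).fit(X[remaining], y)
        weakest = remaining[int(np.argmin(model.feature_importances_))]
        remaining.remove(weakest)
    return remaining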

Now let us first get into the implementation part of feature selection using RFE.

Dataset :

In this feature selection task, we will be identifying the important features from the “Mobile Price” training dataset. The main aim of this dataset is to predict the price range of mobile phones based on various attributes. You can download the dataset from here. Let’s start!

Importing Libraries :

As a first step, we import all the libraries required for our task.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Reading and Exploring the Data :

I have downloaded the data and kept it in the D drive of my local machine. We can import the data from there using pandas and print the first 5 records.

data = pd.read_csv('D://mobile_price.csv')
data.shape

The output of this line is (2000, 21), which means we have 2000 rows and 21 columns.

Not all of these 21 columns may be useful for us, so our aim is to select the best n features from this dataset.
We can print the first 5 records of the dataframe.

data.head()

We can separate out the features and the labels.

X=pd.DataFrame(data.iloc[:,:-1])
y=pd.DataFrame(data.iloc[:,-1])

Test Train Split :

Now we can split the data into train and test sets. It is always recommended to do feature selection on the training data only, so that no information from the test set leaks into the selection.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)
print(f'X_train shape : {X_train.shape}\nX_test shape :{X_test.shape}')

This splits the data into train and test and displays the shape of the data as below.

X_train shape : (1400, 20)
X_test shape : (600, 20)

Creating Random Forest Model:

Now we can create a random forest model.

from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier(n_estimators=100)

We have created a random forest model here with n_estimators=100, so the number of trees to be used in the forest is set to 100.

Selecting Features using RFE :

We can use RFE and pass in the model we created, along with the number of features we want to select from the dataset.

from sklearn.feature_selection import RFE
sel_ = RFE(estimator, n_features_to_select=6)

We can fit the new RFE instance on our training data X_train and y_train, so that it selects the 6 most important features based on the training data.

sel_.fit(X_train, y_train)

This outputs the following:

Fitting Random Forest Model using RFE

Let's see which features are selected:

sel_.get_support()

A True value at an index position indicates that the corresponding feature is selected, and False indicates that the feature has been removed.

array([ True, False, False, False, False, False, True,
False,  True, False,  True,  True,  True,  True, False,
False, False, False, False, False])

This does not give us the variable names. We can see the exact names of the selected variables using the code below.

selected_feat = X_train.columns[sel_.get_support()]
print(list(selected_feat))

This displays the actual feature names.

['battery_power',
 'int_memory',
 'mobile_wt',
 'pc',
 'px_height',
 'px_width',
 'ram']

Let us see feature importance values.

sel_.estimator_.feature_importances_

This gives the feature importance corresponding to each of these features.

array([0.10791525, 0.04683937, 0.05140157, 0.08051523, 0.08200777,
       0.63132081])

From this we can see that the last value is by far the highest importance (about 0.63), and it corresponds to ‘ram’. According to RFE, RAM is the most important feature in deciding the price of a mobile phone.

You can also visualize these feature importance values in the form of a histogram.

import matplotlib.pyplot as plt
%matplotlib inline
x=sel_.estimator_.feature_importances_.ravel()
plt.hist(x)
plt.grid()
plt.xlabel('Feature Importance')
plt.ylabel('Count')
plt.show()
Feature Importance Histogram RFE
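
Since a histogram only shows how the importance values are distributed, an optional alternative is a bar chart with one bar per selected feature, which is often easier to read. A small sketch, reusing selected_feat and the fitted selector from above:

# One bar per selected feature instead of a histogram of the values.
plt.bar(selected_feat, sel_.estimator_.feature_importances_)
plt.xticks(rotation=45)
plt.ylabel('Feature Importance')
plt.tight_layout()
plt.show()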

In this way, we can select the important features from a random forest model using Recursive Feature Elimination. Note that RFE is a general method: you can use it with various machine learning algorithms, as shown in the sketch below. The main downside of RFE is that it is time consuming, since the model is refit many times.
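
For example, here is a minimal sketch that swaps in a logistic regression as the underlying estimator (any model exposing coef_ or feature_importances_ can be used; max_iter=1000 is just an illustrative setting, not part of this tutorial's run):

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# RFE driven by logistic regression coefficients instead of a random forest.
rfe_lr = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)
rfe_lr.fit(X_train, y_train.values.ravel())
print(list(X_train.columns[rfe_lr.get_support()]))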

2. Selecting Important Features using Random Forest and SelectFromModel

In the SelectFromModel approach, we first create the random forest instance with the desired number of trees, as we did in the previous section, and then use the SelectFromModel class from scikit-learn to select the features automatically. By default, SelectFromModel selects the features whose importance is greater than the mean importance of all features. You can alter this by passing a threshold argument to SelectFromModel.
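
For instance, here is a small sketch of how the threshold can be changed (the values below are purely illustrative and are not used in this tutorial's run):

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Keep features above the median importance instead of the mean importance.
sel_median = SelectFromModel(RandomForestClassifier(n_estimators=100),
                             threshold='median')
# Or keep only features whose importance exceeds an absolute cutoff.
sel_cutoff = SelectFromModel(RandomForestClassifier(n_estimators=100),
                             threshold=0.05)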

We follow the same steps as for RFE up to the model fitting, so we can reuse the same code as below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data = pd.read_csv('D:\\mobile_price.csv')
data.shape
data.head()
X=pd.DataFrame(data.iloc[:,:-1])
y=pd.DataFrame(data.iloc[:,-1])
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)
print(f'X_train shape : {X_train.shape}\nX_test shape : {X_test.shape}')

Using SelectFromModel to Select the Features :

We import SelectFromModel from sklearn and pass it our random forest classifier, created with the desired number of trees.

from sklearn.feature_selection import SelectFromModel
sel_ = SelectFromModel(RandomForestClassifier(n_estimators=100))

Now fit it with X_train and y_train.

sel_.fit(X_train, y_train)

The output of the code will be:

Selecting Features using SelectFromModel and Random Forest

As mentioned earlier, scikit-learn will select those features whose importance values are greater than the mean importance of all features. Let us have a look:

sel_.get_support()
array([ True, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True, False, False, False, False,
       False, False])

To verify the above statement about the number of features selected, let us compare the number of selected features with the number of features whose importance is greater than the mean importance.

selected_feat = X_train.columns[sel_.get_support()]  # names of the selected columns
print('Total features: {}'.format(X_train.shape[1]))
print('Selected features: {}'.format(len(selected_feat)))
print('Features with coefficients greater than the mean coefficient: {}'.format(
    np.sum(sel_.estimator_.feature_importances_ > sel_.estimator_.feature_importances_.mean())))

The output of the above code is:

Total features: 20
Selected features: 4
Features with coefficients greater than the mean coefficient: 4

Let us see the columns present and the corresponding importance (coefficient) values.

print(f'Features Present in the Dataframe: {data.columns[:-1]}')
print(f'Coefficient value for each Feature: {sel_.estimator_.feature_importances_}')

From the above values, we can see that the most important feature in determining the mobile price is “ram”.

Features Present in the Dataframe: Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi'],
      dtype='object')
Coefficient value for each Feature: [0.07430381 0.00712549 0.02922127 0.00733069 0.02398714 0.00651424
 0.04030257 0.02515735 0.03951559 0.02501331 0.03026259 0.0575251
 0.06146919 0.46180555 0.02938165 0.02811903 0.03281924 0.00585079
 0.00746916 0.00682625]
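
The raw array above is hard to read. To pair each column name with its importance and see the ranking at a glance, we can put them into a pandas Series (a small optional sketch):

import pandas as pd

# Importance of every column in X_train, sorted from highest to lowest.
all_importances = pd.Series(sel_.estimator_.feature_importances_,
                            index=X_train.columns).sort_values(ascending=False)
print(all_importances.head())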

Now let us print the names of the selected features using the code below.

selected_feat = X_train.columns[(sel_.get_support())]
print(list(selected_feat))

The selected features are printed as below:

['battery_power', 'px_height', 'px_width', 'ram']

We can visualize the coefficient values as a histogram:

import matplotlib.pyplot as plt
%matplotlib inline
x=sel_.estimator_.feature_importances_.ravel()
plt.hist(x)
plt.grid() 
plt.xlabel('Feature Importance')
plt.ylabel('Count')
plt.show()
Histogram SelectFromModel RandomForest

Conclusion

Feature selection using tree-based approaches like random forests is generally a very useful technique for improving model performance. Also, since this approach is straightforward, fast and effective at selecting important features, many data scientists prefer it. With this, we hope you got a clear understanding of feature selection using random forests and how to use Recursive Feature Elimination.
