Handling Data Imbalance for Data Science Projects

In data science projects, imbalanced data is common in classification problems. When you evaluate a model with a metric like accuracy_score(), you may get a high score because the model predicts the dominant class (for example, class "No") well, even though it fails to predict the class with fewer samples. This can be dangerous in medical use cases, for example when the rare class is the condition being screened for. This article is intended to help you understand how to handle data imbalance in real-world data science use cases.
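As a rough illustration of why accuracy alone can mislead, the sketch below scores a "model" that always predicts the majority class. The 950/50 class split is a made-up assumption chosen for illustration, not data from the article.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 950 samples of the majority class (0)
# and 50 samples of the minority class (1).
y_true = np.array([0] * 950 + [1] * 50)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The accuracy looks good, yet not a single minority sample is detected, which is why per-class metrics such as recall are worth checking on imbalanced data.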


Exploratory Data Analysis on Goodreads: Books Dataset

Exploratory Data Analysis (EDA) helps to surface hidden and meaningful information in the data by visualizing it. Once the data is visualized and understood, the further steps in the analysis, such as data cleaning and model building, can be planned effectively to meet the business outcomes. Furthermore, it helps you reach conclusions quickly, since decisions are easier to make when data is represented visually rather than as a collection of numbers. In this article, we will see how to perform exploratory data analysis using matplotlib and seaborn to derive some insights from the data.


Feature Selection using Tree-Based Methods and Recursive Feature Elimination (RFE)

Decision Trees and Random Forests are two powerful tree-based machine learning algorithms that are widely used by data scientists. One remarkable advantage of tree-based algorithms is that they can be easily interpreted. This also makes it straightforward to derive the importance of each variable in the decision-making process of a tree-based model. In simple words, in tree-based methods it is easy to compute how much each variable contributes to the decision. In this article, we will see two approaches for feature selection using tree-based models.
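A minimal sketch of the two approaches, assuming scikit-learn and a synthetic dataset from make_classification (the dataset, the choice of a random forest, and the number of features to keep are illustrative assumptions, not the article's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, of which 3 are informative (an assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Approach 1: rank features by the importances of a fitted forest.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
top_by_importance = importances.argsort()[::-1][:3]

# Approach 2: Recursive Feature Elimination, which repeatedly refits the
# model and drops the weakest feature until 3 remain.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=3,
)
rfe.fit(X, y)
selected_by_rfe = [i for i, keep in enumerate(rfe.support_) if keep]

print("Top 3 by importance:", sorted(top_by_importance.tolist()))
print("Selected by RFE:", selected_by_rfe)
```

The two lists often overlap, but they need not be identical, because RFE recomputes importances after each elimination step.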


Understand and Implement Linear Regression Using Scikit-Learn

Regression and classification are the two main types of supervised machine learning tasks. The main difference between them is that regression predicts a continuous output, while classification predicts discrete output values. You can think of predicting whether a person has diabetes as a classification problem, and predicting the oil price as a regression problem. Regression is useful when you want to estimate a continuous output value using a set of predictors (inputs). Here we will see how to implement linear regression using Python and scikit-learn with the help of the Advertisement dataset.
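A minimal sketch of fitting a linear regression with scikit-learn. Since the Advertisement dataset itself is not reproduced here, the example uses synthetic data; the two predictors and the coefficients used to generate the target are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two hypothetical predictors (for example, spend on two ad channels).
X = rng.uniform(0, 100, size=(200, 2))

# Synthetic continuous target: a linear combination plus Gaussian noise.
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on held-out data:", r2)
```

On this synthetic data, the fitted coefficients should land close to the values used to generate the target (0.05 and 0.2).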


Generate Millions of Data in Seconds – Faker Module in Python

Generating a large volume of data is often tedious and time-consuming, especially in scenarios like testing your application with dummy data, filling database tables, running machine learning algorithms, or performance testing. Here we are going to see an effective technique to generate a huge amount of data in seconds with a Python library called 'Faker'.
