In data science projects, while working with classification problems it is quite common to encounter imbalanced data. Also, when you evaluate the model using metrics like accuracy_score(), you may get good accuracy since the the model may predict the dominant class (Class - No) really well even though it fails to predict the class with less numbers. This can be dangerous in case of medical use cases. This article is intended to help you understand how data imbalance can be handled when you work in real-world use cases in data science.

Regression and Classification are the main two types of supervised machine learning algorithms. The main difference between classification and regression is that, regression predicts the continuous output and classification predicts the discrete output values. You can think of predicting if a person has diabetics or not as a classification problem and predicting the oil price as a regression problem. Regression is useful when you want to estimate a continuous output value using a set of predictors (inputs) . Here we will see how we can implement linear regression using python and scikit-learn with the help of Advertisement dataset.