In data science projects, it is quite common to encounter imbalanced data when working on classification problems. When you evaluate a model using a metric such as accuracy_score(), you may get a high accuracy because the model predicts the dominant class (for example, class "No") really well, even though it fails to predict the class with fewer samples. This can be dangerous in medical use cases. This article is intended to help you understand how to handle data imbalance in real-world data science use cases.
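To make the accuracy pitfall concrete, here is a minimal sketch. The labels are invented for illustration (95 "No" samples and 5 "Yes" samples), and the "model" is simply assumed to always predict the dominant class; it is not taken from the article.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 samples of the dominant class (0 = "No")
# and 5 samples of the minority class (1 = "Yes").
y_true = [0] * 95 + [1] * 5

# An assumed degenerate model that always predicts the dominant class.
y_pred = [0] * 100

# Accuracy looks good because most samples belong to the dominant class.
accuracy = accuracy_score(y_true, y_pred)

# Recall on the minority class shows that none of the positives were found.
minority_recall = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy: {accuracy:.2f}")
print(f"minority-class recall: {minority_recall:.2f}")
```

Looking at recall (or precision, or a confusion matrix) alongside accuracy is one simple way to notice that the minority class is being missed.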
Exploratory Data Analysis (EDA) helps to visualize hidden and meaningful information inside the data. Once the data is visualized and understood, the further steps in the analysis, such as data cleaning and model building, can be planned effectively to meet the business outcomes. Furthermore, it helps to draw quick conclusions, as better decisions are made when the data is represented visually rather than as just a collection of numbers. In this article, we will see how to perform exploratory data analysis using matplotlib and seaborn to derive some insights from the data.
Breast cancer is a cancer that develops in breast cells. According to 2019 statistics for the U.S., about 1 in 8 U.S. women (about 12%) will develop invasive breast cancer over the course of her lifetime. In this article, we will see how to identify breast cancer using the K-Nearest Neighbors algorithm.
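As a rough sketch of what a K-Nearest Neighbors classifier for this task might look like: the example below assumes the breast cancer dataset bundled with scikit-learn (which may not be the dataset used in the article), and the choices of k = 5, feature scaling and a 75/25 split are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled breast cancer dataset stands in
# for whatever data the article works with.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the samples, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN is distance based, so the features are standardized first.
# n_neighbors=5 is an illustrative choice, not a tuned value.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

In practice the number of neighbors would usually be chosen with cross-validation rather than fixed up front.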
Here we have listed some widely used, well-known datasets that are very handy for your classification machine learning experiments.
Decision Trees and Random Forests are two powerful tree-based machine learning algorithms that are widely used by data scientists. One of the remarkable advantages of tree-based algorithms is that they can be easily interpreted. This also makes it straightforward to derive the importance of each variable in the decision-making process. In simple words, with tree-based methods it is easy to compute how much each variable contributes to the decision. In this article, we will see two approaches for feature selection using tree-based models.
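The teaser does not say which two approaches the article uses, so the sketch below shows two common options in scikit-learn as an assumption: ranking features by a random forest's feature_importances_, and letting SelectFromModel keep the features whose importance is at or above the mean. The data is synthetic and purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption): 10 features, only 3 of them informative.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Option 1: read the impurity-based importances and rank the features.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first
print("features ranked by importance:", ranking)

# Option 2: keep only the features whose importance is at least the mean.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
X_selected = selector.transform(X)
print("kept", X_selected.shape[1], "of", X.shape[1], "features")
```

Impurity-based importances can favor high-cardinality features, so it can be worth cross-checking them with another method before dropping variables.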
Classification algorithms are an integral part of any machine learning interview. Engineercshoice and technoaviyal have created a Classification Quiz to help you prepare for data science interviews.
The Quiz contains 10 well-designed questions in linear regression set by industry experts. If you are looking for a job or a job change in data science, this will help you to prepare and ace the interview!
Regression and classification are the two main types of supervised machine learning algorithms. The main difference between them is that regression predicts a continuous output, while classification predicts discrete output values. You can think of predicting whether a person has diabetes as a classification problem and predicting the oil price as a regression problem. Regression is useful when you want to estimate a continuous output value using a set of predictors (inputs). Here we will see how we can implement linear regression using Python and scikit-learn with the help of the Advertisement dataset.
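Here is a minimal sketch of fitting a linear regression with scikit-learn. The Advertisement dataset itself is not reproduced here, so the example generates a synthetic stand-in: the column names (tv, radio, sales) and the coefficients used to create the data are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for an advertising dataset: two predictors
# (ad spend on TV and radio) and a continuous target (sales).
tv = rng.uniform(0, 300, size=200)
radio = rng.uniform(0, 50, size=200)
noise = rng.normal(0, 1.0, size=200)
sales = 3.0 + 0.05 * tv + 0.2 * radio + noise  # assumed relationship

X = np.column_stack([tv, radio])

model = LinearRegression()
model.fit(X, sales)

# The fitted coefficients should land close to the values used above.
print("intercept:", model.intercept_)
print("coefficients (tv, radio):", model.coef_)

# R^2 on the training data, as a quick check of the fit.
r_squared = model.score(X, sales)
print(f"R^2: {r_squared:.3f}")
```

With real data you would load the file into a DataFrame, split it into training and test sets, and report the score on the held-out part instead.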
Generating a large volume of data for various purposes is often a hectic job and consumes a lot of time, especially in scenarios like testing your application with dummy data, filling database tables, running machine learning algorithms, and performance testing of applications. Here we are going to see an effective technique to generate a huge amount of data in seconds with a Python library called ‘Faker’.