Exploratory Data Analysis on Goodreads: Books Dataset

Exploratory Data Analysis (EDA) helps to surface hidden and meaningful information in data through visualization. It is always recommended to perform EDA before any modeling activity, because it reveals the important characteristics of the data and helps us understand its behavior. Once the data has been visualized and understood, the later steps of the analysis, such as data cleaning and model building, can be planned effectively to meet the business outcomes. Furthermore, EDA helps us reach conclusions quickly, since better decisions are made when data is represented visually rather than as a collection of numbers. In this article, we will see how to perform exploratory data analysis using matplotlib and seaborn to derive some insights from the data.

Dataset

For this EDA task, we use the Goodreads-books dataset. You can download the dataset from Kaggle or from here. The dataset contains around 13,000 rows and features such as title, authors, ratings, and reviews. Our objective is to extract useful information and build a summary of this large volume of data.

Importing Libraries and Loading Data

The dataset is stored on the D drive of my local machine, so we can read it directly. If the file contains malformed rows, they can be skipped while reading by passing error_bad_lines=False. In pandas 1.3 and later this option is replaced by on_bad_lines="skip", and error_bad_lines was removed in pandas 2.0. The data is loaded using the code below.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Skip malformed rows; on pandas older than 1.3 use error_bad_lines=False instead
books = pd.read_csv("D://books.csv", on_bad_lines="skip")
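
To see what skipping malformed rows actually does, here is a minimal sketch on an in-memory CSV. The sample text is made up for illustration, and on_bad_lines="skip" assumes pandas 1.3 or newer (older versions use error_bad_lines=False for the same effect).

```python
import io
import pandas as pd

# Hypothetical sample: the last data row has 4 fields instead of 3
raw = io.StringIO(
    "bookID,title,authors\n"
    "1,Book A,Author X\n"
    "2,Book B,Author Y\n"
    "3,Book C,Author Z,unexpected extra field\n"
)

# Rows with too many fields are dropped instead of raising a parser error
sample = pd.read_csv(raw, on_bad_lines="skip")
print(len(sample))  # 2 rows survive
```

Skipping rows silently loses data, so it is worth comparing the number of rows read against the number of lines in the file when the difference matters.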

The first 2 rows are:

books.head(2)

bookID  title                                      authors                     average_rating  isbn        isbn13         language_code  # num_pages  ratings_count  text_reviews_count
1       Harry Potter and the Half-Blood Prince     J.K. Rowling-Mary GrandPré  4.56            0439785960  9780439785969  eng            652          1944099        26249
2       Harry Potter and the Order of the Phoenix  J.K. Rowling-Mary GrandPré  4.49            0439358078  9780439358071  eng            870          1996446        27613

Shape (rows x columns) of the dataframe:
books.shape
(13714, 10)
The dataframe has 13714 rows and 10 columns.
books.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13714 entries, 0 to 13713
Data columns (total 10 columns):
bookID                13714 non-null int64
title                 13714 non-null object
authors               13714 non-null object
average_rating        13714 non-null float64
isbn                  13714 non-null object
isbn13                13714 non-null int64
language_code         13714 non-null object
# num_pages           13714 non-null int64
ratings_count         13714 non-null int64
text_reviews_count    13714 non-null int64
dtypes: float64(1), int64(5), object(4)
memory usage: 1.0+ MB
Let us see how many unique authors are present in the dataframe.
n_authors=books['authors'].nunique()
print(f'No of Unique Authors : {n_authors}')
No of Unique Authors : 7600

Exploratory Data Analysis

For the Exploratory Data Analysis, we employ matplotlib and seaborn libraries, which are very powerful and popular for data visualizations.

Checking the Correlation

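f, ax = plt.subplots(figsize=(16, 8))
# Restrict to numeric columns so corr() does not fail on text columns such as title
sns.heatmap(books.select_dtypes('number').corr(), annot=True, linewidths=0.6, fmt='.2f', ax=ax)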
Correlation Heatmap Seaborn

Heatmaps are a good way to visually represent the correlation among the features. In the above heatmap, we can see a strong positive correlation between ratings_count and text_reviews_count.

We can visualize this further using a scatterplot.

plt.figure(figsize=(16, 8))
sns.set()
ax = sns.scatterplot(x="ratings_count", y="text_reviews_count", data=books)
Scatterplot Seaborn

The heatmap already showed a strong positive correlation of 0.86 between the two, and the scatterplot confirms the positive trend: in general, as the ratings count increases, the number of text reviews also increases.
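
The value shown in a heatmap cell can also be computed directly for a single pair of columns. The sketch below uses a small made-up dataframe in place of the real data, so its numbers are illustrative only; the column names match the dataset.

```python
import pandas as pd

# Toy stand-in for the two columns; the values are made up for illustration
toy = pd.DataFrame({
    "ratings_count": [10, 200, 3000, 45000, 600000],
    "text_reviews_count": [1, 15, 220, 2500, 9000],
})

# Pearson correlation between two columns, the same statistic DataFrame.corr() reports
r = toy["ratings_count"].corr(toy["text_reviews_count"])
print(round(r, 2))
```

On the real dataframe, books["ratings_count"].corr(books["text_reviews_count"]) should reproduce the figure annotated in the heatmap.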

Let us now check whether a similar relationship exists between the number of pages and the average rating. We can plot a scatterplot of these two features directly.

plt.figure(figsize=(16, 8))
sns.set()
ax = sns.scatterplot(x="average_rating", y="# num_pages", data=books)
ax.set(xlabel='Average Rating of Books', ylabel='Total Number of Pages')
Seaborn Scatterplot

In this plot there is no clear relationship between the two features, and most of the books have a rating between 3 and 4.8. Let us look at the ratings more precisely in a different plot.

Distribution of Average Rating:

Let us now see the distribution of the average rating of books.

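f, ax = plt.subplots(figsize=(10, 5))
# histplot replaces distplot(kde=False), which is deprecated since seaborn 0.11
sns.histplot(x=books['average_rating'], ax=ax)
# Remove all four spines for a cleaner look
sns.despine(left=True, bottom=True)
ax.set(xlabel='Average Rating')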
Distribution of Average Rating Seaborn

Consistent with the scatterplot, the majority of books have average ratings roughly between 3 and 4.5, which is clearly visible in the plot above.
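
A claim such as "most ratings fall in a given range" can be backed with a number as well as a plot. This is a minimal sketch on made-up ratings; the 3 to 4.5 bounds are taken from the text above, and the data is purely illustrative.

```python
import pandas as pd

# Hypothetical ratings standing in for books['average_rating']
toy = pd.DataFrame({"average_rating": [0.0, 2.9, 3.5, 3.9, 4.1, 4.4, 4.7, 5.0]})

# between() is inclusive on both ends by default and returns a boolean Series
in_range = toy["average_rating"].between(3, 4.5)

# The mean of a boolean Series is the fraction of True values
share = in_range.mean()
print(f"{share:.0%} of the ratings lie between 3 and 4.5")
```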

Most Rated Books of an Author:

We can visualize the most rated books of a particular author, say Agatha Christie, using the code below.
f,ax = plt.subplots(figsize=(12, 6))
author_agatha = books['authors']=='Agatha Christie'
agatha_books = books[author_agatha]
ratings_count= agatha_books.groupby('title')['ratings_count'].sum().reset_index().sort_values('ratings_count',
ascending=False).head(5)
sns.barplot(y=ratings_count['title'],x=ratings_count['ratings_count'])
Seaborn Visualization Top Rated Books
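
The exact comparison above only matches rows whose authors field is exactly 'Agatha Christie'. The sample rows shown earlier store several contributors in one string (for example "J.K. Rowling-Mary GrandPré"), so a substring match is one possible alternative. The helper below is a hypothetical sketch, not part of the original analysis, and the toy data is made up.

```python
import pandas as pd

def top_rated_by_author(df, author, n=5):
    """Return the n titles with the most ratings whose authors field mentions `author`."""
    # regex=False treats the name as plain text rather than a pattern
    mask = df["authors"].str.contains(author, regex=False)
    return (
        df[mask]
        .groupby("title")["ratings_count"]
        .sum()
        .sort_values(ascending=False)
        .head(n)
    )

# Made-up rows: title "A" appears twice, once with a second contributor
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "authors": [
        "Agatha Christie",
        "Agatha Christie-Some Narrator",
        "Agatha Christie",
        "Someone Else",
    ],
    "ratings_count": [100, 50, 120, 999],
})

top = top_rated_by_author(toy, "Agatha Christie", n=2)
print(top)
```

A substring match can also pick up unintended rows (a different author whose name contains the search text), so the result is worth a quick manual check.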

Top Authors:

plt.figure(figsize=(16, 8))
sns.set_palette("Paired")
plt.title("Authors and Number of Books Written")
ax=sns.countplot(x = "authors",order=books['authors'].value_counts().index[0:5],data=books)
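# Turn the grid on explicitly (a bare ax.grid() toggles the current state)
ax.grid(True)
ax.set(xlabel='Author Name', ylabel='Number of Books')
# Write each bar's count just above the bar
for i in ax.patches:
    ax.text(i.get_x()+.3, i.get_height()+0.3, str(int(i.get_height())), fontsize = 12, color = 'k')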

Note that seaborn offers a lot of customization. Here we have set the color palette to "Paired" and used countplot(), which shows the counts of observations in each categorical bin using bars. Setting style="darkgrid" (the default style applied by sns.set()) gives the grid on the graph. By default, the value is not shown on top of each bar, so custom code is included to add the text. The three arguments of the text method are the x position, the y position, and the text string, respectively. With the above code, we visualize the top 5 authors with the maximum number of books.

Seaborn Count Plot of the Authors with the Maximum Number of Books
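
The labelling step can be tried in isolation with plain matplotlib. The sketch below uses hypothetical author names and counts, and centers each label over its bar instead of using a fixed horizontal offset; the Agg backend line is only there so the snippet runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical data standing in for the top authors and their book counts
authors = ["Author A", "Author B", "Author C"]
counts = [40, 33, 28]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(authors, counts)

for patch in bars.patches:
    x = patch.get_x() + patch.get_width() / 2   # horizontal center of the bar
    y = patch.get_height() + 0.3                # slightly above the top of the bar
    ax.text(x, y, str(int(patch.get_height())), ha="center", fontsize=12, color="k")

# Collect the label strings that were drawn, for inspection
labels = [t.get_text() for t in ax.texts]
print(labels)
```

Matplotlib 3.4 and later also provide ax.bar_label(bars), which places such labels without computing the positions by hand.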

Top 3 Books with Maximum Rating Counts:

We can visualize the books that received the maximum number of ratings from readers. For this we can use a horizontal barplot.

view_rating = books.sort_values('ratings_count', ascending=False).head(3).set_index('title')
fig, ax = plt.subplots(figsize=(10, 8))
sns.set_palette("YlGn")

# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=view_rating['ratings_count'], y=view_rating.index, label="Total", orient="h", ax=ax)
ax.set(ylabel="Books", xlabel="Rating Counts")
ax.set_title('Top 3 Books with Maximum Number of Ratings')

For this we sort the dataframe in descending order of ratings_count, select the top 3 rows, and set the title as the index. We use the barplot() method to draw the horizontal bar plot. The orient argument defines the orientation of the plot (horizontal or vertical) and is optional, since it is usually inferred from the dtype of the input variables.

Seaborn Horizontal Barplot
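
The sort-then-head pattern described above has a shorter equivalent in pandas, DataFrame.nlargest. The sketch below compares the two on a made-up dataframe; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy data with distinct rating counts (values are illustrative)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "ratings_count": [500, 1200, 300, 900],
})

# Approach used in the article: sort descending, keep the first 3 rows
top3_sorted = toy.sort_values("ratings_count", ascending=False).head(3).set_index("title")

# Equivalent shortcut: ask directly for the 3 largest values of the column
top3_nlargest = toy.nlargest(3, "ratings_count").set_index("title")

print(top3_nlargest)
```

Both give the same rows in the same order here; with tied values the order of the tied rows may differ between the two approaches.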

Analyzing Books by Language

First of all, let us see how many different languages are present in the dataset.

print(books['language_code'].unique())

The unique language list is:

['eng' 'en-US' 'spa' 'fre' 'en-GB' 'mul' 'ger' 'ara' 'por' 'grc' 'en-CA'
 'enm' 'jpn' 'dan' 'zho' 'nl' 'ita' 'lat' 'srp' 'rus' 'tur' 'msa' 'swe'
 'glg' 'cat' 'wel' 'heb' 'nor' 'gla' 'ale']

Now we plot the language against the number of books written.

f,ax = plt.subplots(figsize=(10, 5))
langs = books['language_code'].value_counts().head(5)
sns.barplot(x=langs, y=langs.index)
ax.set(xlabel='Number of Books',ylabel='Language Code')
Seaborn Barplot of the Top 5 Languages by Number of Books

The graph shows that English is the most common language in which the books are written. The different variants of English, such as American English and British English, are treated as separate categories here. We can combine them to visualize this better.

plt.subplots(figsize=(10, 8))
plt.rcParams['xtick.color'] = '#909090'
labels = ['English', 'Other Languages']
# Treat every English variant as English
eng_books = books[books['language_code'].isin(['eng', 'en-US', 'en-GB', 'en-CA'])]
# The second slice is everything that is not English, not the whole dataset
sizes = [eng_books.shape[0], books.shape[0] - eng_books.shape[0]]
explode=(0.05, 0)
plt.pie(sizes, labels=labels, explode=explode, textprops=dict(fontsize=16), autopct='%1.0f%%', shadow=True, startangle=90)
plt.title('English vs Other Languages', fontsize=20, fontweight='bold')
Pie Chart Seaborn
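
For the percentages on a two-slice pie chart to be meaningful, the two sizes must partition the data: every book is counted in exactly one slice. A minimal sketch of that split on made-up language codes, using a boolean mask and its negation:

```python
import pandas as pd

# Hypothetical language codes standing in for books['language_code']
toy = pd.DataFrame({
    "language_code": ["eng", "en-US", "en-GB", "spa", "fre", "en-CA", "eng", "ger"]
})

# All codes that should count as English
english_codes = {"eng", "en-US", "en-GB", "en-CA"}
is_english = toy["language_code"].isin(english_codes)

# The mask and its negation cover every row exactly once
sizes = [int(is_english.sum()), int((~is_english).sum())]
print(sizes)  # [5, 3]
```

Because the two counts add up to the number of rows, the autopct percentages in plt.pie add up to 100%.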

Bonus: Word Cloud

Word clouds are an effective technique for representing text data, where the size of each word shows its importance. Now we will look at the distribution of the different languages using a word cloud. For this, we need the wordcloud library, which you can install with pip as follows:

pip install wordcloud

Now we can import WordCloud into our code and set attributes such as the source data, height, and width, as shown below.

from wordcloud import WordCloud
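# Word sizes are driven by how many books use each language code
language_counts = books['language_code'].value_counts().to_dict()
wordcloud = WordCloud(width=900, height=600, relative_scaling=0.4).generate_from_frequencies(language_counts)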
fig = plt.figure(figsize=(13, 13))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Python Wordcloud to Visualize Text

Conclusion

We're all done visualizing our Goodreads books dataset. Here we have used seaborn extensively to create some good-quality charts. You can add more charts to deepen the exploratory data analysis (EDA), and we hope this task was helpful to you in learning data visualization.
