Day 1 - Live EDA and Feature Engineering

This blog is part of a multi-part series in which we will cover the following:

  • Exploratory Data Analysis
  • Missing Data and Outlier handling
  • Feature Engineering and Feature Selection

Why EDA?

    We all know that extensive EDA is an integral part of the data science lifecycle, but why do we need to perform it? The simple answer is to get to know our data better and to build an intuitive understanding of it. In a classification problem, for example, EDA helps us explore in which cases our model is likely to give the right prediction and in which it is not. The insights gathered from the data can also be checked against what the Subject Matter Experts say, to see whether the data agrees with their years of experience.

Download the dataset from the link below:

https://github.com/krishnaik06/5-Days-Live-EDA-and-Feature-Engineering

In today’s session we will take up the Zomato dataset and perform exploratory data analysis and feature engineering to make it fit for training a model.

Here are the topics that will be covered in today’s session:

  • What is EDA?
  • Data point/Vector/Observation
  • Dataset
  • Feature/Variable/Input variable/Independent variable
  • Label/Dependent variable/Output variable/Class/Class label/Response label
  • Vector: 2-D, 3-D, 4-D, …, n-D
  • Handling Missing Values
  • Checking the relationship between missing values and dependent variable
  • Analyzing Numerical Variables
  • Analyzing Temporal Variables (Datetime)
  • Analyzing categorical and discrete variables and their relationship with the dependent feature
  • Checking outliers using box plot
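
The terms in the list above (data point, feature, label, n-D vector) can be illustrated with a small sketch. The column names and values below are invented for illustration and are not taken from the Zomato file.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the Zomato file.
data = pd.DataFrame({
    "votes": [120, 45, 300],
    "average_cost": [500, 200, 1500],
    "has_online_delivery": [1, 0, 1],
    "rating_text": ["Good", "Average", "Excellent"],
})

# Features (independent variables): the inputs to a model.
X = data[["votes", "average_cost", "has_online_delivery"]]

# Label (dependent variable): the output we want to predict.
y = data["rating_text"]

# A data point (observation) is one row of the dataset. With three
# features, each data point is a 3-D vector.
first_point = X.iloc[0].to_numpy()

# The dataset as a whole is a collection of such data points.
n_points, n_dims = X.shape
print(first_point)
print(n_points, n_dims)
```

With n features, each row becomes an n-D vector, which is all that "Vector: 2-D, 3-D, 4-D, …, n-D" refers to.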

Our main aim with the Zomato dataset is to analyze it and come up with some important observations and conclusions.

We start by importing the basic libraries: pandas, NumPy, Matplotlib and seaborn.

We also set Matplotlib's default figure size for our plots.

				
					
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
				
			
				
					
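# Default figure size (width, height) in inches.
# plt.rcParams is used because only matplotlib.pyplot was imported above.
plt.rcParams['figure.figsize'] = (12, 6)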
				
			

Next, we load the dataset and create our dataframe:

				
					
df=pd.read_csv('zomato.csv',encoding='latin-1')
df.head()
				
			

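You can check the basic information about the dataframe with the code below: df.info() lists the column types and non-null counts, and df.describe() gives summary statistics for the numeric columns.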

				
					
df.info()
df.describe()
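
By default, describe() summarizes only the numeric columns. As a minimal sketch on a toy dataframe (the columns and values are invented for illustration), passing include="all" also summarizes the text columns:

```python
import pandas as pd

# Toy data invented for illustration; not taken from the Zomato file.
toy = pd.DataFrame({
    "Votes": [10, 20, 30, 40],
    "City": ["Pune", "Delhi", "Delhi", None],
})

# Numeric columns only (the default behaviour).
numeric_summary = toy.describe()

# include="all" adds count/unique/top/freq for the non-numeric columns.
full_summary = toy.describe(include="all")

print(numeric_summary.loc["mean", "Votes"])
print(full_summary.loc["top", "City"])    # most frequent value
print(full_summary.loc["count", "City"])  # non-null values only
```

The count row is a quick first hint of missing data: here City has one fewer non-null value than Votes.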
				
			

In EDA we usually perform steps such as:

  1. Exploring About Missing values
  2. Exploring the features in the datasets
  3. Exploring the relationship between the features to come up with conclusions and observations 

Let's check the missing values with the code below, using a heatmap to visualize them:

				
					
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
				
			

The yellow marks indicate the missing values in a feature. From this plot we can conclude that the Cuisines feature has some missing values.

We can check how many values are missing in each column using the code below:

				
					
df.isnull().sum()
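
Raw counts are easier to judge as a share of all rows. The sketch below runs on a toy dataframe (values invented for illustration), lists only the features that have missing values, and shows the percentage missing in each:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; not taken from the Zomato file.
toy = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Cuisines": ["Indian", np.nan, "Chinese", np.nan],
    "Votes": [10, 20, 30, 40],
})

# Number of missing values per column (same call as above).
missing_counts = toy.isnull().sum()

# Keep only the features that actually contain missing values.
features_with_na = [col for col in toy.columns if toy[col].isnull().sum() > 0]

# The mean of a boolean mask is the fraction of True values,
# so multiplying by 100 gives the percentage of missing rows.
missing_pct = (toy[features_with_na].isnull().mean() * 100).round(2)

print(features_with_na)
print(missing_pct)
```

The same three lines should work on df unchanged, since they rely only on isnull().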
				
			

We can create a pie chart to see which countries account for most of the Zomato records, using the code below:

				
					
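# final_df is not defined earlier in this post; it is assumed here to be
# the dataframe with a 'Country' column already added.
country_names = final_df.Country.value_counts().index
country_val = final_df.Country.value_counts().values
# Pie chart: top 3 countries that use Zomato
plt.pie(country_val[:3], labels=country_names[:3], autopct='%1.2f%%')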
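
The post does not show how final_df gets its Country column. One plausible way, sketched here purely as an assumption, is a left merge with a country lookup table. The 'Country Code' key, the lookup table and all values below are hypothetical stand-ins, not taken from the original code.

```python
import pandas as pd

# Hypothetical stand-ins for the restaurant data and a country lookup table.
restaurants = pd.DataFrame({
    "Restaurant Name": list("ABCDEF"),
    "Country Code": [1, 1, 1, 216, 216, 215],
})
country_lookup = pd.DataFrame({
    "Country Code": [1, 216, 215],
    "Country": ["India", "United States", "United Kingdom"],
})

# A left merge keeps every restaurant row and attaches its country name.
toy_final = pd.merge(restaurants, country_lookup, on="Country Code", how="left")

# value_counts() sorts by frequency (descending), so slicing [:3]
# gives the three most common countries.
counts = toy_final["Country"].value_counts()
country_names = counts.index
country_val = counts.values

# Percentage share of each country across the rows in this toy table.
share = (counts / counts.sum() * 100).round(2)
print(share)
```

Note that plt.pie with autopct reports each slice as a percentage of the values passed to it, so with country_val[:3] the percentages are shares among the top three countries only, not of the whole dataset.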
				
			

We can group the data by aggregate rating, rating color and rating text, and then plot the number of records for each aggregate rating as a bar graph, using the code below:

				
					
# Count the rows for each (aggregate rating, color, text) combination
ratings = (final_df.groupby(['Aggregate rating', 'Rating color', 'Rating text'])
           .size().reset_index().rename(columns={0: 'Rating Count'}))
plt.rcParams['figure.figsize'] = (12, 6)
sns.barplot(x="Aggregate rating", y="Rating Count", data=ratings)
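
To see what the groupby chain produces, here is a minimal sketch on a toy dataframe. The column names match the code above, but the values are invented for illustration.

```python
import pandas as pd

# Toy data invented for illustration; column names follow the code above.
toy = pd.DataFrame({
    "Aggregate rating": [0.0, 0.0, 3.1, 3.1, 3.1, 4.6],
    "Rating color": ["White", "White", "Orange", "Orange", "Orange", "Dark Green"],
    "Rating text": ["Not rated", "Not rated", "Average", "Average", "Average", "Excellent"],
})

ratings = (
    toy.groupby(["Aggregate rating", "Rating color", "Rating text"])
    .size()           # number of rows in each group (a Series)
    .reset_index()    # group keys become columns; the counts land in column 0
    .rename(columns={0: "Rating Count"})
)

print(ratings)
```

Each row of the result is one unique combination of rating, color and text together with how many records fall into it. Calling .reset_index(name="Rating Count") on the size() result is an equivalent way to name the count column in one step.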
				
			

To make the graph easier to read, we will set a bar color for each rating color:

				
					
sns.barplot(x="Aggregate rating",
y="Rating Count",hue='Rating color',data=ratings,
palette=['blue','red','orange','yellow',
'green','green'])
				
			

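We will also create a count plot of the rating colors. Because the plot is drawn from the grouped ratings table, each bar counts the rows of that table with a given color, not the number of restaurants.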

				
					
## Count plot
sns.countplot(x="Rating color",
data=ratings,
palette=['blue','red','orange','yellow'
,'green','green'])
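
The difference between counting rows of the grouped table and counting restaurants can be checked with a small sketch. The toy ratings table below is invented for illustration and only mimics the shape of the real one.

```python
import pandas as pd

# Toy grouped table invented for illustration: one row per rating value.
ratings = pd.DataFrame({
    "Aggregate rating": [3.0, 3.1, 3.2, 4.6],
    "Rating color": ["Orange", "Orange", "Orange", "Dark Green"],
    "Rating Count": [100, 150, 120, 30],
})

# What a count plot on this table shows: rows per color.
rows_per_color = ratings["Rating color"].value_counts()

# Number of restaurants per color: sum the counts within each color.
restaurants_per_color = ratings.groupby("Rating color")["Rating Count"].sum()

print(rows_per_color)
print(restaurants_per_color)
```

If restaurants per color is the number you want to plot, one option is to draw the count plot from the ungrouped dataframe, or to bar-plot the summed counts instead.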
				
			

You can check the detailed analysis in the video, where we worked through many important KPIs and made some quick observations.

You can download the code from the GitHub URL.

Download Code
