# Day 1 - Live EDA and Feature Engineering

This blog is part of a multi-part series in which we will cover the following:

• Exploratory Data Analysis
• Missing Data and Outlier handling
• Feature Engineering and Feature Selection

### Why EDA?

We all know that extensive EDA is an integral part of the data science lifecycle, but why do we need to perform it? The simple answer is to get to know our data better and to build an intuitive understanding of it. For a classification problem, for example, EDA lets us explore in which cases our model is likely to give the right prediction and in which it is not. The insights gathered from the data can also be checked against what the Subject Matter Experts have to say, to see whether the data agrees with their years of experience.

https://github.com/krishnaik06/5-Days-Live-EDA-and-Feature-Engineering

In today’s session we will take up the Zomato dataset and perform exploratory data analysis and feature engineering to make it fit for training a model.

Here are the topics that will be covered in today’s session:

• What is EDA?
• Data-point/Vector/Observation
• Data-set
• Feature/Variable/Input-variable/Independent-variable
• Label/Dependent-variable/Output-variable/Class/Class-label/Response label
• Vector: 2-D, 3-D, 4-D, …, n-D
• Handling Missing Values
• Checking the relationship between missing values and dependent variable
• Analyzing Numerical Variables
• Analyzing Temporal Variables (Datetime)
• Analyzing categorical and discrete variables and their relationship with the dependent feature
• Checking outliers using box plot

Our main aim with the Zomato dataset is to analyze it and come up with some important observations and conclusions.
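
To make the terminology in the list above concrete, here is a minimal sketch on a toy table. The column names and values are made up for illustration and are not taken from the Zomato dataset: each row is a data point (a vector), the input columns are the features, and the column we want to predict is the label.

```python
import pandas as pd

# Toy data set: every value here is invented for illustration only.
data = pd.DataFrame({
    "votes": [120, 15, 640],
    "average_cost": [500, 200, 1500],
    "has_online_delivery": [1, 0, 1],
    "rating_text": ["Good", "Average", "Excellent"],
})

# Features / independent variables: the inputs to a model.
X = data.drop(columns=["rating_text"])

# Label / dependent variable: the output we want to predict.
y = data["rating_text"]

# One data point (observation) is one row, here a 3-D vector.
first_point = X.iloc[0].to_numpy()

print(X.shape)       # (number of data points, number of features)
print(first_point)
print(y.tolist())
```

With n feature columns, every row becomes an n-D vector, which is what the "2-D, 3-D, …, n-D" item refers to.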

We will start by importing the basic libraries: pandas, numpy, matplotlib and seaborn.

We will also set up some matplotlib properties for our figures.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
```python
import matplotlib

# Default figure size (width, height) in inches
matplotlib.rcParams['figure.figsize'] = (12, 6)
```

Next, we import the dataset and create our dataframe.

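
The loading code did not survive in this post, so the following is only a sketch of what the step might look like. It assumes the restaurant data is a CSV with a `Country Code` column and that `final_df`, used in later snippets, comes from merging it with a country-code lookup table. The tiny in-memory tables below are hypothetical stand-ins for the real files, whose names and encoding are not given here.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the real files (names and contents are assumptions).
restaurants_csv = io.StringIO(
    "Restaurant ID,Country Code,Cuisines,Aggregate rating\n"
    "1,1,North Indian,4.1\n"
    "2,216,,3.5\n"
    "3,1,Chinese,0.0\n"
)
countries_csv = io.StringIO(
    "Country Code,Country\n"
    "1,India\n"
    "216,United States\n"
)

# In practice you would pass the path of the downloaded file to pd.read_csv.
df = pd.read_csv(restaurants_csv)
df_country = pd.read_csv(countries_csv)

# A left merge keeps every restaurant row and adds the country name.
final_df = pd.merge(df, df_country, on="Country Code", how="left")

print(final_df.shape)
print(final_df[["Restaurant ID", "Country"]])
```

A left merge is used in this sketch so that no restaurant row is dropped even if its country code is missing from the lookup table.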

You can check the basic info using the code below:

```python
df.info()
df.describe()
```

In EDA we usually perform steps such as:

1. Checking for missing values
2. Exploring the features in the dataset
3. Exploring the relationships between the features to come up with conclusions and observations

Let's go ahead and check the missing values with the code below, using a visualization as well.

```python
# Missing cells are drawn in yellow, present cells in purple
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

The yellow marks indicate cells that have a missing value. From the heatmap we can conclude that the Cuisines feature has some missing values.

We can check how many values are missing in each column using the code below:

```python
df.isnull().sum()
```
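
On a wide dataframe, the full output of `isnull().sum()` is mostly zeros. The sketch below, on a small made-up table (`df_demo` and its values are hypothetical), keeps only the columns that have missing values, expresses them as percentages, and takes a first look at how missingness relates to the dependent variable, which is one of the topics listed for this session.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four rows have no Cuisines value.
df_demo = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Cuisines": ["Indian", np.nan, "Chinese", np.nan],
    "Aggregate rating": [4.0, 3.0, 3.5, 2.5],
})

# Keep only the columns that actually contain missing values.
missing_counts = df_demo.isnull().sum()
missing_cols = missing_counts[missing_counts > 0]

# Share of missing values per column, in percent.
missing_pct = (df_demo.isnull().mean() * 100).round(2)

# Compare the mean rating of rows where Cuisines is missing vs present.
flag = np.where(df_demo["Cuisines"].isnull(), "missing", "present")
rating_by_flag = df_demo.groupby(flag)["Aggregate rating"].mean()

print(missing_cols)
print(missing_pct)
print(rating_by_flag)
```

If the mean of the dependent variable differs a lot between the two groups, the missingness itself may carry information and is worth keeping as a feature rather than silently dropping the rows.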

We can create a pie chart showing how the records are distributed across countries using the code below:

```python
# Number of records per country, most frequent first
country_names = final_df.Country.value_counts().index
country_val = final_df.Country.value_counts().values

# Pie chart: top 3 countries that use Zomato
plt.pie(country_val[:3], labels=country_names[:3], autopct='%1.2f%%')
```
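
One thing to keep in mind when reading the pie chart: by default, matplotlib's `pie` scales the values it receives so that they sum to 100%, so the percentages are shares among the top 3 countries only, not shares of the whole dataset. The sketch below reproduces both numbers with pandas on made-up counts (the country counts are hypothetical).

```python
import pandas as pd

# Made-up country column: 100 rows spread over four countries.
countries = pd.Series(
    ["India"] * 80 + ["United States"] * 10 + ["United Kingdom"] * 6 + ["Brazil"] * 4,
    name="Country",
)

counts = countries.value_counts()   # sorted from most to least frequent
top3 = counts.iloc[:3]

# What a pie chart of the top 3 displays: shares within the top 3 only.
share_within_top3 = (top3 / top3.sum() * 100).round(2)

# Shares relative to the full dataset, for comparison.
share_of_all = (top3 / counts.sum() * 100).round(2)

print(share_within_top3)
print(share_of_all)
```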

We can plot a bar graph using the code below, where we compare the aggregate rating with the rating color and rating text:

```python
import matplotlib

# Count the rows for each (rating, color, text) combination
ratings = (
    final_df.groupby(['Aggregate rating', 'Rating color', 'Rating text'])
    .size()
    .reset_index()
    .rename(columns={0: 'Rating Count'})
)

matplotlib.rcParams['figure.figsize'] = (12, 6)
sns.barplot(x="Aggregate rating", y="Rating Count", data=ratings)
```
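
To see what the `groupby(...).size().reset_index()` chain produces, here is a small sketch on made-up rows (the `toy` dataframe is hypothetical). It uses `reset_index(name=...)`, which names the count column directly and is equivalent to the `rename` step above.

```python
import pandas as pd

# Made-up rows with the same three rating columns.
toy = pd.DataFrame({
    "Aggregate rating": [0.0, 0.0, 3.1, 3.1, 3.1, 4.5],
    "Rating color": ["White", "White", "Orange", "Orange", "Orange", "Dark Green"],
    "Rating text": ["Not rated", "Not rated", "Average", "Average", "Average", "Excellent"],
})

# One output row per distinct (rating, color, text) combination,
# with the number of input rows that fall into it.
ratings_demo = (
    toy.groupby(["Aggregate rating", "Rating color", "Rating text"])
    .size()
    .reset_index(name="Rating Count")
)

print(ratings_demo)
```

The result is a regular dataframe with one row per combination, which is the shape `sns.barplot` expects for its `x` and `y` columns.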

To make the graph easier to read, we will set up colors for the bars:

```python
sns.barplot(
    x="Aggregate rating", y="Rating Count", hue='Rating color', data=ratings,
    palette=['blue', 'red', 'orange', 'yellow', 'green', 'green'],
)
```

We will also go ahead and create a count plot to check the ratings

```python
# Count plot
sns.countplot(
    x="Rating color", data=ratings,
    palette=['blue', 'red', 'orange', 'yellow', 'green', 'green'],
)
```
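
A count plot tallies the rows of the dataframe it is given. Because `ratings` is already aggregated, the bars show how many distinct rating values fall under each color, not how many restaurants do. The sketch below, on made-up rows (`toy` is hypothetical), computes both quantities with pandas so the difference is visible.

```python
import pandas as pd

# Made-up rows: Orange covers two distinct rating values (3.1 and 3.2).
toy = pd.DataFrame({
    "Aggregate rating": [0.0, 0.0, 3.1, 3.1, 3.1, 3.2],
    "Rating color": ["White", "White", "Orange", "Orange", "Orange", "Orange"],
    "Rating text": ["Not rated", "Not rated", "Average", "Average", "Average", "Average"],
})

ratings_demo = (
    toy.groupby(["Aggregate rating", "Rating color", "Rating text"])
    .size()
    .reset_index(name="Rating Count")
)

# What a count plot on the aggregated frame tallies: groups per color.
groups_per_color = ratings_demo["Rating color"].value_counts()

# Number of underlying rows (restaurants) per color.
restaurants_per_color = ratings_demo.groupby("Rating color")["Rating Count"].sum()

print(groups_per_color)
print(restaurants_per_color)
```

To plot restaurants per color instead, one option is to pass the unaggregated dataframe to `sns.countplot`.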

You can check the detailed analysis in the video, where we worked through many important KPIs and made some quick observations.