The post Detailed Salary Analysis Of Data Scientist appeared first on Krish Naik.

Here’s a more detailed salary analysis for data scientists:

Experience: Experience is a significant factor that can influence the salary of a data scientist. A data scientist with 1-3 years of experience can expect to earn an average salary of around **$89,000** per year. With 4-6 years of experience, the average salary increases to approximately **$115,000** per year. A data scientist with 7-9 years of experience can earn an average salary of around **$132,000** per year.

Location: The location of a data scientist can also have a significant impact on their salary. Data scientists working in cities with a high cost of living such as San Francisco, New York, and Boston can expect to earn higher salaries than those working in cities with a lower cost of living. According to Glassdoor, the average salary for a data scientist in San Francisco is **$135,000** per year, while the average salary for a data scientist in Austin, Texas is **$95,000** per year.

Industry: The industry in which a data scientist works can also impact their salary. Data scientists working in finance, healthcare, and technology tend to earn higher salaries than those working in other industries. According to Indeed, the average salary for a data scientist in the finance industry is **$116,279** per year, while the average salary for a data scientist in the healthcare industry is **$110,742** per year.

Company size: The size of the company can also impact the salary of a data scientist. Data scientists working for larger companies tend to earn higher salaries than those working for smaller companies. According to Glassdoor, the average salary for a data scientist at Google is **$148,000** per year, while the average salary for a data scientist at a startup with fewer than 50 employees is **$96,000** per year.

Skills: The skills a data scientist possesses can also have an impact on their salary. Data scientists with skills in machine learning, artificial intelligence, and deep learning tend to earn higher salaries than those without those skills. According to PayScale, the average salary for a data scientist with skills in machine learning is **$107,000** per year, while the average salary for a data scientist with skills in deep learning is **$120,000** per year.

Overall, the salary of a data scientist can vary depending on a variety of factors. However, the average salary for a data scientist in the United States is around **$113,000** per year, according to Glassdoor.

Here’s a country-wise breakdown of data scientist salaries:

United States: The average salary for a data scientist in the United States is around **$113,000** per year, according to Glassdoor. However, salaries can vary depending on factors such as experience, location, and industry. For example, the average salary for a data scientist in San Francisco is **$135,000** per year, while the average salary for a data scientist in Austin, Texas is $95,000 per year.

Canada: The average salary for a data scientist in Canada is around CAD **85,000** per year, according to Glassdoor. Salaries can vary depending on factors such as experience, location, and industry. For example, the average salary for a data scientist in Toronto is CAD **94,000** per year, while the average salary for a data scientist in Vancouver is CAD 77,000 per year.

India: The average salary for a data scientist in India is around INR **700,000** per year, according to Glassdoor. Salaries can vary depending on factors such as experience, location, and industry. For example, the average salary for a data scientist in Bangalore is around INR **1,100,000** per year, while the average salary for a data scientist in Mumbai is around INR **900,000** per year.

It’s important to note that these are averages and individual salaries can vary significantly based on the aforementioned factors.


The post Python Lambda Function With Real World Examples appeared first on Krish Naik.

In Python, a lambda function (also called an anonymous function) is a small function that can be defined in a single line of code without a name. It is useful when we need a simple function that we don’t want to define explicitly using the `def` keyword.

The basic syntax for a lambda function in Python is:

```python
lambda arguments: expression
```

Here, `arguments` refers to the input arguments for the function and `expression` is a single expression that gets evaluated and returned as the result of the function. The result of the expression is automatically returned by the lambda function, so there’s no need to use the `return` statement.

For example, the following code defines a lambda function that takes two arguments and returns their sum:

```python
f = lambda x, y: x + y
```

We can then call the function `f` like any other function, passing in the required arguments:

```python
result = f(3, 4)
print(result) # Output: 7
```

Lambda functions can be used in many contexts where a small, one-time-use function is needed, such as in the `map()`, `filter()`, and `reduce()` functions, or as a key function in the `sorted()` function.
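Unlike `map()` and `filter()`, `reduce()` is not a built-in in Python 3 and has to be imported from `functools`. A minimal sketch of a lambda used with it (the list and the argument name `acc` are chosen only for illustration):

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# reduce() folds the list into a single value by applying the lambda
# to a running result and the next element: ((((1*2)*3)*4)*5)
product = reduce(lambda acc, x: acc * x, numbers)
print(product) # Output: 120
```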

Let’s look at some more examples:

- Multiply two numbers:

```python
multiply = lambda x, y: x * y
result = multiply(3, 4)
print(result) # Output: 12
```

- Get the length of a string:

```python
string_length = lambda s: len(s)
result = string_length("hello world")
print(result) # Output: 11
```

- Convert a list of integers to their corresponding square values:

```python
numbers = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x**2, numbers))
print(squares) # Output: [1, 4, 9, 16, 25]
```

- Filter out even numbers from a list:

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
print(even_numbers) # Output: [2, 4, 6, 8, 10]
```

- Sort a list of strings based on their length:

```python
fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']
sorted_fruits = sorted(fruits, key=lambda x: len(x))
print(sorted_fruits) # Output: ['date', 'apple', 'banana', 'cherry', 'elderberry']
```

Note that in the `map()`, `filter()`, and `sorted()` examples, we define a lambda function on the fly and use it as needed, without assigning it a name. This is one of the key benefits of lambda functions, as they allow us to write concise and readable code without cluttering the namespace with unnecessary function names.

Here are some complex examples that demonstrate how lambda functions can be used in real-world scenarios:

- Sorting a list of dictionaries based on a specific key

```python
people = [
    {'name': 'Alice', 'age': 25, 'occupation': 'Engineer'},
    {'name': 'Bob', 'age': 30, 'occupation': 'Manager'},
    {'name': 'Charlie', 'age': 22, 'occupation': 'Intern'},
    {'name': 'Dave', 'age': 27, 'occupation': 'Designer'},
]
sorted_people = sorted(people, key=lambda x: x['age'])
print(sorted_people)
```

Output:

```
[{'name': 'Charlie', 'age': 22, 'occupation': 'Intern'}, {'name': 'Alice', 'age': 25, 'occupation': 'Engineer'}, {'name': 'Dave', 'age': 27, 'occupation': 'Designer'}, {'name': 'Bob', 'age': 30, 'occupation': 'Manager'}]
```

- Filtering a list of files based on their extension

```python
files = ['document.txt', 'picture.jpg', 'report.pdf', 'notes.txt', 'data.csv']
text_files = list(filter(lambda x: x.endswith('.txt'), files))
print(text_files)
```

Output:

```
['document.txt', 'notes.txt']
```


The post All You Need To Know About ChatGP appeared first on Krish Naik.

ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot launched by OpenAI in November 2022. It is built on top of OpenAI’s GPT-3.5 family of large language models, and is fine-tuned with both supervised and reinforcement learning techniques.

ChatGPT was launched as a prototype on November 30, 2022, and quickly garnered attention for its detailed responses and articulate answers across many domains of knowledge. Its uneven factual accuracy was identified as a significant drawback.

GPT (short for “Generative Pre-trained Transformer”) is a type of language model developed by OpenAI. It is a machine learning model that is trained to generate natural language text that is coherent and sounds like it was written by a human.

There are several steps involved in training a GPT model:

1. Collect and preprocess a large dataset of text. This can be done by web scraping, using a publicly available dataset, or creating your own dataset. The text should be cleaned and normalized to make it easier for the model to process.

2. Choose a model architecture and set the hyperparameters. GPT models are based on the transformer architecture, and there are many choices to be made when setting up the model, such as the number of layers, the size of the hidden state, and the type of attention mechanism to use.

3. Train the model on the dataset. This involves feeding the text data to the model and optimizing the model’s parameters to minimize the loss function. This can be done using a variety of optimization algorithms, such as Adam or SGD.

4. Evaluate the model on a held-out test set. This will give you an idea of how well the model is able to generalize to unseen data.

5. Fine-tune the model for a specific task. Once the model is trained on a large dataset, it can be fine-tuned for a specific task, such as translation or language generation, by training it on a smaller dataset specific to that task.


The post Introduction And Roadmap To Learn Natural Language Processing appeared first on Krish Naik.

In this video and blog we are going to discuss the roadmap to learn NLP. This blog gives you an idea of the topics and techniques you need to know to become an NLP engineer.

Below are the complete handwritten notes to help you understand the introduction and roadmap to learn NLP.



The post Understanding All Optimizers In Deep Learning appeared first on Krish Naik.

Many people use optimizers while training a neural network without realizing that the method is known as optimization. Optimizers are algorithms or methods used to change the attributes of your neural network, such as the weights and the learning rate, in order to reduce the loss.

The optimizer you use defines how the weights or the learning rate of your neural network should change to reduce the loss.

Optimization algorithms or strategies are responsible for reducing the loss and providing the most accurate results possible.

In this blog, we learn about different types of optimization algorithms and their advantages.

Gradient descent is an iterative optimization algorithm used in machine learning to reduce the cost function. This helps models make accurate predictions.

The gradient points in the direction of steepest increase. As we want to find the minimum point in the valley, we need to go in the opposite direction of the gradient. We update the parameters in the negative gradient direction to minimize the loss.
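The update rule described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `gradient_descent`, the toy loss, and the learning rate are all made up for the example.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    for _ in range(steps):
        # Move in the negative gradient direction
        theta = theta - learning_rate * grad(theta)
    return theta

# Hypothetical loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
print(round(theta_min, 4))
```

Starting from 0, the parameter moves towards 3, the bottom of this particular valley.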

Advantages:

a. Easy Computation.

b. Easy to Implement

c. Easy to understand

Disadvantages:

a. May get trapped at local minima.

b. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is too large, it may take a very long time to converge to the minimum.

Different types of Gradient descent are:

1. Batch Gradient Descent

2. Stochastic Gradient Descent

3. Mini batch Gradient Descent

In batch gradient descent, we use the entire dataset to compute the gradient of the cost function for each iteration of gradient descent and then update the weights.

Since we use the entire dataset to compute the gradient, convergence is slow.

If the dataset is huge and contains millions or billions of data points, then it is both memory and computationally intensive.

Advantages of Batch Gradient Descent

- Theoretical analysis of weights and convergence rates are easy to understand

Disadvantages of Batch Gradient Descent

- Performs redundant computations for similar training examples in large datasets
- Can be very slow and intractable, as large datasets may not fit in memory
- As we take the entire dataset for each computation, we cannot update the weights of the model online for new data.

Stochastic gradient descent (SGD) is a variant of gradient descent. It tries to update the model’s parameters more frequently: the parameters are altered after the loss is computed on each training example. So, if the dataset contains 1000 rows, SGD will update the model parameters 1000 times in one pass over the dataset, instead of once as in batch gradient descent.

$\theta = \theta - \alpha \cdot \nabla J(\theta; x^{(i)}; y^{(i)})$, where $\{x^{(i)}, y^{(i)}\}$ are the training examples.

As the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity.

Advantages of SGD:

- Frequent updates of the model parameters, hence it converges in less time.
- Requires less memory, as there is no need to store the values of the loss function.
- May find new minima.

Disadvantages:

- High variance in the model parameters.
- May overshoot even after reaching the global minimum.
- To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.

Mini-batch gradient descent is a variation of gradient descent where the batch size is more than one example and less than the total dataset. It is widely used, converges faster, and is more stable. The batch size can vary depending on the dataset. As we take a batch with different samples, it reduces the noise, that is, the variance of the weight updates, and this helps to achieve more stable and faster convergence.
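The three variants differ only in how many examples feed each update. The sketch below makes that concrete on a toy one-parameter regression; the `train` function, the data, and the learning rate are assumptions for illustration, not part of any library.

```python
import random

def train(xs, ys, batch_size, learning_rate=0.005, epochs=50, seed=0):
    """Fit y ~ w * x with gradient descent on mean squared error.

    batch_size == len(xs) -> batch gradient descent (1 update per epoch)
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean of (w*x - y)^2 over the current batch
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates

xs = [i / 10 for i in range(1, 101)]   # toy inputs 0.1 .. 10.0
ys = [2 * x for x in xs]               # toy targets with true w = 2

for name, size in [("batch", 100), ("mini-batch", 10), ("SGD", 1)]:
    w, updates = train(xs, ys, batch_size=size)
    print(f"{name:10s} w={w:.4f} updates={updates}")
```

With 100 rows and 50 epochs, batch gradient descent makes 50 updates, mini-batch (size 10) makes 500, and SGD makes 5000.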

Gradient descent alone cannot fulfill our thirst; this is where the optimizer comes in. Optimizers shape and mold our model into its most accurate possible form by updating the weights. The loss function guides the optimizer by telling it whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

Advantages of mini-batch gradient descent:

- Frequently updates the model parameters and also has less variance.
- Requires a medium amount of memory.

All types of gradient descent have some challenges:

- Choosing an optimum value of the learning rate. If the learning rate is too small, then gradient descent may take ages to converge.
- Having a constant learning rate for all the parameters. There may be some parameters that we do not want to change at the same rate.
- May get trapped at local minima.

It’s difficult to overstate how popular gradient descent really is, and it’s used across the board even up to complex neural net architectures (backpropagation is basically gradient descent implemented on a network). There are other types of optimizers based on gradient descent that are used though, and here are a few of them:

Another optimization strategy is called AdaGrad. The idea is that you keep the running sum of squared gradients during optimization. In this case, we have no momentum term, but an expression g that is the sum of the squared gradients.

When we update a weight parameter, we divide the current gradient by the root of that term g. To explain the intuition behind AdaGrad, imagine a loss function in a two-dimensional space in which the gradient of the loss function in one direction is very small and very high in the other direction.

Summing up the gradients along the axis where the gradients are small causes the squared sum of these gradients to become even smaller. If during the update step, we divide the current gradient by a very small sum of squared gradients g, the result of that division becomes very high and vice versa for the other axis with high gradient values.

As a result, we force the algorithm to make updates in any direction with the same proportions.

This means that we accelerate the update process along the axis with small gradients by increasing the gradient along that axis. On the other hand, the updates along the axis with the large gradient slow down a bit.

However, there is a problem with this optimization algorithm. Imagine what would happen to the sum of the squared gradients when training takes a long time. Over time, this term would get bigger. If the current gradient is divided by this large number, the update step for the weights becomes very small. It is as if we were using a very low learning rate that becomes even lower the longer the training goes on. In the worst case, we would get stuck with AdaGrad and the training would go on forever.
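A minimal sketch of the AdaGrad update described above, for a plain list of parameters. The function name `adagrad_step`, the small constant `eps` (added to avoid division by zero), and the toy loss are assumptions for illustration.

```python
import math

def adagrad_step(theta, grad, g_sum, learning_rate=0.1, eps=1e-8):
    """One AdaGrad update for a list of parameters."""
    new_theta, new_g_sum = [], []
    for t, g, s in zip(theta, grad, g_sum):
        s = s + g * g                      # running sum of squared gradients
        t = t - learning_rate * g / (math.sqrt(s) + eps)
        new_theta.append(t)
        new_g_sum.append(s)
    return new_theta, new_g_sum

# Hypothetical loss J = 0.5 * (100 * a^2 + 0.01 * b^2): steep in a, flat in b
theta, g_sum = [1.0, 1.0], [0.0, 0.0]
for step in range(1, 4):
    grad = [100 * theta[0], 0.01 * theta[1]]
    theta, g_sum = adagrad_step(theta, grad, g_sum)
    print(step, [round(t, 4) for t in theta])
```

On the first step both parameters move by about 0.1, even though their gradients differ by a factor of 10,000. The later steps shrink as `g_sum` grows, which is the decaying-step problem described above.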

Advantages:

- Learning rate changes for each training parameter.
- Don’t need to manually tune the learning rate.
- Able to train on sparse data.

Disadvantages:

- Computationally more expensive than plain gradient descent, as it needs to accumulate and take the square root of the squared gradients for every parameter.
- The learning rate is always decreasing, which results in slow training.

AdaDelta is an extension of AdaGrad that tends to remove its decaying learning rate problem. Instead of accumulating all previous squared gradients, AdaDelta limits the window of accumulated past gradients to some fixed size w. To do this, an exponentially moving average is used rather than the sum of all the squared gradients.

$E[g^2](t) = \gamma \, E[g^2](t-1) + (1-\gamma) \, g^2(t)$

We set γ to a similar value as the momentum term, around 0.9.
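To see why the moving average avoids AdaGrad’s shrinking steps, the sketch below compares the two accumulators under a constant gradient. The constant gradient and the helper name `running_average` are assumptions chosen only to make the contrast visible.

```python
def running_average(prev_avg, grad, gamma=0.9):
    """E[g^2](t) = gamma * E[g^2](t-1) + (1 - gamma) * g(t)^2"""
    return gamma * prev_avg + (1 - gamma) * grad ** 2

adagrad_sum, avg = 0.0, 0.0
for t in range(1000):
    g = 1.0                         # constant gradient, purely for illustration
    adagrad_sum += g ** 2           # AdaGrad: grows without bound
    avg = running_average(avg, g)   # AdaDelta: settles near g^2

print(adagrad_sum, round(avg, 4))
```

The AdaGrad sum keeps growing with the number of steps, while the decaying average stays close to the recent squared gradient, so the effective step size does not vanish.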

Advantages:

- Now the learning rate does not decay and the training does not stop.

Disadvantages:

- Computationally expensive.

Adam (Adaptive Moment Estimation) works with moments of the first and second order. The intuition behind Adam is that we don’t want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients V(t) like AdaDelta, Adam also keeps an exponentially decaying average of past gradients M(t).

M(t) and V(t) are estimates of the first moment, which is the mean, and the second moment, which is the uncentered variance, of the gradients respectively:

$M(t) = \beta_1 M(t-1) + (1-\beta_1) \, g(t)$

$V(t) = \beta_2 V(t-1) + (1-\beta_2) \, g^2(t)$

Because M(t) and V(t) start at zero, they are biased towards zero, so we use bias-corrected estimates so that E[M̂(t)] is equal to E[g(t)], where E[f(x)] is the expected value of f(x):

$\hat{M}(t) = \frac{M(t)}{1-\beta_1^t}, \quad \hat{V}(t) = \frac{V(t)}{1-\beta_2^t}$

To update the parameter:

$\theta(t+1) = \theta(t) - \frac{\alpha}{\sqrt{\hat{V}(t)} + \epsilon} \, \hat{M}(t)$

The default values are 0.9 for β1, 0.999 for β2, and $10^{-8}$ for ϵ.
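The pieces of Adam can be put together in a short sketch for a single scalar parameter. This follows the standard published form of Adam; the function name `adam_step`, the toy loss, and the learning rate of 0.01 used in the loop are assumptions for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at time step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad          # first moment M(t)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment V(t)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected M(t)
    v_hat = v / (1 - beta2 ** t)                # bias-corrected V(t)
    theta = theta - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical loss J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3)
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * (theta - 3), m, v, t, learning_rate=0.01)
print(round(theta, 3))
```

Note that on the very first step the bias correction makes the update size almost exactly equal to the learning rate, regardless of how large the gradient is.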

Advantages:

- The method is fast and converges rapidly.
- Rectifies the vanishing learning rate and high variance.

Disadvantages:

- Computationally costly.

Finally, we can discuss the question of what the best gradient descent algorithm is.

In general, a normal gradient descent algorithm is more than adequate for simpler tasks. If you are not satisfied with the accuracy of your model you can try out RMSprop or add a momentum term to your gradient descent algorithms.

But in my experience the best optimization algorithm for neural networks out there is Adam. This optimization algorithm works very well for almost any deep learning problem you will ever encounter. Especially if you set the hyperparameters to the following values:

- β1=0.9
- β2=0.999
- Learning rate = 0.001–0.0001

… this would be a very good starting point for any problem and virtually every type of neural network architecture I’ve ever worked with.

That’s why Adam Optimizer is my default optimization algorithm for every problem I want to solve. Only in very few cases do I switch to other optimization algorithms that I introduced earlier.

In this sense, I recommend that you always start with the Adam optimizer, regardless of the architecture of the neural network or the problem domain you are dealing with.

Adam is the best optimizer. If you want to train the neural network in less time and more efficiently, then Adam is the optimizer to use.

For sparse data, use optimizers with a dynamic learning rate.

If you want to use a gradient descent algorithm, then mini-batch gradient descent is the best option.

I hope you liked the article and that it gave you a good intuition for the different behaviors of different optimization algorithms.



The post Day 3- Data Science Interview Prepartion appeared first on Krish Naik.

**Day 3:** This post will help you to prepare for a Data Science interview (**30 days of Interview Preparation**) by covering everything from the fundamentals to the more advanced levels of job interview questions and answers. Let’s take a look:

- **Heteroscedasticity, Multicollinearity**
- **Market basket analysis, Association Analysis**
- **KNN Classifier, Principal Component Analysis (PCA)**
- **T-SNE, Mean Absolute Error**
- **Long data vs wide data**
- **Normalization vs Standardization**

**Q1. How do you treat heteroscedasticity in regression?**

**Answer:**

Heteroscedasticity means unequal scatter. In regression analysis, we generally talk about heteroscedasticity in the context of the error term. Heteroscedasticity is a systematic change in the spread of the residuals or errors over the range of measured values. It is a problem because *ordinary least squares (OLS) regression* assumes that all residuals are drawn from a population that has a constant variance.

**What causes Heteroscedasticity?**

Heteroscedasticity occurs more often in datasets, where we have a large range between the largest and the smallest observed values. There are many reasons why heteroscedasticity can exist, and a generic explanation is that the error variance changes proportionally with a factor.

We can categorize Heteroscedasticity into two general types:-

**Pure heteroscedasticity:** It refers to cases where we specify the correct model and yet still observe non-constant variance in the residual plots.

**Impure heteroscedasticity:** It refers to cases where you incorrectly specify the model, and that causes the non-constant variance. When you leave an important variable out of a model, the omitted effect is absorbed into the error term. If the effect of the omitted variable varies throughout the observed range of data, it can produce the telltale signs of heteroscedasticity in the residual plots.

*How to Fix Heteroscedasticity*

**Redefining the variables:**

If your model is a cross-sectional model that includes large differences between the sizes of the observations, you can find different ways to specify the model that reduces the impact of the size differential. To do this, change the model from using the raw measure to using rates and per capita values. Of course, this type of model answers a slightly different kind of question. You’ll need to determine whether this approach is suitable for both your data and what you need to learn.

**Weighted regression:**

It is a method that assigns each data point a weight based on the variance of its fitted value. The idea is to give small weights to observations associated with higher variances to shrink their squared residuals. Weighted regression minimizes the sum of the weighted squared residuals. When you use the correct weights, heteroscedasticity is replaced by homoscedasticity.
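As a rough sketch of weighted regression for a single predictor, the function below minimizes the weighted sum of squared residuals in closed form. The function name, the toy data, and the choice of weights (1 / x², which assumes the error variance grows with x²) are assumptions for illustration; in practice the weights come from what you know or estimate about the variance.

```python
def weighted_least_squares(xs, ys, weights):
    """Fit y = intercept + slope * x by minimizing sum(w * residual^2)."""
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Toy data where the spread of y grows with x (heteroscedastic)
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.3, 7.4, 11.0, 10.5]
# Assumed weights: noisier points (large x) count less
weights = [1 / x ** 2 for x in xs]

intercept, slope = weighted_least_squares(xs, ys, weights)
print(round(intercept, 3), round(slope, 3))
```

Setting every weight to 1 recovers ordinary least squares, which is a handy sanity check.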

**Q2. What is multicollinearity, and how do you treat it?**

**Answer:**

Multicollinearity means that the independent variables are highly correlated with each other. In regression analysis, it is an important assumption that the regression model should not suffer from multicollinearity.

If two explanatory variables are highly correlated, it is hard to tell which one affects the dependent variable. Let’s say Y is regressed against X1 and X2, where X1 and X2 are highly correlated. Then the effect of X1 on Y is hard to distinguish from the effect of X2 on Y, because any increase in X1 tends to be associated with an increase in X2.

Another way to look at the multicollinearity problem is that individual t-test P-values can be misleading. A P-value can be high, suggesting that the variable is not important, even though the variable is important.

**Correcting Multicollinearity:**

1) Remove one of the highly correlated independent variables from the model. If you have two or more factors with a high VIF, remove one from the model.

2) Principal Component Analysis (PCA) – It cuts the number of interdependent variables to a smaller set of uncorrelated components. Instead of using highly correlated variables, use components in the model that have an eigenvalue greater than 1.

3) Run PROC VARCLUS (a SAS procedure) and choose the variable that has the minimum (1 − R²) ratio within a cluster.

4) Ridge Regression – It is a technique for analyzing multiple regression data that suffer from multicollinearity.

5) If you include an interaction term (the product of two independent variables), you can also reduce multicollinearity by “centering” the variables. “Centering” means subtracting the mean from the values of the independent variable before creating the products.

When is multicollinearity not a problem?

1) If your goal is to predict Y from a set of X variables, then multicollinearity is not a problem. The predictions will still be accurate, and the overall R² (or adjusted R²) quantifies how well the model predicts the Y values.

2) When the correlated variables are multiple dummy (binary) variables that represent a categorical variable with three or more categories.

**Q3. What is market basket analysis? How would you do it in Python?**

**Answer:**

**Market basket analysis** is the study of items that are purchased or grouped in a single transaction or multiple, sequential transactions. Understanding the relationships and the strength of those relationships is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons, etc.

**Market Basket Analysis** is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

**Q4. What is Association Analysis? Where is it used?**

**Answer:**

*Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction.*

The technique of association rules is widely used for retail basket analysis. It can also be used for classification by using rules with class labels on the right-hand side. It is even used for outlier detection with rules indicating infrequent/abnormal association.

Association analysis also helps us to identify cross-selling opportunities, for example, we can use the rules resulting from the analysis to place associated products together in a catalog, in the supermarket, or the Webshop, or apply them when targeting a marketing campaign for product B at customers who have already purchased product A.

Association rules are given in the form below:

A => B [Support, Confidence]

The part before => is referred to as the if part (antecedent) and the part after => is referred to as the then part (consequent), where A and B are disjoint sets of items in the transaction data.

Computer => Anti-virus Software [Support = 20%, Confidence = 60%]

The above rule says:

1. 20% of transactions show that Anti-virus software is bought together with a Computer

2. 60% of customers who purchase a Computer also purchase Anti-virus software

An example of association rules:

Assume there are 100 customers. 10 of them bought milk, 8 bought butter, and 6 bought both. For the rule bought milk => bought butter:

1. support = P(Milk & Butter) = 6/100 = 0.06

2. confidence = support/P(Milk) = 0.06/0.10 = 0.60

3. lift = confidence/P(Butter) = 0.60/0.08 = 7.5
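In Python, the three metrics for a rule A => B can be computed directly from counts. A small sketch (the function name `rule_metrics` and its parameters are made up for illustration): confidence conditions on the antecedent A, and lift divides that by how common the consequent B is overall.

```python
def rule_metrics(n_total, n_a, n_b, n_both):
    """Support, confidence and lift for the rule A => B."""
    support = n_both / n_total
    confidence = n_both / n_a                 # P(B | A)
    lift = confidence / (n_b / n_total)       # P(B | A) / P(B)
    return support, confidence, lift

# Milk => Butter: 100 customers, 10 bought milk, 8 bought butter, 6 bought both
support, confidence, lift = rule_metrics(n_total=100, n_a=10, n_b=8, n_both=6)
print(round(support, 2), round(confidence, 2), round(lift, 2))  # 0.06 0.6 7.5
```

A lift well above 1, as here, suggests the two items are bought together far more often than they would be if purchases were independent.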

**Q5. What is KNN Classifier ?**

**Answer:**

KNN means K-Nearest Neighbour Algorithm. It can be used for both classification and regression.

It is one of the simplest machine learning algorithms. It is also known as a lazy learner because it does not create a generalized model during training; the actual work happens in the testing phase, which makes testing costly in terms of time and money. It is also called instance-based or memory-based learning.

In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is assigned to the class of that single nearest neighbor.

In **k-NN** regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.

The three common distance measures (Euclidean, Manhattan, and Minkowski) are only valid for continuous variables. For categorical variables, the Hamming distance must be used.

**How to choose the value of K:** K is a hyperparameter that needs to be chosen when building the model.

A small number of neighbors gives the most flexible fit, which has low bias but high variance, while a large number of neighbors gives a smoother decision boundary, which means lower variance but higher bias.

We should choose an odd number if the number of classes is even. The most common values are said to be 3 and 5.
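The whole classifier fits in a few lines, which shows why it is called lazy: nothing is learned up front, and every prediction scans the training data. This is a minimal sketch using Euclidean distance; the function name `knn_predict` and the toy points are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a plurality vote among its k nearest neighbors."""
    # All the work happens here, at prediction time
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data, made up for illustration
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2, 2)))  # Output: A
print(knn_predict(points, labels, query=(6, 5)))  # Output: B
```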

**Q6. What is Pipeline in sklearn ?**

**Answer:**

A pipeline chains several steps together once the initial exploration is done. For example, some code is meant to transform features (normalize numeric values, turn text into vectors, or fill in missing data); these steps are transformers. Other code is meant to predict variables by fitting an algorithm, such as a random forest or a support vector machine; these are estimators. A pipeline chains all of these together so that they can be applied to the training data as a block.

Example of a pipeline that imputes missing data with the most frequent value of each column and then fits a decision tree classifier:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# SimpleImputer replaces the older Imputer class, which was removed from scikit-learn
steps = [('imputation', SimpleImputer(strategy='most_frequent')),
         ('clf', DecisionTreeClassifier())]
pipeline = Pipeline(steps)
# X_train, y_train: your training features and labels
clf = pipeline.fit(X_train, y_train)
```

Instead of fitting a single model, the pipeline can be built in a loop over several models to find the best one.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = [KNeighborsClassifier(5), RandomForestClassifier(), GradientBoostingClassifier()]
for clf in classifiers:
    steps = [('imputation', SimpleImputer(strategy='most_frequent')),
             ('clf', clf)]
    pipeline = Pipeline(steps)
```

The pipeline itself can also be used as an estimator and passed to cross-validation or grid search.

```python
from sklearn.model_selection import KFold, cross_val_score

seed = 7  # any fixed integer, for reproducibility
# random_state only takes effect when shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X_train, y_train, cv=kfold)
print(results.mean())
```

**Q7. What is Principal Component Analysis (PCA), and why do we use it?**

**Answer:**

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining the variation present in the dataset, up to the maximum extent. The same is done by transforming the variables to a new set of variables, which are known as the principal components (or simply, the PCs) and are orthogonal, ordered such that the retention of variation present in the original variables decreases as we move down in the order. So, in this way, the 1st principal component retains maximum variation that was present in the original components. The principal components are the eigenvectors of a covariance matrix, and hence they are orthogonal.

Main important points to be considered:

1. Normalize the data

2. Calculate the covariance matrix

3. Calculate the eigenvalues and eigenvectors

4. Choosing components and forming a feature vector

5. Forming Principal Components
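The five steps can be sketched in plain Python for the two-feature case, where the eigenvalues of the 2x2 covariance matrix have a closed form. This is only an illustration under stated assumptions: the data values are made up, step 1 only centers the data (a full normalization would also divide by the standard deviation), and the eigenvector formula assumes the two features have non-zero covariance.

```python
import math

# Hypothetical toy data: two correlated features.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
n = len(data)

# 1. Normalize: here we only center each feature on its mean.
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centered = [(x - mean_x, y - mean_y) for x, y in data]

# 2. Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# 3. Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (sxx + syy) / 2
root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lambda1, lambda2 = half_trace + root, half_trace - root

# Eigenvector for the largest eigenvalue, scaled to unit length
# (valid when sxy != 0).
vx, vy = sxy, lambda1 - sxx
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# 4. Choose components: keep only the first one as the feature vector.
explained = lambda1 / (lambda1 + lambda2)

# 5. Form the principal component scores by projecting the centered data.
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
```

The variance of `scores` equals `lambda1`, which is what "the first principal component retains maximum variation" means in practice.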

**Q8. What is t-SNE?**

**Answer:**

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two or three dimensions suitable for human observation. With the help of t-SNE, you may need fewer exploratory data analysis plots the next time you work with high-dimensional data.

**Q9. VIF (Variance Inflation Factor), Weight of Evidence & Information Value. Why and when to use?**

**Answer:**

**Variance Inflation Factor**

It provides an index that measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.

VIF_j = 1 / (1 − R_j²)

where R_j² is the multiple R² for the regression of Xj on the other independent variables (a regression that does not involve the dependent variable Y).

A common rule of thumb is that VIF > 5 indicates a problem with multicollinearity.

**Understanding VIF**

If the variance inflation factor of a predictor variable is 5 this means that variance for the coefficient of that predictor variable is 5 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.

In other words, if the variance inflation factor of a predictor variable is 5, this means that the standard error for the coefficient of that predictor variable is about 2.24 times (√5 ≈ 2.24) as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
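As a small illustration of the VIF formula, the sketch below handles the special case of exactly two predictors, where the R² of regressing one predictor on the other is simply their squared correlation. The helper names and the data are made up for this example.

```python
import math
from statistics import mean

def r_squared_simple(x, y):
    """R^2 of a one-predictor least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def vif_from_r2(r2):
    """VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r2)

# Hypothetical predictors; x2 is clearly correlated with x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]

vif = vif_from_r2(r_squared_simple(x2, x1))
se_inflation = math.sqrt(vif)  # factor by which the standard error grows
```

With more than two predictors, R² would come from a multiple regression of Xj on all the other predictors instead.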

**Weight of evidence (WOE) and information value (IV)** are simple, yet powerful techniques to perform variable transformation and selection.

The formula to create WOE and IV is

Here is a simple table that shows how to calculate these values.

The IV value can be used to select variables quickly.
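For reference, one widely used convention (an assumption here, since sign conventions differ between sources) defines WOE for a bin as ln(% of goods / % of bads) and IV as the sum over bins of (% of goods − % of bads) × WOE. A minimal sketch with made-up bin counts, assuming no bin has zero goods or zero bads:

```python
import math

# Hypothetical bins of one variable: (label, number of goods, number of bads).
bins = [("low", 100, 10), ("medium", 200, 40), ("high", 100, 50)]
total_good = sum(good for _, good, _ in bins)
total_bad = sum(bad for _, _, bad in bins)

rows = []
iv = 0.0
for label, good, bad in bins:
    pct_good = good / total_good   # share of all goods that fall in this bin
    pct_bad = bad / total_bad      # share of all bads that fall in this bin
    woe = math.log(pct_good / pct_bad)
    contribution = (pct_good - pct_bad) * woe
    iv += contribution
    rows.append((label, woe, contribution))
```

Each bin's contribution is non-negative, so a larger IV means the variable separates goods from bads more strongly.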

**Q10: How to evaluate that data does not have any outliers?**

**Answer:**

In statistics, outliers are data points that don’t belong to a certain population. An outlier is an abnormal observation that lies far away from other values and diverges from otherwise well-structured data.

**Detection:**

**Method 1 – Standard Deviation**: If a data distribution is approximately normal, then about 68% of the data values lie within one standard deviation of the mean, about 95% are within two standard deviations, and about 99.7% lie within three standard deviations.

Therefore, any data point that lies more than three standard deviations away from the mean is very likely to be anomalous or an outlier.

**Method 2 — Boxplots**: Box plots are a graphical depiction of numerical data through their quantiles. It is a very simple but effective way to visualize outliers. Think about the lower and upper whiskers as the boundaries of the data distribution. Any data points that show above or below the whiskers can be considered outliers or anomalous.

**Method 3 – Violin Plots:** Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator. Typically a violin plot will include all the data that is in a box plot: a marker for the median of the data, a box or marker indicating the interquartile range,and possibly all sample points if the number of samples is not too high.

**Method 4 – Scatter Plots:** A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

The points that are very far away from the general spread of the data and have very few neighbors are considered outliers.
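Methods 1 and 2 can also be applied numerically. The sketch below uses made-up values; the 1.5 × IQR whisker rule is the usual box plot convention and is an assumption here, since the text does not state how the whiskers are defined.

```python
import statistics

# Hypothetical measurements with one suspicious value at the end.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 11, 95]

# Method 1: flag points more than three standard deviations from the mean.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
z_outliers = [v for v in values if abs(v - mu) > 3 * sigma]

# Method 2: flag points beyond the box plot whiskers (1.5 * IQR rule).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]
```

The IQR rule is less sensitive to the outlier itself, because a single extreme value inflates the standard deviation but barely moves the quartiles.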

**Q11: What do you do if there are outliers?**

**Answer:**

The following approaches can be used to handle outliers:

1. Drop the outlier records.

2. Assign a new value: If an outlier seems to be due to a mistake in your data, you can try imputing a value.

3. If the outliers are a small percentage of the data but still numerous in absolute terms, dropping them might cause a loss of insight. In that case, we should group them and run our analysis separately on them.

**Q12: What are the encoding techniques you have applied, with examples?**

**Answer:**

In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values. Since machine learning is based on mathematical equations, keeping categorical variables as they are would cause a problem.

Let’s consider the following dataset of fruit names and their weights.

Some of the common encoding techniques are:

**Label encoding:** In label encoding, we map each category to a number or a label. The labels chosen for the categories have no relationship. So categories that have some ties or are close to each other lose such information after encoding.

**One-hot encoding:** In this method, we map each category to a vector that contains 1 and 0, denoting the presence or absence of the feature. The number of vectors depends on the categories which we want to keep. For high-cardinality features, this method produces a lot of columns, which slows down the learning significantly.
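Both techniques can be shown on a tiny made-up list of fruit names. This is a plain-Python sketch rather than a library call, and the alphabetical ordering of the labels is an arbitrary choice made for this example.

```python
# Hypothetical categorical column.
fruits = ["apple", "banana", "cherry", "banana", "apple"]

# Label encoding: map each category to an integer.
categories = sorted(set(fruits))
label_map = {category: index for index, category in enumerate(categories)}
label_encoded = [label_map[fruit] for fruit in fruits]

# One-hot encoding: one 0/1 column per category.
one_hot = [[1 if fruit == category else 0 for category in categories]
           for fruit in fruits]
```

Note that the integers from label encoding imply an order (apple < banana < cherry) that does not exist in the data, which is one reason one-hot encoding is often preferred for nominal features.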

**Q13: Tradeoff between bias and variances, the relationship between them.**

**Answer:**

Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance).

The prediction error for any machine learning algorithm can be broken down into three parts:

**Bias Error**

**Variance Error**

**Irreducible Error**

The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.

**Bias:** Bias means that the model favors one result more than the others. It comes from the simplifying assumptions made by a model to make the target function easier to learn. A model with high bias pays very little attention to the training data and oversimplifies the problem. This leads to high error on both training and test data.

**Variance:** Variance is the amount by which the estimate of the target function would change if different training data were used. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

So, the end goal is to come up with a model that balances both Bias and Variance. This is called Bias Variance Trade-off. To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.

**Q14: What is the difference between Type 1 and Type 2 error and severity of the error?**

**Answer:**

**Type I Error**

A Type I error is often referred to as a “false positive” and is the incorrect rejection of the true null hypothesis in favor of the alternative.

Take an HIV test as an example. The null hypothesis refers to the natural state of things or the absence of the tested effect or phenomenon, i.e., stating that the patient is HIV negative. The alternative hypothesis states that the patient is HIV positive. Many medical tests will have the disease they are testing for as the alternative hypothesis and the lack of that disease as the null hypothesis.

A Type I error would thus occur when the patient doesn’t have the virus, but the test shows that they do. In other words, the test incorrectly rejects the true null hypothesis that the patient is HIV negative.

**Type II Error**

A Type II error is the inverse of a Type I error and is the false acceptance of a null hypothesis that is not true, i.e., a false negative. A Type II error would entail the test telling the patient they are free of HIV when they are not.

Considering this HIV example, which error type do you think is more acceptable? In other words, would you rather have a test that was more prone to Type I or Type II error? With HIV, the momentary stress of a false positive is likely better than feeling relieved at a false negative and then failing to take steps to treat the disease. Pregnancy tests, blood tests, and any diagnostic tool that has serious consequences for the health of a patient are usually overly sensitive for this reason: they should err on the side of a false positive.

But in most fields of science, Type II errors are seen as less serious than Type I errors. With the Type II error, a chance to reject the null hypothesis was lost, and no conclusion is inferred from a non-rejected null. But the Type I error is more serious because you have wrongly rejected the null hypothesis and ultimately made a claim that is not true. In science, finding a phenomenon where there is none is more egregious than failing to find a phenomenon where there is.

**Q15: What is binomial distribution and polynomial distribution?**

**Answer:**

**Binomial Distribution:** A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes (the prefix “bi” means two, or twice). For example, a coin toss has only two possible outcomes: heads or tails, and taking a test could have two possible outcomes: pass or fail.

**Multinomial/Polynomial Distribution:** Multi or Poly means many. In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts of each side for rolling a k-sided die n times. For n independent trials, each of which leads to success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
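Both probability mass functions are short enough to write out directly. In this sketch the function names are chosen for the example; the formulas are the standard ones, with the multinomial coefficient n! / (c1! c2! ... ck!).

```python
from math import comb, factorial, prod

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def multinomial_pmf(counts, probs):
    """Probability of observing the given count for each of k categories."""
    coefficient = factorial(sum(counts))
    for count in counts:
        coefficient //= factorial(count)
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

two_heads_in_four = binomial_pmf(2, 4, 0.5)
one_of_each_side = multinomial_pmf([1, 1, 1], [1 / 3, 1 / 3, 1 / 3])
```

With only two categories the multinomial reduces to the binomial: `multinomial_pmf([2, 2], [0.5, 0.5])` gives the same value as `binomial_pmf(2, 4, 0.5)`.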

**Q16: What is the Mean Median Mode standard deviation for the sample and population?**

**Answer:**

**Mean:** The arithmetic mean, also called the average, is the quantity obtained by summing two or more numbers/variables and then dividing the sum by the number of numbers/variables.

**Mode:** The mode is also one of the ways of finding the average. A mode is the number that occurs most frequently in a group of numbers. Some series might not have any mode; some might have two modes, which is called a bimodal series.

In statistics, the three most common ‘averages’ are mean, median, and mode.

**Median:** The median is also a way of finding the average of a group of data points. It’s the middle number of a set of numbers. There are two possibilities: the group can contain an odd number of data points or an even number.

If the group is odd, arrange the numbers in the group from smallest to largest. The median will be the one sitting exactly in the middle, with an equal number on either side of it. If the group is even, arrange the numbers in order, pick the two middle numbers, add them, and then divide by 2. The result is the median of that set.

**Standard Deviation (Sigma):** Standard deviation is a measure of how much your data is spread out. For a population, the squared deviations from the mean are averaged over all N values; for a sample, their sum is divided by n − 1 instead, and the square root is taken in both cases.
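Python's standard `statistics` module has a function for each of these measures, including separate functions for the population and sample standard deviation. The data set below is made up for illustration.

```python
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Population standard deviation: divides the squared deviations by N.
population_sd = statistics.pstdev(data)
# Sample standard deviation: divides by n - 1.
sample_sd = statistics.stdev(data)
```

The sample value is always slightly larger than the population value for the same numbers, and the gap shrinks as the data set grows.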

**Q17: What is Mean Absolute Error ?**

**Answer:-**

Absolute Error is the amount of error in your measurements. It is the difference between the measured value and the “true” value. For example, if a scale states 90 pounds, but you know your true weight is 89 pounds, then the scale has an absolute error of 90 lbs – 89 lbs = 1 lb.

This can be caused by your scale not measuring the exact amount you are trying to measure. For example, your scale may be accurate to the nearest pound. If you weigh 89.6 lbs, the scale may “round up” and give you 90 lbs. In this case the absolute error is 90 lbs – 89.6 lbs = 0.4 lbs.

**Mean Absolute Error:** The Mean Absolute Error (MAE) is the average of all absolute errors. The formula is:

MAE = (1/n) Σ |xi – x|

Where,

n = the number of errors, Σ = summation symbol (which means “add them all up”), |xi – x| = the absolute errors. The formula may look a little daunting, but the steps are easy:

1. Find all of your absolute errors, |xi – x|.

2. Add them all up.

3. Divide by the number of errors. For example, if you had 10 measurements, divide by 10.
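The same three steps in code, as a minimal sketch with made-up scale readings (the function name is chosen for this example):

```python
def mean_absolute_error(measured, true_values):
    # Step 1: absolute error of each measurement.
    errors = [abs(m - t) for m, t in zip(measured, true_values)]
    # Steps 2 and 3: add them up and divide by the number of errors.
    return sum(errors) / len(errors)

# Hypothetical scale readings and the corresponding true weights (lbs).
readings = [90, 150, 61]
true_weights = [89, 152, 60]
mae = mean_absolute_error(readings, true_weights)
```

Here the absolute errors are 1, 2, and 1, so the MAE is 4/3 lbs.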

**Q18: What is the difference between long data and wide data?**

**Answer:**

There are many different ways that you can present the same dataset to the world. Let’s take a look at one of the most important and fundamental distinctions: whether a dataset is wide or long. The difference between wide and long datasets boils down to whether we prefer to have more columns in our dataset or more rows.

**Wide Data:** A dataset that emphasizes putting additional data about a single subject in columns is called a wide dataset because, as we add more columns, the dataset becomes wider.

**Long Data:** Similarly, a dataset that emphasizes including additional data about a subject in rows is called a long dataset because, as we add more rows, the dataset becomes longer. It’s important to point out that there’s nothing inherently good or bad about wide or long data.

In the world of data wrangling, we sometimes need to make a long dataset wider, and we sometimes need to make a wide dataset longer. However, it is true that, as a general rule, data scientists who embrace the concept of tidy data usually prefer longer datasets over wider ones.
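To make the distinction concrete, here is a small sketch that reshapes a made-up wide table (one row per student, one column per subject) into long form (one row per student and subject). The column names are invented for this example.

```python
# Wide: one row per student, one column per subject.
wide = [
    {"student": "A", "math": 90, "physics": 85},
    {"student": "B", "math": 70, "physics": 95},
]

# Long: one row per (student, subject) pair.
subjects = ("math", "physics")
long_rows = [
    {"student": row["student"], "subject": subject, "score": row[subject]}
    for row in wide
    for subject in subjects
]
```

The long form has more rows and fewer columns, and adding a new subject means adding rows rather than changing the table's columns.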

**Q19: What are the data normalization method you have applied, and why?**

**Answer:**

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning. It is required only when features have different ranges.

In simple words, when there are multiple attributes with values on different scales, this may lead to poor data models while performing data mining operations. So they are normalized to bring all the attributes onto the same scale, usually the range [0, 1].

It is not always a good idea to normalize the data, since we might lose information about maximum and minimum values. Sometimes it is a good idea to do so.

*For example,* ML algorithms such as Linear Regression or Support Vector Machines typically converge faster on normalized data. But for algorithms like K-means or K Nearest Neighbours, normalization could be a good or a bad choice depending on the use case, since the distance between the points plays a key role here.

**Types of Normalisation:**

*1. Min-Max normalization:* Each feature is rescaled to a fixed range, usually [0, 1]. In most cases, scaling is applied feature-wise.

*2. Z-score normalization:* In this technique, values are normalized based on the mean and standard deviation of the data:

v’ = (v − Ā) / σA

where v’ and v are the new and old values of each entry in the data, and σA and Ā are the standard deviation and mean of attribute A, respectively.

With standardization (or Z-score normalization), the features are rescaled so that they have a mean of μ = 0 and a standard deviation of σ = 1, as a standard normal distribution does. Here μ is the mean (average) and σ is the standard deviation from the mean; standard scores (also called z scores) of the samples are calculated as follows:

z = (x − μ) / σ

**Q20: What is the difference between normalization and Standardization with example?**

In ML, feature scaling is an important step. The two most discussed scaling methods are **Normalization and Standardization**. Normalization typically means rescaling the values into a range of [0, 1].

An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also called “normalization”, a common cause of ambiguity). In this approach, the data is scaled to a fixed range, usually 0 to 1. Scikit-Learn provides a transformer called MinMaxScaler for this. Min-Max scaling is typically done via the following equation:

**Xnorm = (X − Xmin) / (Xmax − Xmin)**

**Example with sample data: Before Normalization:**

| Attribute | Price in Dollars | Storage Space | Camera |
|---|---|---|---|
| Mobile 1 | 250 | 16 | 12 |
| Mobile 2 | 200 | 16 | 8 |
| Mobile 3 | 300 | 32 | 16 |
| Mobile 4 | 275 | 32 | 8 |
| Mobile 5 | 225 | 16 | 16 |

**After Normalization: (Values range from 0 to 1, as expected)**

| Attribute | Price in Dollars | Storage Space | Camera |
|---|---|---|---|
| Mobile 1 | 0.5 | 0 | 0.5 |
| Mobile 2 | 0 | 0 | 0 |
| Mobile 3 | 1 | 1 | 1 |
| Mobile 4 | 0.75 | 1 | 0 |
| Mobile 5 | 0.25 | 0 | 1 |

Standardization (or Z-score normalization) typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).

Formula: **Z or X_new = (x − μ) / σ**, where μ is the mean (average) and σ is the standard deviation from the mean; the results are called standard scores (or z scores). Scikit-Learn provides a transformer called StandardScaler for standardization.

Example: Let’s take an approximately normally distributed set of numbers: 1, 2, 2, 3, 3, 3, 4, 4, and 5. Its mean is 3, and its sample standard deviation is about 1.22. Subtracting the mean from all data points gives a new data set: -2, -1, -1, 0, 0, 0, 1, 1, and 2. Dividing each data point by the standard deviation gives approximately: -1.63, -0.82, -0.82, 0, 0, 0, 0.82, 0.82, and 1.63.
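Both scalings can be reproduced in a few lines of plain Python. This sketch uses the price column and the number set from the examples; the function names are made up, and the z-scores use the sample standard deviation (dividing by n − 1), which is what yields roughly 1.22 for this data.

```python
import statistics

def min_max_scale(values):
    """Rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def z_scores(values):
    """Standardize values using the sample standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

prices = [250, 200, 300, 275, 225]
scaled_prices = min_max_scale(prices)

numbers = [1, 2, 2, 3, 3, 3, 4, 4, 5]
standardized = z_scores(numbers)
```

Scikit-Learn's StandardScaler divides by the population standard deviation instead, so its output for the same numbers is slightly larger in magnitude.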


The post Day 3- Data Science Interview Prepartion appeared first on Krish Naik.

The post Day 2- Data Science Interview Prepartion appeared first on Krish Naik.

This post will help you prepare for a Data Science interview (**30 days of Interview Preparation**) by covering everything from the fundamentals to the more advanced levels of job interview questions and answers. Let’s take a look:

1. **What is Logistic Regression**

2. **Logistic vs linear regression**

3. **Decision Tree**

4. **Random Forest Algorithm**

5. **Ensemble Methods**

6. **SVM Classification, Naive Bayes Classification & Gaussian Naive Bayes, Confusion Matrix**

**Q1. What is Logistic Regression?**

**Ans:-**

Logistic regression is used when the dependent variable is binary (0 or 1, true or false, yes or no), which means that the outcome can take only one of two forms. For example, it can be used when we need to find the probability of an event succeeding or failing.

__Model__

Output = 0 or 1

Z = WX + B

hΘ(x) = sigmoid(Z) = 1 / (1 + e^(−Z))

log(hΘ(x) / (1 – hΘ(x))) = WX + B

If ‘Z’ goes to infinity, Y(predicted) approaches 1, and if ‘Z’ goes to negative infinity, Y(predicted) approaches 0.

The output of the hypothesis is the estimated probability. It indicates how confident the model is that the actual value is 1 for a given input X.

**Cost Function**

**Cost ( hΘ(x) , Y(Actual)) = -log (hΘ(x)) if y=1**

**-log (1 – hΘ(x)) if y=0**
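The model and cost function translate almost line by line into code. This is a minimal sketch with hypothetical weights and inputs, not a training routine; it only evaluates the hypothesis and the cost for a single example.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """h(x) = sigmoid(W.X + B): estimated probability that y = 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def cost(h, y):
    """-log(h) if y = 1, -log(1 - h) if y = 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Hypothetical weights, bias, and input.
w, b = [0.8, -0.4], 0.1
h = predict_proba([2.0, 1.0], w, b)
loss_if_positive = cost(h, 1)
loss_if_negative = cost(h, 0)
```

The cost grows without bound as a confident prediction turns out to be wrong, which is what pushes the model toward well-calibrated probabilities.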

**Q2. Difference between logistic and linear regression?**

**Ans 2:**

Linear and Logistic regression are the most basic form of regression which are commonly used. The essential difference between these two is that Logistic regression is used when the dependent variable is binary. In contrast, Linear regression is used when the dependent variable is continuous, and the nature of the regression line is linear.

**Key Differences between Linear and Logistic Regression**

Linear regression models a continuous numeric target. In contrast, logistic regression models a binary target.

Linear regression requires a linear relationship between the dependent and independent variables, whereas logistic regression does not (it assumes linearity only in the log-odds).

In both models, the independent variables should ideally not be strongly correlated with each other, because multicollinearity inflates the variance of the coefficient estimates.

**Q3. Why we can’t do a classification problem using Regression?**

**Ans 3:**

With linear regression you fit a line (or, more generally, a polynomial) through the data. Say we fit a straight line through a {tumor size, tumor type} sample set.

Malignant tumors get 1, non-malignant ones get 0, and the fitted line is our hypothesis h(x). To make predictions, we may say that for any given tumor size x, if h(x) is bigger than 0.5, we predict a malignant tumor. Otherwise, we predict benign.

It looks like this way we could correctly predict every single training set sample, but now let’s change the task a bit.

Intuitively it’s clear that all tumors larger than a certain threshold are malignant. So let’s add another sample with a huge tumor size, and run linear regression again.

Now our h(x) > 0.5 → malignant rule doesn’t work anymore. To keep making correct predictions, we need to change it to h(x) > 0.2 or something similar, but that is not how the algorithm should work.

We cannot change the hypothesis each time a new sample arrives. Instead, we should learn it from the training set data, and then (using the hypothesis we’ve learned) make correct predictions for the data we haven’t seen before.

*Linear regression is unbounded: its output is not restricted to the range [0, 1], so it cannot be read directly as a class probability.*

**Q4. What is Decision Tree?**

**Ans 4:**

A decision tree is a type of supervised learning algorithm that can be used for classification as well as regression problems. The input to a decision tree can be both continuous and categorical. The decision tree works on if-then statements and tries to solve a problem by using a tree representation (nodes and leaves).

Assumptions while creating a decision tree:

1. Initially, the whole training set is considered as the root.

2. Feature values are preferred to be categorical; if continuous, they are discretized.

3. Records are distributed recursively on the basis of attribute values.

4. Which attributes are placed in the root node or an internal node is decided using a statistical approach.

**Q5. Entropy, Information Gain, Gini Index, Reducing Impurity?**

**Ans 5:**

There are different attributes which define the split of nodes in a decision tree. There are few algorithms to find the optimal split

*1) ID3 (Iterative Dichotomiser 3):* This solution uses Entropy and Information Gain as metrics to form a better decision tree. The attribute with the highest information gain is used as the root node, and a similar approach is followed after that. Entropy is the measure that characterizes the impurity of an arbitrary collection of examples.

For a two-class problem, entropy varies from 0 to 1: it is 0 if all the data belong to a single class and 1 if the classes are equally represented. In this way, entropy gives a measure of impurity in the dataset.

Steps to decide which attribute to split:

1. Compute the entropy for the dataset

2. For every attribute:

2.1 Calculate entropy for all categorical values.

2.2 Take average information entropy for the attribute.

2.3 Calculate gain for the current attribute.

3. Pick the attribute with the highest information gain.

4. Repeat until we get the desired tree.

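A leaf node is decided when entropy is zero.

Information Gain = Entropy(S) – ∑ (|Sb|/|S|) * Entropy(Sb)

Sb – subset produced by the split, S – entire data. When the classes in S are perfectly balanced, Entropy(S) = 1 and the formula reduces to 1 – ∑ (|Sb|/|S|) * Entropy(Sb).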
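A short sketch of the entropy and information gain calculation, using base-2 logarithms and a made-up split of ten labels. The function names are chosen for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(subset) / total * entropy(subset) for subset in subsets)
    return entropy(parent) - weighted

# Hypothetical node with balanced classes, split into two purer subsets.
parent = ["yes"] * 5 + ["no"] * 5
subsets = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(parent, subsets)
```

The balanced parent has entropy 1, each subset has entropy of about 0.72, so the split gains about 0.28 bits.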

**2) CART Algorithm (Classification and Regression Trees):** In CART, we use the Gini index as a metric. The Gini index is used as a cost function to evaluate a split in a dataset.

Steps to calculate Gini for a split:

1. Calculate Gini for the sub-nodes, using the formula: sum of the squares of the probabilities of success and failure (p² + q²).

2. Calculate Gini for the split using the weighted Gini score of each node of that split.

**Choose the split with the higher Gini value.** Note that p² + q² as defined here is a purity score; the Gini impurity used by many libraries is 1 − (p² + q²), for which the lower value is better.

**Split on Gender:**

Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68

Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55

Weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59

**Similar for Split on Class:**

Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51

Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51

Weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51

Here the weighted Gini is higher for Gender, so we split on Gender.
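The worked numbers can be checked with a few lines of code. This sketch uses the node sizes and success probabilities from the example and the same p² + q² score (function names are chosen for this example).

```python
def gini_score(p):
    """Purity score p^2 + q^2 for a node with success probability p."""
    q = 1 - p
    return p * p + q * q

def weighted_gini(nodes):
    """Weighted score of a split; nodes is a list of (size, p) pairs."""
    total = sum(size for size, _ in nodes)
    return sum(size / total * gini_score(p) for size, p in nodes)

# (number of students in the sub-node, probability of success)
gender_split = weighted_gini([(10, 0.2), (20, 0.65)])
class_split = weighted_gini([(14, 0.43), (16, 0.56)])
best = "Gender" if gender_split > class_split else "Class"
```

Rounded to two decimals, this gives 0.59 for Gender and 0.51 for Class, matching the calculation by hand.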

**Q6. How to control leaf height and Pruning?**

**Ans 6:**

To control the leaf size, we can set the parameters:-

**1. Maximum depth :**

Maximum tree depth is a limit to stop the further splitting of nodes when the specified tree depth has been reached during the building of the initial decision tree.

**NEVER use maximum depth to limit the further splitting of nodes. In other words: use the largest possible value.**

**2. Minimum split size:**

Minimum split size is a limit to stop the further splitting of nodes when the number of observations in the node is lower than the minimum split size.

This is a good way to limit the growth of the tree. When a leaf contains too few observations, further splitting will result in overfitting (modeling of noise in the data).

**3. Minimum leaf size**

Minimum leaf size is a limit that prevents a node from being split when the number of observations in one of the resulting child nodes would be lower than the minimum leaf size.

**Pruning** is mostly done to reduce the chances of overfitting the tree to the training data and to reduce the overall complexity of the tree.

There are two types of pruning: **Pre-pruning** and **Post-pruning.**

1. Pre-pruning is also known as the **early stopping criteria**. As the name suggests, the criteria are set as parameter values while building the model. The tree stops growing when it meets any of these pre-pruning criteria, or when it discovers pure classes.

2. In Post-pruning, the idea is to allow the decision tree to grow fully and observe the CP value. Next, we prune/cut the tree with the optimal **CP (Complexity Parameter)** value as the parameter.

The CP (complexity parameter) is used to control tree growth. If the cost of adding a variable is higher than the value of CP, tree growth stops.

**Q7. How to handle a decision tree for numerical and categorical data?**

**Ans 7:**

Decision trees can handle both categorical and numerical variables at the same time as features. There is not any problem in doing that.

Every split in a decision tree is based on a feature.

**1. If the feature is categorical, the split is done with the elements belonging to a particular class.**

**2. If the feature is continuous, the split is done with the elements higher than a threshold.**

At every split, the decision tree will take the best variable at that moment. This will be done according to an impurity measure computed on the split branches. Whether the variable used for the split is categorical or continuous is irrelevant (in fact, decision trees categorize continuous variables by creating binary regions with the threshold). Note that some implementations, such as scikit-learn’s, accept only numeric input, so in practice categorical features are usually converted to numbers first, for example with LabelEncoder or OneHotEncoder.

**Q8. What is the Random Forest Algorithm?**

**Ans 8:**

Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The base estimators in the random forest are decision trees. Random forest randomly selects a set of features that are used to decide the best split at each node of the decision tree.

Looking at it step-by-step, this is what a random forest model does:

1. Random subsets are created from the original dataset (bootstrapping).

2. At each node in the decision tree, only a random set of features are considered to decide the best split.

3. A decision tree model is fitted on each of the subsets.

4. The final prediction is calculated by aggregating the predictions from all decision trees: averaging for regression, or a majority vote for classification.

*To sum up, the Random forest randomly selects data points and features and builds multiple trees (Forest).*

Random Forest can also be used for feature selection based on feature importance. The attribute (.feature_importances_) is used to find feature importance.

Some Important Parameters:-

**1. n_estimators:-** It defines the number of decision trees to be created in a random forest.

**2. criterion:-** “gini” or “entropy”.

**3. min_samples_split:-** The minimum number of samples a node must contain before a split is attempted.

**4. max_features:-** The number of features considered when looking for the best split at each node.

**5. n_jobs:-** The number of jobs to run in parallel for both fit and predict. Set it to -1 to use all the cores for parallel processing.

**Q9. What is Variance and Bias tradeoff?**

**Ans 9:**

In predictive models, the reducible prediction error is composed of two different errors:

**1. Bias**

**2. Variance**

It is important to understand the bias-variance trade-off, which describes how to balance bias and variance in the prediction so that the model avoids both overfitting and underfitting.

**Bias:** It is the difference between the expected or average prediction of the model and the correct value which we are trying to predict. Imagine that we build more than one model by collecting different data sets; when we evaluate them later, we may end up with a different prediction from each model. Bias measures how far these model predictions are, on average, from the correct prediction. High bias leads to high error on both training and test data.

**Variance:** Variability of a model prediction for a given data point. We can build the model multiple times, so the variance is how much the predictions for a given point vary between different realizations of the model.

**For example:** suppose a poll of 50 people gives the following result:

Voting Republican – 13

Voting Democratic – 16

Non-Respondent – 21

Total – 50

The probability of voting Republican is 13/(13+16), or 44.8%. We put out our press release that the Democrats are going to win by over 10 points; but, when the election comes around, it turns out they lose by 10 points. That certainly reflects poorly on us. Where did we go wrong in our model?

**Bias scenarios:** Using a phone book to select participants in our survey is one of our sources of bias. Surveying only certain classes of people skews the results in a way that would be consistent if we repeated the entire model building exercise. Similarly, not following up with respondents is another source of bias, as it consistently changes the mixture of responses we get. On a bulls-eye diagram, these move us away from the center of the target, but they would not result in an increased scatter of estimates.

**Variance scenarios:** The small sample size is a source of variance. If we increased our sample size, the results would be more consistent each time we repeated the survey and prediction. The results still might be highly inaccurate due to our large sources of bias, but the variance of predictions would be reduced.

**Q10. What are Ensemble Methods?**

**Ans 10:**

Decision trees have been around for a long time and are known to suffer from bias and variance. You will have a large bias with simple trees and a large variance with complex trees.

**Ensemble methods** combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner.

Two techniques to perform ensemble decision trees:

1. Bagging

2. Boosting

**Bagging (Bootstrap Aggregation)** is used when our goal is to reduce the variance of a decision tree. Here the idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset is then used to train its own decision tree. As a result, we end up with an ensemble of different models. The average of the predictions from the different trees is used, which is more robust than a single decision tree.
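The mechanics of bagging (sample with replacement, fit one model per sample, aggregate) can be sketched without any library. To keep it short, the base learner here is deliberately trivial and is not a decision tree: each "model" just predicts the majority class of its bootstrap sample. The labels and the number of models are made up.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Most common prediction (used to aggregate classifiers)."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the run is repeatable
labels = ["spam"] * 7 + ["ham"] * 3

# Trivial stand-in for a base learner: predict the sample's majority class.
model_predictions = [majority_vote(bootstrap_sample(labels, rng))
                     for _ in range(25)]
ensemble_prediction = majority_vote(model_predictions)
```

Individual bootstrap samples can disagree, but the aggregated vote is much more stable, which is the variance reduction that bagging aims for. For regression, the vote would be replaced by an average.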

**Boosting** is another ensemble technique to create a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and then analyzing the data for errors. In other words, we fit consecutive trees (each on a random sample), and at every step the goal is to reduce the net error from the prior tree.

When an input is misclassified by a hypothesis, the weight of that input is increased, so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts the weak learners into a better-performing model.
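The re-weighting step can be sketched with the update rule used by AdaBoost. This shows a single round only, under the assumption that the weighted error is strictly between 0 and 1; `adaboost_reweight` is an illustrative name, not a library function.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost round: raise the weight of misclassified samples, lower the rest."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)   # say of this weak learner
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated], alpha    # weights renormalised to sum to 1

weights = [0.2] * 5                               # start with uniform weights
misclassified = [False, False, True, False, False]
new_weights, alpha = adaboost_reweight(weights, misclassified)
print(new_weights)
```

After one round the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right.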

The different types of boosting algorithms are:

**1. AdaBoost**

**2. Gradient Boosting**

**3. XGBoost**

**Q11. What is SVM Classification?**

**Ans 11:**

SVM (Support Vector Machine), also called a large margin classifier, is a supervised learning algorithm that classifies data by finding the hyperplane that separates the classes with the largest possible margin.

We have two types of SVM classifiers:

**1) Linear SVM:** In Linear SVM, the data points are expected to be separated by some apparent gap. Therefore, the SVM algorithm finds a straight hyperplane dividing the two classes. This hyperplane is also called the maximum margin hyperplane.

**2) Non-Linear SVM:** It is possible that our data points are not linearly separable in a p-dimensional space, but can be linearly separable in a higher dimension. Kernel tricks make it possible to draw nonlinear decision boundaries. Some standard kernels are a) the Polynomial kernel and b) the RBF kernel (the most commonly used).
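The two kernels named above can be written in a few lines. This is a plain-Python sketch of the standard formulas; the default values for `degree`, `coef0` and `gamma` are arbitrary choices for illustration.

```python
import math

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

print(polynomial_kernel([1, 2], [3, 4]))  # (1*3 + 2*4 + 1) ** 2
print(rbf_kernel([1, 2], [1, 2]))         # identical points give the maximum value
print(rbf_kernel([0, 0], [1, 1]))         # similarity decays with distance
```

A kernel returns the similarity the points would have in a higher-dimensional space without ever computing that space explicitly, which is what lets a linear method draw a nonlinear boundary.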

**Advantages of SVM classifier:**

1) SVMs are effective when the number of features is quite large.

2) It works effectively even if the number of features is greater than the number of samples.

3) Non-Linear data can also be classified using customized hyperplanes built by using kernel trick.

4) It is a robust model to solve prediction problems since it maximizes margin.

**Disadvantages of SVM classifier:**

1) The biggest limitation of the Support Vector Machine is the choice of the kernel. The wrong choice of the kernel can lead to an increase in error percentage.

2) With a very large number of samples, training becomes slow and performance tends to suffer.

3) SVMs have good generalization performance, but they can be extremely slow in the test phase.

4) SVMs have high algorithmic complexity and extensive memory requirements due to the use of quadratic programming.

**Q12. What is Naive Bayes Classification and Gaussian Naive Bayes**

**Ans 12:**

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A|B) = {P(B|A) P(A)}/{P(B)}

Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:

P(y|X) = {P(X|y) P(y)}/{P(X)}

where, y is class variable and X is a dependent feature vector (of size n) where:

X = (x_1,x_2,x_3,…..,x_n)

To be clear, an example of a feature vector and corresponding class variable can be (refer to the 1st row of the dataset):

X = (Rainy, Hot, High, False)

y = No

So basically, P(y|X) here means the probability of “Not playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”.

**Naive Bayes Classification:**

1. We assume that no pair of features are dependent. For example, the temperature being ‘Hot’ has nothing to do with the humidity, or the outlook being ‘Rainy’ does not affect the winds. Hence, the features are assumed to be independent.

2. Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and humidity can’t predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.

**Gaussian Naive Bayes**

Continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.

Fitting the model is as simple as calculating the mean and standard deviation of each input variable (x) for each class value.

Mean(x) = 1/n * sum(x)

where n is the number of instances and x is the values of an input variable in your training data. We can calculate the standard deviation using the following equation:

Standard deviation(x) = sqrt(1/n * sum((xi - mean(x))^2))

When to use what? Standard Naive Bayes only supports categorical features, while Gaussian Naive Bayes only supports continuously valued features.
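Putting the pieces together, a Gaussian Naive Bayes classifier can be sketched in plain Python. The tiny dataset and the function names are made up for illustration, and the sketch assumes every feature has a non-zero standard deviation within each class.

```python
import math
import statistics
from collections import defaultdict

def fit_gaussian_nb(samples):
    """samples: list of (feature_tuple, label). Returns prior, means, stds per class."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))                # one tuple of values per feature
        model[label] = {
            "prior": len(rows) / len(samples),
            "means": [statistics.fmean(c) for c in columns],
            "stds": [statistics.pstdev(c) for c in columns],  # 1/n form, as above
        }
    return model

def log_gaussian(x, mean, std):
    """Log of the Gaussian density, used to avoid numerical underflow."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def predict(model, features):
    """Pick the class with the largest log P(y) + sum of log P(x_i | y)."""
    scores = {}
    for label, p in model.items():
        scores[label] = math.log(p["prior"]) + sum(
            log_gaussian(x, m, s) for x, m, s in zip(features, p["means"], p["stds"])
        )
    return max(scores, key=scores.get)

# Hypothetical data: two continuous features per sample
samples = [
    ((1.0, 20.0), "no"), ((1.2, 22.0), "no"), ((0.8, 21.0), "no"),
    ((3.0, 30.0), "yes"), ((3.2, 31.0), "yes"), ((2.8, 29.0), "yes"),
]
model = fit_gaussian_nb(samples)
print(predict(model, (1.1, 21.0)), predict(model, (2.9, 30.5)))
```

The independence assumption shows up in `predict`: the per-feature log-likelihoods are simply added, with no interaction terms between features.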

**Q13. What is the Confusion Matrix?**

**Ans 13:**

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class.

This is the key to the confusion matrix.

It gives us insight not only into the errors being made by a classifier but, more importantly, the types of errors that are being made.

Here,

**Class 1: Positive**

**Class 2: Negative**

Definition of the Terms:

**1. Positive (P):** Observation is positive (for example: is an apple).

**2. Negative (N):** Observation is not positive (for example: is not an apple).

**3. True Positive (TP):** Observation is positive, and is predicted to be positive.

**4. False Negative (FN):** Observation is positive, but is predicted negative.

**5. True Negative (TN):** Observation is negative, and is predicted to be negative.

**6. False Positive (FP):** Observation is negative, but is predicted positive.
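The four counts can be tallied directly from the true and predicted labels. A minimal sketch with made-up labels, where 1 stands for the positive class:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN and FP for a binary classification problem."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1   # positive, predicted positive
        elif actual == positive:
            fn += 1   # positive, predicted negative
        elif predicted == positive:
            fp += 1   # negative, predicted positive
        else:
            tn += 1   # negative, predicted negative
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```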

**Q14. What is Accuracy and Misclassification Rate?**

**Ans 14:**

**Accuracy**

Accuracy is defined as the ratio of the sum of True Positives and True Negatives to the total number of observations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

**However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor, or terrible depending upon the problem.**

**Misclassification Rate**

Misclassification Rate is defined as the ratio of the sum of False Positives and False Negatives to the total number of observations:

Misclassification Rate = (FP + FN) / (TP + TN + FP + FN)

Misclassification Rate is also called Error Rate.
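To see why a high accuracy can still be terrible, consider a hypothetical imbalanced dataset of 990 negatives and 10 positives, and a model that predicts "negative" for everything. The counts below are invented for the example.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def misclassification_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)

# The model never predicts positive: all 10 positives become false negatives.
tp, fn, tn, fp = 0, 10, 990, 0

acc = accuracy(tp, tn, fp, fn)
err = misclassification_rate(tp, tn, fp, fn)
print(f"accuracy = {acc:.2%}, error rate = {err:.2%}")
```

The model scores 99% accuracy while missing every single positive case, which is exactly the situation where accuracy alone misleads.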

**Q15. True Positive Rate & True Negative Rate**

**Ans 15:**

**True Positive Rate:**

**Sensitivity (SN)** is calculated as the number of correct positive predictions divided by the total number of positives. It is also called **Recall (REC)** or true positive rate (TPR). The best sensitivity is 1.0, whereas the worst is 0.0.

**True Negative Rate:**

**Specificity (SP)** is calculated as the number of correct negative predictions divided by the total number of negatives. It is also called the true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.

**Q16. What is False Positive Rate & False negative Rate?**

**Ans 16:**

**False Positive Rate**

False positive rate (FPR) is calculated as the number of incorrect positive predictions divided by the total number of negatives. The best false positive rate is 0.0, whereas the worst is 1.0. It can also be calculated as 1 – specificity.

**False Negative Rate**

False negative rate (FNR) is calculated as the number of incorrect negative predictions divided by the total number of positives. The best false negative rate is 0.0, whereas the worst is 1.0. It can also be calculated as 1 – sensitivity.
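All four rates follow from the same confusion-matrix counts. A short sketch with made-up counts:

```python
def rates(tp, fn, tn, fp):
    """Return the four rates derived from a binary confusion matrix."""
    positives = tp + fn   # all observations that are actually positive
    negatives = tn + fp   # all observations that are actually negative
    return {
        "TPR": tp / positives,  # sensitivity / recall
        "TNR": tn / negatives,  # specificity
        "FPR": fp / negatives,  # 1 - specificity
        "FNR": fn / positives,  # 1 - sensitivity
    }

r = rates(tp=40, fn=10, tn=45, fp=5)
print(r)
```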

**Q17. What are F1 Score, precision and recall?**

**Ans 17:**

Recall:

Recall can be defined as the ratio of the number of correctly classified positive examples to the total number of positive examples.

1. High Recall indicates the class is correctly recognized (small number of FN).

2. Low Recall indicates the class is incorrectly recognized (large number of FN).

Recall is given by the relation:

Recall = TP / (TP + FN)

Precision:

To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples.

1. High Precision indicates an example labeled as positive is indeed positive (a small number of FP).

2. Low Precision indicates an example labeled as positive is often not actually positive (a large number of FP).

Precision is given by the relation:

Precision = TP / (TP + FP)

Remember:-

High recall, low precision: This means that most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.

Low recall, high precision: This shows that we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).

**F-measure/F1-Score:**

Since we have two measures (Precision and Recall), it helps to have a measurement that represents both of them. We calculate an **F-measure, which uses Harmonic Mean in place of Arithmetic Mean as it punishes the extreme values more.**

**The F-Measure will always be nearer to the smaller value of Precision or Recall.**
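A short sketch of the three measures, using invented counts for a model with perfect recall but very poor precision, to show how the harmonic mean differs from the arithmetic mean:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical model: finds every positive (FN = 0) but raises many false alarms.
tp, fp, fn = 10, 90, 0
p = precision(tp, fp)          # 0.1
r = recall(tp, fn)             # 1.0
f1 = f1_score(p, r)
arithmetic_mean = (p + r) / 2

print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f} arithmetic mean={arithmetic_mean:.2f}")
```

The arithmetic mean (0.55) hides the weak precision, while F1 (about 0.18) stays close to the smaller of the two values.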

**Q18. What is RandomizedSearchCV?**

**Ans 18:**

RandomizedSearchCV is used to perform a random search over hyperparameters. It implements fit and score methods, and also exposes predict, predict_proba, decision_function, transform, etc. when the underlying estimator implements them. The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.

In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.

Code example (the constructor signature as it appeared in an older scikit-learn release; `fit_params` and `iid` were removed in later versions):

    class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score='raise-deprecating', return_train_score='warn')
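The idea behind random search can be sketched without scikit-learn. In this illustration the `objective` function is a made-up stand-in for a cross-validated score, and the sampling ranges are arbitrary; only the structure (sample `n_iter` settings, keep the best) mirrors what RandomizedSearchCV does.

```python
import random

def objective(params):
    """Toy score (higher is better); stands in for a cross-validated score."""
    return -((params["C"] - 10.0) ** 2) - (params["gamma"] - 0.1) ** 2

def random_search(n_iter, seed=0):
    """Sample n_iter parameter settings from distributions and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {
            "C": 10 ** rng.uniform(-1, 3),      # log-uniform over [0.1, 1000]
            "gamma": 10 ** rng.uniform(-4, 1),  # log-uniform over [0.0001, 10]
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(n_iter=20)
print(best_params, best_score)
```

The cost is fixed by `n_iter` no matter how many hyperparameters there are or how finely their ranges are defined.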

**Q19. What is GridSearchCV?**

**Ans 19:**

Grid search is the process of performing hyperparameter tuning to determine the optimal values for a given model.

Code example:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    # Exhaustive search over every combination in param_grid, scored with 5-fold CV
    gsc = GridSearchCV(
        estimator=SVR(kernel='rbf'),
        param_grid={
            'C': [0.1, 1, 100, 1000],
            'epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
            'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5],
        },
        cv=5,
        scoring='neg_mean_squared_error',
        verbose=0,
        n_jobs=-1,
    )

Grid search runs the model on every combination of the listed hyperparameter values and outputs the best model.
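To get a feel for the cost of an exhaustive search, the combinations in a grid like the one above can be enumerated with the standard library. This sketch only counts the work involved; it does not train any model.

```python
from itertools import product

param_grid = {
    "C": [0.1, 1, 100, 1000],
    "epsilon": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
    "gamma": [0.0001, 0.001, 0.005, 0.1, 1, 3, 5],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5
n_fits = len(combinations) * cv_folds   # one model fit per combination per fold

print(len(combinations), "combinations,", n_fits, "model fits")
```

The number of fits grows multiplicatively with every added hyperparameter or value, which is why random search is often preferred for large search spaces.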

**Q20. What is BayesSearchCV?**

**Ans 20:**

Bayesian search, in contrast to grid and random search, keeps track of past evaluation results, which it uses to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function.

Code:

    from skopt import BayesSearchCV
    from sklearn.svm import SVC

    opt = BayesSearchCV(
        SVC(),
        {
            'C': (1e-6, 1e+6, 'log-uniform'),
            'gamma': (1e-6, 1e+1, 'log-uniform'),
            'degree': (1, 8),  # integer valued parameter
            'kernel': ['linear', 'poly', 'rbf'],  # categorical parameter
        },
        n_iter=32,
        cv=3,
    )

**Q21. What is ZCA Whitening?**

**Ans 21:**

Zero-phase Component Analysis (ZCA):

Transforming the data so that its covariance matrix becomes the identity matrix is called whitening. This removes the first- and second-order statistical structure.

ZCA transforms the data to zero mean and makes the features linearly uncorrelated with each other. In some image analysis applications, especially when working with small color images, it is often useful to apply whitening to the data before, e.g., training a classifier.
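A minimal NumPy sketch of ZCA whitening on synthetic correlated data. The `mixing` matrix and the small `eps` stabiliser are assumptions made for this example.

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Return a zero-mean copy of X whose covariance matrix is (nearly) the identity."""
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / X_centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)          # cov is symmetric
    # ZCA matrix: rotate to the eigenbasis, rescale, rotate back
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_centered @ W

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ mixing              # correlated features

X_white = zca_whiten(X)
cov_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(cov_white, 3))
```

Rotating back to the original axes is what distinguishes ZCA from PCA whitening: the whitened data stays as close as possible to the original, which is useful for images.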

**If you are looking for affordable tech courses such as data science, machine learning, deep learning, cloud and many more, you can go ahead with the** **iNeuron oneneuron platform, where you will be able to get 200+ tech courses at an affordable price with lifetime access.**

The post Day 2- Data Science Interview Prepartion appeared first on Krish Naik.

The post Day 1- Data Science Interview Preparation appeared first on Krish Naik.

This post will help you prepare for a Data Science interview (**30 days of Interview Preparation**) by covering everything from the fundamentals to the more advanced levels of job interview questions and answers. Let’s take a look:

- **AI vs Data Science vs ML vs DL**
- **Supervised learning vs Unsupervised learning vs Reinforcement learning**
- **Architecture of Machine learning**
- **OLS Stats Model**
- **Linear Regression**

**Q1. What is the difference between AI, Data Science, ML, and DL?**

**Ans 1:**

**Artificial Intelligence:** AI began as a purely mathematical and scientific exercise, but once it became computational, it started to solve human problems and was formalized into a subset of computer science. Artificial intelligence has changed the original computational statistics paradigm to the modern idea that machines could mimic actual human capabilities, such as decision making and performing more “human” tasks. Modern AI falls into two categories:

1. *General AI* – planning, decision making, identifying objects, recognizing sounds, social & business transactions

2. *Applied AI* – driverless/autonomous cars, or machines that smartly trade stocks

**Machine Learning:** Instead of engineers “teaching” or programming computers to have what they need to carry out tasks, perhaps computers could teach themselves, that is, learn something without being explicitly programmed to do so. ML is a form of AI in which, based on more data, systems can change their actions and responses, which makes them more efficient, adaptable and scalable, e.g., navigation apps and recommendation engines. ML is classified into:

**1. Supervised**

**2. Unsupervised**

**3. Reinforcement learning**

**Data Science:** Data science has many tools, techniques, and algorithms culled from these fields, plus others, to handle big data.

The goal of data science, somewhat similar to machine learning, is to make accurate predictions and to automate and perform transactions in real-time, such as purchasing internet traffic or automatically generating content.

Data science relies less on math and coding and more on data and building new systems to process the data. Relying on the fields of data integration, distributed architecture, automated machine learning, data visualization, data engineering, and automated data-driven decisions, data science can cover an entire spectrum of data processing, not only the algorithms or statistics related to data.

**Deep Learning:** It is a technique for implementing ML.

ML provides the desired output from a given input, but DL reads the input and applies what it learns to other data. In ML, we can easily classify a flower based upon its features. Suppose instead you want a machine to look at an image and determine what it represents to the human eye, whether a face, flower, landscape, truck, building, etc.

Machine learning is not sufficient for this task because machine learning can only produce an output from a data set – whether according to a known algorithm or based on the inherent structure of the data. You might be able to use machine learning to determine whether an image was of an “X” – a flower, say – and it would learn and get more accurate. But that output is binary (yes/no) and is dependent on the algorithm, not the data. In the image recognition case, the outcome is not binary and not dependent on the algorithm.

The neural network performs micro-calculations across many layers. Neural networks also support weighting data for ‘confidence’. This results in a probabilistic system, as opposed to a deterministic one, which can handle tasks that we think of as requiring more ‘human-like’ judgment.

**Q2. What is the difference between Supervised learning, Unsupervised learning and Reinforcement learning?**

**Ans 2:**

**Machine Learning:**

Machine learning is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead.

In practice, this means building a model by learning the patterns in historical data, and the relationships within it, in order to make data-driven predictions.

**Types of Machine Learning**

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning

**Supervised learning:**

In a supervised learning model, the algorithm learns on a labeled dataset, to generate reasonable predictions for the response to new data. (Forecasting outcome of new data)

• Regression

• Classification

**Unsupervised learning**

An unsupervised model, in contrast, is given unlabelled data that the algorithm tries to make sense of by extracting features, co-occurrences and underlying patterns on its own. We use unsupervised learning for

• Clustering

• Anomaly detection

• Association

• Autoencoders

**Reinforcement Learning**

Reinforcement learning is less supervised: a learning agent interacts with an environment, tries different possible actions, and uses the rewards it receives to arrive at the best possible solution.

**Q3. Describe the general architecture of Machine learning.**

**Ans 3:**

**Business understanding:** Understand the given use case. It’s also good to know more about the domain for which the use cases are built.

**Data Acquisition and Understanding:** Data gathering from different sources and understanding the data. Cleaning the data, handling the missing data if any, data wrangling, and EDA( Exploratory data analysis).

**Modeling:** Feature Engineering – scaling the data, feature selection – not all features are important. We use the backward elimination method, correlation factors, PCA and domain knowledge to select the features.

*Model Training:* Based on trial and error or on experience, we select the algorithm and train it with the selected features.

*Model Evaluation:* We check the accuracy of the model, the confusion matrix and cross-validation. If accuracy is not high enough, we tune the model, either by changing the algorithm used, by feature selection, by gathering more data, etc.

**Deployment –** Once the model has good accuracy, we deploy the model either in the cloud, on a Raspberry Pi, or any other place. Once we deploy, we monitor the performance of the model. If it is good, we go live with the model; otherwise we reiterate the whole process until the model performance is good.

It’s not done yet!!!

What if, after a few days, our model performs badly because of new data? In that case, we go through the whole process again, collecting new data and redeploying the model.

**Q4. What is Linear Regression?**

**Ans 4:**

Linear Regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) by finding the best-fit straight line.

The equation for the linear model is Y = mX + c, where m is the slope and c is the intercept.

If we plot ‘y’ against ‘x’, there is usually no straight line that runs through all the data points. So, the objective here is to fit the straight line that minimizes the error between the predicted and actual values.
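For a single feature, the best-fit slope and intercept have a closed-form least-squares solution. A minimal sketch in plain Python, with made-up data points:

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for Y = mX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x   # the fitted line passes through (mean_x, mean_y)
    return m, c

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]             # points lying exactly on y = 2x + 1
m, c = fit_line(xs, ys)
print(m, c)
```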

**Q5. OLS Stats Model (Ordinary Least Square)**

**Ans 5:**

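OLS is a stats model that helps us identify the more significant features that can have an influence on the output. An OLS model in Python is executed as:

    import statsmodels.formula.api as smf

    # `data` is a pandas DataFrame with the columns Sales, am and constant
    lm = smf.ols(formula='Sales ~ am+constant', data=data).fit()
    lm.conf_int()   # confidence intervals for the coefficients
    lm.summary()    # full regression report (coefficients, t-values, p-values)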

And we get the output as below,

**The higher the absolute t-value for a feature, the more significant the feature is to the output variable.** The p-value also plays a role in rejecting the null hypothesis (the null hypothesis states that the feature has zero significance on the target variable). **If the p-value is less than 0.05 (the 5% significance level, corresponding to a 95% confidence level) for a feature, then we can consider the feature to be significant.**

**Q6. What is L1 Regularization (L1 = lasso) ?**

**Ans 6:**

The main objective of creating a model (on training data) is making sure it fits the data properly and reduces the loss. Sometimes a trained model fits the training data but gives poor performance on new data (test data). This is called overfitting. Regularization was introduced to overcome overfitting.

Lasso Regression (**Least Absolute Shrinkage and Selection Operator**) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.

Lasso shrinks the less important features’ coefficients to zero, thus removing some features altogether. So, this works well for feature selection in case we have a huge number of features.

Methods like cross-validation and stepwise regression, used to handle overfitting and perform feature selection, work well with a small set of features. Regularization techniques such as lasso are a good alternative when we are dealing with a large set of features.

Along with shrinking coefficients, the lasso performs feature selection as well (remember the ‘selection’ in the lasso full form?), because some of the coefficients become exactly zero, which is equivalent to the particular feature being excluded from the model.
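Why the L1 penalty produces exact zeros can be seen in the simplest possible case: one feature and no intercept. Assuming the objective 0.5 * sum((y - b*x)^2) + lam * |b|, the minimiser is given by the soft-thresholding rule sketched below. The data and the `lam` values are made up for illustration.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_1d(xs, ys, lam):
    """Minimiser of 0.5 * sum((y - b*x)^2) + lam * |b| for a single coefficient b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx

xs = [1, 2, 3]
ys = [2, 4, 6]                       # exact relationship y = 2x
for lam in (0, 7, 28, 100):
    print(lam, lasso_1d(xs, ys, lam))
```

With no penalty the coefficient is the least-squares value 2.0; a moderate penalty shrinks it to 1.5; once the penalty is large enough the coefficient is exactly 0, meaning the feature has been dropped.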

**Q7. L2 Regularization(L2 = Ridge Regression)**

**Ans 7:**

Overfitting happens when the model learns the signal as well as the noise in the training data and doesn’t perform well on new/unseen data on which it wasn’t trained.

There are several ways to avoid overfitting your model on training data, such as **cross-validation sampling, reducing the number of features, pruning, regularization,** etc.

**So to avoid overfitting, we perform Regularization.**

**The regression model that uses L2 regularization is called Ridge Regression. The formula for Ridge Regression:**

Loss = sum_i (y_i - yhat_i)^2 + lambda * sum_j (b_j)^2

where yhat_i is the model’s prediction for observation i and b_j are the model coefficients.

**Regularization adds a penalty as model complexity increases. The regularization parameter (lambda) penalizes all the parameters except the intercept so that the model generalizes from the data and won’t overfit.**

**Ridge regression adds the “squared magnitude of the coefficients” as a penalty term to the loss function. Here the second term, lambda * sum_j (b_j)^2, is the L2 regularization element/term.**

**Lambda is a hyperparameter.**

If lambda is zero, then it is equivalent to OLS. But if lambda is very large, then it will add too much weight, and it will lead to under-fitting.

Ridge regularization forces the weights to be small but does not make them zero and does not give the sparse solution.

Ridge is not robust to outliers as square terms blow up the error differences of the outliers, and the regularization term tries to fix it by penalizing the weights

Ridge regression performs better when all the input features influence the output and all the weights are of roughly equal size.

L2 regularization can learn complex data patterns.
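The shrinking behaviour of ridge can be sketched for a single feature with no intercept. Assuming the objective sum((y - b*x)^2) + lam * b^2, setting the derivative to zero gives b = sum(x*y) / (sum(x^2) + lam). The data and `lam` values below are made up.

```python
def ridge_1d(xs, ys, lam):
    """Minimiser of sum((y - b*x)^2) + lam * b^2 for a single coefficient b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1, 2, 3]
ys = [2, 4, 6]                       # exact relationship y = 2x
for lam in (0, 14, 1e6):
    print(lam, ridge_1d(xs, ys, lam))
```

With lambda = 0 the result is the OLS coefficient (2.0). Larger values shrink it (1.0 at lambda = 14) and a very large lambda pushes it close to zero, which is the under-fitting case, but it never becomes exactly zero.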

**Q8. What is R square(where to use and where not)?**

**Ans 8:**

**R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for** **multiple regression.**

The definition of R-squared is the percentage of the response variable variation that is explained by a linear model.

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%.

0% indicates that the model explains none of the variability of the response data around its mean.

100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.

There is a problem with R-squared. The problem arises when we ask ourselves this question: **Is it good to have as many independent variables as possible?**

The answer is no, because each independent variable should have a meaningful impact. But **even if we add independent variables which are not meaningful**, will it improve the R-squared value?

Yes, and this is the basic problem with R-squared. No matter how many junk, important, or impactful independent variables you add to your model, the R-squared value will never decrease; on the training data it will typically increase. So we need another measure, equivalent to R-squared, that penalizes our model for any junk independent variable.

So, we calculate the Adjusted R-squared, which adjusts the R-squared formula for the number of independent variables.
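A short sketch of both measures, using the standard formula Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of independent variables. The values are invented for the example.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R-squared for the number of independent variables used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

y_true = [1, 2, 3, 4, 5]
y_pred = [1.1, 1.9, 3.2, 3.9, 4.9]

r2 = r_squared(y_true, y_pred)
adj_one_feature = adjusted_r_squared(r2, n_samples=5, n_features=1)
adj_three_features = adjusted_r_squared(r2, n_samples=5, n_features=3)
print(r2, adj_one_feature, adj_three_features)
```

For the same R-squared, the adjusted value drops as more variables are used, so a junk variable that barely improves the fit lowers the adjusted score.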

**Q9. What is Mean Square Error?**

**Ans 9:**

**The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”), squaring them, and averaging the squared values.**

**Giving an intuition**

The line equation is **y = Mx + B**. We want to find the **M (slope)** and **B (y-intercept)** that minimize the squared error.
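As a minimal sketch, the mean squared error of a candidate line can be computed as follows. The points and the line y = 2x + 1 are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def predict_line(xs, m, b):
    return [m * x + b for x in xs]

xs = [0, 1, 2]
ys = [1, 3, 6]                       # the last point is 1 unit above the line
y_pred = predict_line(xs, m=2, b=1)  # predictions: 1, 3, 5

mse = mean_squared_error(ys, y_pred)
print(mse)                           # (0 + 0 + 1) / 3
```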

**Q10. Why Support Vector Regression? Difference between SVR and a simple regression model?**

**Ans 10:**

In simple linear regression, we try to minimize the error. In SVR, we instead try to fit the error within a certain threshold.

**Main Concepts:-**

**1. Boundary**

**2. Kernel**

**3. Support Vector**

**4. Hyper Plane**

**Blueline: Hyper Plane; Red Line: Boundary-Line**

Our best-fit line is the hyperplane that has the maximum number of points within its boundary. What we are trying to do here is to decide a decision boundary at distance ‘e’ from the original hyperplane such that the data points closest to the hyperplane, i.e. the support vectors, are within that boundary line.
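The "error within a threshold" idea corresponds to the epsilon-insensitive loss commonly used with SVR: errors smaller than the threshold cost nothing, and larger errors are penalized only by the amount they exceed it. A minimal sketch with made-up numbers:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube; grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

inside = epsilon_insensitive_loss(y_true=1.3, y_pred=1.0, epsilon=0.5)
outside = epsilon_insensitive_loss(y_true=2.0, y_pred=1.0, epsilon=0.5)
print(inside, outside)
```

A point 0.3 away from the prediction falls inside the tube and contributes no loss, while a point 1.0 away is charged only for the 0.5 by which it leaves the tube. An ordinary squared-error loss would penalize both.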


The post Important Interview Questions On Decision Tree Machine Learning Algorithm appeared first on Krish Naik.

The Decision Tree is a very important machine learning algorithm with which we can solve both classification and regression problems. The Decision Tree is also the base learner used in Bagging and Boosting techniques such as the Random Forest and XGBoost classification and regression algorithms.

All the important questions that can be asked in a Decision Tree are given below

Interview Questions:

- Decision Tree
- Entropy, Information Gain, Gini Impurity
- Decision Tree Working For Categorical and Numerical Features
- What are the scenarios where Decision Tree works well
- Decision Tree Low Bias And High Variance- Overfitting
- Hyperparameter Techniques
- Library used for constructing decision tree
- Impact of Outliers on Decision Tree
- Impact of missing values on Decision Tree
- Does Decision Tree require Feature Scaling

The first thing is to understand how a decision tree works and how we split it based on entropy, information gain and Gini impurity. You can check the videos below for the same.

**Entropy In Decision Tree**

**Information Gain Intuition**

**Gini Impurity**
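The three split criteria named above can be sketched in a few lines of plain Python using their standard definitions. The labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions p."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """G = 1 - sum(p^2) over the class proportions p."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A 50/50 node has the maximum entropy (1.0) and Gini impurity (0.5) for two classes. A split that produces pure children gains the full 1.0, while a split that leaves the class mix unchanged gains nothing.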

**And Finally you need to understand how to visualize Decision Tree**

There are no such assumptions

Advantages of Decision Tree

Clear Visualization: The algorithm is simple to understand, interpret and visualize, as the idea is mostly used in our daily lives. The output of a Decision Tree can be easily interpreted by humans.

Simple and easy to understand: A Decision Tree looks like simple if-else statements, which are very easy to understand.

Decision Tree can be used for both classification and regression problems.

Decision Tree can handle both continuous and categorical variables.

No feature scaling required: No feature scaling (standardization and normalization) required in case of Decision Tree as it uses rule based approach instead of distance calculation.

Handles non-linear parameters efficiently: Non-linear parameters don’t affect the performance of a Decision Tree, unlike curve-based algorithms. So, if there is high non-linearity between the independent variables, Decision Trees may outperform curve-based algorithms.

Decision Tree can automatically handle missing values.

Decision Tree is usually robust to outliers and can handle them automatically.

Less Training Period: Training period is less as compared to Random Forest because it generates only one tree unlike forest of trees in the Random Forest.

Disadvantages of Decision Tree

Overfitting: This is the main problem of the Decision Tree. It generally leads to overfitting of the data, which ultimately leads to wrong predictions. In order to fit the data (even noisy data), it keeps generating new nodes, and ultimately the tree becomes too complex to interpret. In this way, it loses its generalization capabilities. It performs very well on the training data but starts making a lot of mistakes on unseen data.

High variance: As mentioned above, a Decision Tree generally leads to overfitting of the data. Due to the overfitting, there is a high chance of high variance in the output, which leads to many errors in the final estimation and high inaccuracy in the results. In trying to achieve zero bias (overfitting), it ends up with high variance.

Unstable: Adding a new data point can lead to re-generation of the overall tree and all nodes need to be recalculated and recreated.

Not suitable for large datasets: If data size is large, then one single tree may grow complex and lead to overfitting. So in this case, we should use Random Forest instead of a single Decision Tree.

Does Decision Tree require feature scaling? No.

It is not sensitive to outliers. Since extreme values or outliers never cause much reduction in RSS, they are never involved in a split. Hence, tree-based methods are insensitive to outliers.

- Classification
- Regression

How to avoid overfitting

- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

- Confusion Matrix
- Precision,Recall, F1 score

- R2,Adjusted R2
- MSE,RMSE,MAE

Download All the materials from here


The post Important Interview Questions On Random Forest Machine Learning Algorithm appeared first on Krish Naik.

In this video we will discuss the important interview questions on the Random Forest algorithm.

- Decision Tree
- Entropy, Information Gain, Gini Impurity
- Decision Tree Working For Categorical and Numerical Features
- What are the scenarios where Decision Tree works well
- Decision Tree Low Bias And High Variance- Overfitting
- Hyperparameter Techniques
- Library used for constructing decision tree
- Impact of Outliers on Decision Tree
- Impact of missing values on Decision Tree
- Does Decision Tree require Feature Scaling

- Ensemble Techniques(Boosting And Bagging)
- Working of Random Forest Classifier
- Working of Random Forest Regressor
- Hyperparameter Tuning(Grid Search And RandomSearch)

Theoretical Understanding:

- Tutorial 37:Entropy In Decision Tree https://www.youtube.com/watch?v=1IQOtJ4NI_0
- Tutorial 38:Information Gain https://www.youtube.com/watch?v=FuTRucXB9rA
- Tutorial 39:Gini Impurity https://www.youtube.com/watch?v=5aIFgrrTqOw
- Tutorial 40: Decision Tree For Numerical Features: https://www.youtube.com/watch?v=5O8HvA9pMew
- How To Visualize DT: https://www.youtube.com/watch?v=ot75kOmpYjI

Theoretical Understanding:

- Ensemble technique(Bagging): https://www.youtube.com/watch?v=KIOeZ5cFZ50
- Random forest Classifier And Regressor https://www.youtube.com/watch?v=nxFG5xdpDto
- Construct Decision Tree And working in Random Forest: https://www.youtube.com/watch?v=WQ0iJSbnnZA&t=406s

Decision Tree—Low Bias And High Variance

Ensemble Bagging(Random Forest Classifier)–Low Bias And Low Variance

There are no such assumptions

Advantages of Random Forest

Less prone to overfitting than a single Decision Tree

Favourite algorithm for Kaggle competitions

Less Parameter Tuning required

Decision Tree can handle both continuous and categorical variables.

No feature scaling required: No feature scaling (standardization and normalization) is required in the case of Random Forest, as it uses Decision Trees internally.

Suitable for any kind of ML problems

Disadvantages of Random Forest

- Biased towards features having many categories
- Biased in multiclass classification problems towards more frequent classes.

No

Robust to Outliers

- Classification
- Regression

- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

- Confusion Matrix
- Precision,Recall, F1 score

- R2,Adjusted R2
- MSE,RMSE,MAE

Download the github material from here
