 # Understanding All Optimizers In Deep Learning

Many people train neural networks with an optimizer without realizing that the underlying method is called optimization. Optimizers are algorithms or methods that change the attributes of your neural network, such as its weights and learning rate, in order to reduce the loss.

How the weights or learning rate of your neural network should be changed to reduce the loss is defined by the optimizer you use.

Optimization algorithms and strategies are responsible for reducing the loss and providing the most accurate results possible.

In this blog, we will learn about different types of optimization algorithms along with their advantages and disadvantages.

### Gradient Descent

Gradient descent is an iterative machine learning optimization algorithm that reduces the cost function, which helps models make more accurate predictions.

The gradient points in the direction of steepest increase. Since we want to find the minimum point in the valley, we need to move in the opposite direction of the gradient, so we update the parameters in the negative gradient direction to minimize the loss:

θ = θ − η·∇J(θ; x, y)

where θ is the weight parameter, η is the learning rate, and ∇J(θ; x, y) is the gradient of the loss with respect to θ.
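As a concrete illustration, here is a minimal sketch of this update rule on a toy linear-regression problem. The data, names, learning rate, and iteration count are all made up for illustration:

```python
import numpy as np

# Toy data: y = 3*x + 0.5 plus a little noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0
eta = 0.1  # learning rate (η)
for _ in range(500):
    y_hat = w * x + b
    # Gradients of J(θ) = mean((y_hat - y)^2) with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    # Step in the negative gradient direction: θ = θ - η * ∇J(θ; x, y)
    w -= eta * grad_w
    b -= eta * grad_b
```

After the loop, `w` and `b` end up close to the true slope and intercept, since each step moves the parameters downhill on the loss surface.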

Advantages:

a. Easy computation.

b. Easy to implement.

c. Easy to understand.

Disadvantages:

a. May get trapped at local minima.

b. Weights are changed only after calculating the gradient on the whole dataset, so if the dataset is too large it may take a very long time to converge to the minima.

Different types of Gradient descent are:

### Batch Gradient Descent

In batch gradient descent, we use the entire dataset to compute the gradient of the cost function for each iteration of gradient descent and then update the weights.

Since we use the entire dataset to compute the gradient, convergence is slow.

If the dataset is huge, with millions or billions of data points, it is both memory- and compute-intensive.

Advantages:

• Theoretical analysis of weights and convergence rates is easy to understand.

Disadvantages:

• Performs redundant computations for the same training examples on large datasets.
• Can be very slow and intractable, as large datasets may not fit in memory.
• Since the entire dataset is used for each computation, the model cannot be updated online with new data.

### Stochastic Gradient Descent (SGD)

SGD is a variant of gradient descent that updates the model's parameters more frequently: the parameters are altered after computing the loss on each individual training example. So if the dataset contains 1000 rows, SGD updates the model parameters 1000 times in one pass over the dataset, instead of once as in batch gradient descent.

θ = θ − η·∇J(θ; x(i), y(i)), where {x(i), y(i)} is a single training example.

Because the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity.
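A minimal sketch of this per-example update on a toy regression problem (all names and data here are illustrative, not from the article):

```python
import numpy as np

# Toy noiseless data: 1000 rows means 1000 parameter updates per epoch.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = 3.0 * x + 0.5

w, b = 0.0, 0.0
alpha = 0.05  # learning rate
for epoch in range(5):
    order = rng.permutation(len(x))  # shuffle examples each epoch
    for i in order:
        y_hat = w * x[i] + b
        # Gradient of the squared loss on this single example (x_i, y_i)
        grad_w = 2 * (y_hat - y[i]) * x[i]
        grad_b = 2 * (y_hat - y[i])
        # θ = θ - η * ∇J(θ; x_i, y_i)
        w -= alpha * grad_w
        b -= alpha * grad_b
```

Each individual update is noisy, but over many examples the parameters still drift toward the minimum.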

Advantages:

• Frequent updates of model parameters, hence convergence in less time.
• Requires less memory, as there is no need to store loss values for the entire dataset.
• May find new minima.

Disadvantages:

• High variance in model parameters.
• May overshoot even after reaching the global minimum.
• To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.

### Mini-Batch Gradient Descent

Mini-batch gradient descent is a variation of gradient descent in which each batch contains more than one sample but fewer than the entire dataset. It is widely used, converges faster, and is more stable; the batch size can vary depending on the dataset. Because each batch mixes different samples, it reduces the noise, i.e. the variance of the weight updates, which makes for more stable and faster convergence.
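The mini-batch loop can be sketched as follows; the batch size of 32, learning rate, and toy data are illustrative assumptions:

```python
import numpy as np

# Toy noiseless data; gradients are averaged over batches of 32 samples,
# trading per-update noise against compute per update.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = 3.0 * x + 0.5
batch_size, eta = 32, 0.1

w, b = 0.0, 0.0
for epoch in range(20):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        y_hat = w * x[idx] + b
        # Average the gradient over the mini-batch
        grad_w = 2 * np.mean((y_hat - y[idx]) * x[idx])
        grad_b = 2 * np.mean(y_hat - y[idx])
        w -= eta * grad_w
        b -= eta * grad_b
```

Averaging over the batch smooths the update direction compared with single-example SGD, while still updating far more often than full-batch gradient descent.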

Plain gradient descent alone does not get us all the way there; this is where the optimizer comes in. Optimizers shape and mold the model into its most accurate possible form by updating the weights, while the loss function guides the optimizer by telling it whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

Advantages:

• Frequent updates of the model parameters, with less variance than SGD.
• Requires a medium amount of memory.

All types of Gradient Descent have some challenges:

• Choosing an optimal value for the learning rate: if the learning rate is too small, gradient descent may take ages to converge.
• Using a constant learning rate for all parameters: there may be some parameters we do not want to change at the same rate.
• May get trapped at local minima.

### Other Types of Optimizers

It’s difficult to overstate how popular gradient descent really is, and it’s used across the board even up to complex neural net architectures (backpropagation is basically gradient descent implemented on a network). There are other types of optimizers based on gradient descent that are used though, and here are a few of them:

### AdaGrad

Another optimization strategy is AdaGrad. The idea is to keep a running sum of squared gradients during optimization. There is no momentum term here, but an expression g that is the sum of the squared gradients.

When we update a weight parameter, we divide the current gradient by the root of that term g. To explain the intuition behind AdaGrad, imagine a loss function in a two-dimensional space in which the gradient of the loss function in one direction is very small and very high in the other direction.

Summing up the gradients along the axis where the gradients are small causes the squared sum of these gradients to become even smaller. If during the update step, we divide the current gradient by a very small sum of squared gradients g, the result of that division becomes very high and vice versa for the other axis with high gradient values.

As a result, we force the algorithm to make updates in all directions in roughly the same proportions.

This means that we accelerate the update process along the axis with small gradients by increasing the gradient along that axis. On the other hand, the updates along the axis with the large gradient slow down a bit.

However, there is a problem with this optimization algorithm. Consider what happens to the sum of the squared gradients when training takes a long time: the term keeps growing. When the current gradient is divided by this large number, the update step for the weights becomes very small, as if we were using a very low learning rate that becomes even lower the longer training goes on. In the worst case, AdaGrad gets stuck and training effectively stops.
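The mechanism described above can be sketched on a toy 2-D loss whose gradient is much larger along one axis than the other; the loss function, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

# Toy quadratic loss J(θ) = 50*θ0² + 0.5*θ1²: steep in θ0, shallow in θ1.
def loss_grad(theta):
    return np.array([100.0 * theta[0], 1.0 * theta[1]])

theta = np.array([1.0, 1.0])
g_sum = np.zeros(2)          # running sum of squared gradients (g)
eta, eps = 0.5, 1e-8
for _ in range(200):
    g = loss_grad(theta)
    g_sum += g ** 2          # accumulate g² per parameter
    # Divide the current gradient by the root of the accumulated sum:
    # the shallow axis gets a small denominator (bigger step), the steep
    # axis a large denominator (smaller step), equalizing progress.
    theta -= eta * g / (np.sqrt(g_sum) + eps)
```

Both coordinates shrink toward zero at a similar pace despite the 100× difference in gradient scale, which is exactly the equalizing effect described above.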

Advantages:

• The learning rate changes adaptively for each training parameter.
• No need to manually tune the learning rate.
• Able to train on sparse data.

Disadvantages:

• Computationally expensive, as it must compute and store an accumulator for every parameter.
• The learning rate is always decreasing, which results in slow training.

### AdaDelta

AdaDelta is an extension of AdaGrad that removes its decaying learning-rate problem. Instead of accumulating all previously squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size w, using an exponentially decaying moving average rather than the sum of all gradients:

E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t)

We set γ to a similar value as the momentum term, around 0.9.
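A minimal sketch of this moving-average accumulator in an update loop (this is the RMSprop-style half of the method; full AdaDelta additionally replaces η with a second moving average over squared parameter updates, which is omitted here, and the toy loss and hyperparameters are illustrative):

```python
import numpy as np

# Toy quadratic loss with very different curvature per axis.
def loss_grad(theta):
    return np.array([100.0 * theta[0], 1.0 * theta[1]])

theta = np.array([1.0, 1.0])
eg2 = np.zeros(2)                  # E[g²]
gamma, eta, eps = 0.9, 0.1, 1e-8
for _ in range(300):
    g = loss_grad(theta)
    # E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t): a decaying window, so old
    # gradients fade out instead of accumulating forever as in AdaGrad.
    eg2 = gamma * eg2 + (1 - gamma) * g ** 2
    theta -= eta * g / (np.sqrt(eg2) + eps)
```

Because the denominator tracks only recent gradients, the effective step size no longer shrinks toward zero over time.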

Advantages:

• The learning rate no longer decays toward zero, so training does not stall.

Disadvantages:

• Computationally expensive.

### Adam

Adam (Adaptive Moment Estimation) works with first- and second-order momentum. The intuition behind Adam is that we don't want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients, M(t).

M(t) and V(t) are the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.

Here, we bias-correct M(t) and V(t) so that E[m̂(t)] equals E[g(t)], where E[f(x)] denotes the expected value of f(x).

To update the parameters:

θ(t+1) = θ(t) − η·m̂(t) / (√v̂(t) + ϵ)

where m̂(t) and v̂(t) are the bias-corrected first and second moments.

Typical values are 0.9 for β1, 0.999 for β2, and 10⁻⁸ for ϵ.
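Putting the moments, bias correction, and update together gives the following sketch; the toy loss, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

# Toy quadratic loss, steep in θ0 and shallow in θ1.
def loss_grad(theta):
    return np.array([100.0 * theta[0], 1.0 * theta[1]])

theta = np.array([1.0, 1.0])
m = np.zeros(2)              # first moment M(t): mean of gradients
v = np.zeros(2)              # second moment V(t): uncentered variance
beta1, beta2, eta, eps = 0.9, 0.999, 0.005, 1e-8
for t in range(1, 3001):
    g = loss_grad(theta)
    m = beta1 * m + (1 - beta1) * g          # decaying average of g
    v = beta2 * v + (1 - beta2) * g ** 2     # decaying average of g²
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # θ(t+1) = θ(t) − η · m̂(t) / (√v̂(t) + ϵ)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
```

The momentum term m smooths the direction of travel while v rescales each parameter's step, combining the benefits of momentum and adaptive learning rates.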

Advantages:

• The method is very fast and converges rapidly.
• Rectifies the vanishing learning rate and high variance.

Disadvantages:

• Computationally costly.

### What is the best Optimization Algorithm for Deep Learning?

Finally, we can discuss the question of what the best gradient descent algorithm is.

In general, plain gradient descent is more than adequate for simpler tasks. If you are not satisfied with your model's accuracy, you can try RMSprop or add a momentum term to your gradient descent algorithm.

But in my experience, the best optimization algorithm for neural networks out there is Adam. It works very well for almost any deep learning problem you will ever encounter, especially if you set the hyperparameters to the following values:

• β1=0.9
• β2=0.999
• Learning rate = 0.001–0.0001

… this would be a very good starting point for any problem and virtually every type of neural network architecture I’ve ever worked with.

That’s why Adam Optimizer is my default optimization algorithm for every problem I want to solve. Only in very few cases do I switch to other optimization algorithms that I introduced earlier.

In this sense, I recommend that you always start with the Adam optimizer, regardless of the neural network architecture or the problem domain you are dealing with.

### Conclusions

Adam is the best all-round optimizer: if you want to train a neural network in less time and more efficiently, Adam is the optimizer to use.

For sparse data use the optimizers with a dynamic learning rate.

If you want to use a gradient descent algorithm, then mini-batch gradient descent is the best option.

I hope you liked the article and that it gave you a good intuition for the different behaviors of different optimization algorithms.
