Theory of Optimizers

Solomon

This article describes the different optimizers used in machine learning.

Optimizer

Given a loss surface, the optimizer finds the parameters with minimum loss. This is achieved by the idea below:

New Weight = Old Weight - learning_rate * gradient_at_that_point

To illustrate, consider the error surface of a single parameter, which is a parabola. If the gradient at the current weight is positive, the update moves the weight in the negative direction, and if the gradient is negative, the update moves it in the positive direction; in both cases, subtracting the learning-rate-scaled gradient moves the weight towards the minimum. This approach is called Gradient Descent.
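As a minimal sketch of this rule (the parabola, starting weight, and variable names below are illustrative assumptions, not from the article):

```python
# Gradient descent on the one-parameter parabola loss(w) = (w - 3)**2,
# whose gradient at w is 2 * (w - 3).
def gradient(w):
    return 2.0 * (w - 3.0)

w = 10.0              # starting weight
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * gradient(w)   # New Weight = Old Weight - lr * gradient

print(round(w, 4))    # approaches the minimum at w = 3
```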

Gradient Descent

In Gradient Descent, the weight update happens only after iterating over all the datapoints. A running sum of gradients is accumulated over the entire dataset, and the weights are updated once per epoch. This is a time-consuming process because real-world datasets are huge, so weight updates happen very slowly. The workaround is to update the weights for each datapoint, which is called Stochastic Gradient Descent.
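A minimal sketch of full-batch Gradient Descent on a toy one-weight linear model (the synthetic data and names here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 4.0 * x + rng.normal(scale=0.1, size=100)   # true weight is roughly 4.0

w, lr = 0.0, 0.1
for epoch in range(100):
    grad = np.mean(2 * (w * x - y) * x)   # gradient accumulated over ALL datapoints
    w = w - lr * grad                     # a single update per epoch
print(round(w, 3))                        # close to the true weight
```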

Stochastic Gradient Descent

The weight update in this case happens for each datapoint, which means the weights at any instant depend on the current datapoint and not on the overall dataset. Weight updates happen immediately, but the issue is that they are noisier because of the stochastic nature of updating on a single datapoint. To reduce this stochasticity and also to speed up convergence, Mini-Batch Gradient Descent comes into the picture.
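The same toy problem with Stochastic Gradient Descent, sketched under the same assumptions (one update per datapoint):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 4.0 * x + rng.normal(scale=0.1, size=100)

w, lr = 0.0, 0.05
for epoch in range(5):
    for xi, yi in zip(x, y):
        grad = 2 * (w * xi - yi) * xi   # gradient of the current datapoint only
        w = w - lr * grad               # immediate, but noisy, update
print(round(w, 3))                      # noisy estimate near the true weight
```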

Mini Batch Gradient Descent

Instead of updating the weights after computing the gradient of each datapoint, the idea is to update the weights after accumulating the gradients of 'm' datapoints, called a batch. Because each update uses a batch, which is a sample of the entire dataset, much of the stochasticity is removed while the convergence speed is retained, since this is not as slow as Gradient Descent, which performs one update only after iterating over all datapoints.
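A sketch of Mini-Batch Gradient Descent on the same toy data, assuming a batch size of m = 16:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 4.0 * x + rng.normal(scale=0.1, size=100)

w, lr, m = 0.0, 0.1, 16
for epoch in range(20):
    for start in range(0, len(x), m):
        xb, yb = x[start:start + m], y[start:start + m]
        grad = np.mean(2 * (w * xb - yb) * xb)   # gradient averaged over the batch
        w = w - lr * grad                        # one update per batch
print(round(w, 3))
```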

Momentum Based Gradient Descent

There is a chance that the solution gets stuck in a plateau region when the weight update happens at a constant rate. To overcome this, an exponentially weighted average of past gradients is combined with the current gradient to update the weights; this is called 'Momentum'. It has been shown experimentally that Momentum-based Gradient Descent converges faster than vanilla Gradient Descent. The disadvantage is that, because of the larger steps, it tends to oscillate between positive and negative slopes while finding the minimum, and Nesterov had an excellent idea to resolve this.
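One common form of the momentum update, sketched on the earlier parabola example (beta and the other values are typical choices, not from the article):

```python
def gradient(w):
    return 2.0 * (w - 3.0)        # gradient of the parabola (w - 3)**2

w, velocity = 10.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    velocity = beta * velocity + gradient(w)   # exponentially weighted history of gradients
    w = w - lr * velocity                      # step uses the accumulated velocity
print(round(w, 3))                             # approaches w = 3, after some oscillation
```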

Nesterov Momentum

Yurii Nesterov proposed that instead of taking a step based on the gradient at the current weights, we can compute the gradient at the look-ahead point that the momentum is about to carry us to, and use that gradient to update the current weights. The idea is that if the gradient at the look-ahead point changes sign, a much smaller step is taken, which avoids oscillation.
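A sketch of the Nesterov (look-ahead) variant under the same assumptions; the only change from plain momentum is where the gradient is evaluated:

```python
def gradient(w):
    return 2.0 * (w - 3.0)                     # gradient of the parabola (w - 3)**2

w, velocity = 10.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    lookahead = w - lr * beta * velocity               # where momentum alone would take us
    velocity = beta * velocity + gradient(lookahead)   # gradient at the look-ahead point
    w = w - lr * velocity
print(round(w, 3))                             # approaches w = 3 with smaller oscillations
```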

Adagrad

Adagrad stands for Adaptive Gradient, i.e. Gradient Descent with an adaptive learning rate. When a feature is sparse for most of the datapoints, its weight updates happen very slowly, which can make the feature appear unimportant even if it is highly correlated with the target. So the idea is to have a separate learning rate for each weight and to scale each update inversely proportional to the history of squared gradients for that weight. That is, if the gradients so far have been large, a low effective learning rate is used, and if a weight has received very few updates, its learning rate is increased. The disadvantage of this approach is that the accumulated squared gradients only ever grow, so the effective learning rate keeps shrinking; updates for that weight slow down, and it is observed that the optimizer gets stuck near the point of convergence, which is where RMSprop comes to the rescue.
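A minimal sketch of the Adagrad update on a simple two-weight quadratic (all values here are illustrative; eps is the usual small constant that avoids division by zero):

```python
import numpy as np

w = np.array([10.0, -5.0])
cache = np.zeros_like(w)       # running sum of squared gradients, per weight
lr, eps = 0.5, 1e-8
for _ in range(500):
    grad = 2.0 * (w - 3.0)                      # gradient of a quadratic bowl centred at 3
    cache += grad ** 2                          # history only ever grows
    w = w - lr * grad / (np.sqrt(cache) + eps)  # per-weight effective learning rate
print(np.round(w, 2))                           # approaches 3, but the steps keep shrinking
```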

RMSprop

Instead of scaling the updates inversely proportional to the running sum of squared gradients, RMSprop uses an exponentially weighted average of the squared gradients, i.e. it gives more importance to the most recent gradients. For instance, if the past few gradients have been large, the effective learning rate is decreased by that factor, and vice versa.
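The RMSprop version of the same sketch; the only change is that the growing sum is replaced by an exponentially weighted average of squared gradients (beta around 0.9 is the common choice):

```python
import numpy as np

w = np.array([10.0, -5.0])
avg_sq = np.zeros_like(w)      # exponentially weighted average of squared gradients
lr, beta, eps = 0.05, 0.9, 1e-8
for _ in range(300):
    grad = 2.0 * (w - 3.0)
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2   # recent gradients dominate
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
print(np.round(w, 2))                                 # ends up near 3 without stalling
```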

Adam

The Adam optimizer adopts the advantages of both RMSprop and Momentum. The learning rate is scaled by the second moment (an exponentially weighted average of the squared gradients) and the step is based on the first moment (an exponentially weighted average of the gradients).
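A sketch of the Adam update on the same toy quadratic, with commonly used default-style hyperparameters and the usual bias correction:

```python
import numpy as np

w = np.array([10.0, -5.0])
m = np.zeros_like(w)           # first moment: average of gradients
v = np.zeros_like(w)           # second moment: average of squared gradients
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
print(np.round(w, 2))                           # ends up close to the minimum at 3
```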

Conclusion

Choosing an optimizer for a model is purely a hyperparameter choice, but commonly the first thing to try is Adam, as it incorporates the advantages of the optimizers we have discussed so far.

