Why Negative Gradient in Gradient Descent

Solomon
2 min read · Nov 29, 2022
Photo by Bich Tran: https://www.pexels.com/photo/mountain-illustration-669987/

Gradient descent is widely used to find the parameters of a model by minimizing a loss function. The objective is to move from a random starting location toward the minimum, and the direction of travel is the negative of the gradient. There are several ways to explain why the negative gradient is the right direction; this article explains it in terms of the Taylor series.

We want to estimate f(x+ηd), i.e. the value of the function at the next step, using only the information available at the current step, and the objective is that f(x+ηd) should be less than f(x).

Here ‘x’ is the current location (the current parameter value), η is the learning rate, and ‘d’ is the direction of the step.

The Taylor series helps here: it expresses the value of the function at the future location (x+ηd) as a series of terms built from the value and derivatives at the current location x.

f(x+ηd) = f(x) + η*d*f′(x) + (η²*d²/2)*f″(x) + …
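
As a quick numerical sanity check, here is a minimal sketch (the example function f(x) = x², the point x = 3, and the learning rate 0.1 are assumptions chosen only for illustration) showing that the first-order Taylor term already predicts the change in f quite accurately:

```python
# Minimal sketch: compare the true change in f with the first-order
# Taylor prediction. The example f(x) = x^2 (so f'(x) = 2x) is assumed.
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x = 3.0          # current location
eta = 0.1        # learning rate (step size)
d = -f_prime(x)  # direction: negative gradient

true_change = f(x + eta * d) - f(x)
first_order = eta * d * f_prime(x)

print(true_change)   # -3.24
print(first_order)   # -3.6 (close to the true change; the gap is the eta^2 term)
```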

The objective is to have f(x+ηd) less than f(x), in other words, f(x+ηd) − f(x) should be less than zero.

Rearranging the terms:

f(x+ηd) − f(x) = η*d*f′(x) + (η²*d²/2)*f″(x) + …

For f(x+ηd) − f(x) to be less than zero, the R.H.S. must be less than zero, i.e.

η*d*f′(x) + (η²*d²/2)*f″(x) + … < 0

We can ignore the higher-order terms because the learning rate η is usually small, say 0.1, and its higher powers shrink rapidly (η² = 0.01, η³ = 0.001, …), so those terms are negligible compared with the first-order term.
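
To see how quickly the higher-order term fades, here is a small sketch (again assuming the example f(x) = x² at x = 3, so f″(x) = 2) comparing the first- and second-order terms as the learning rate shrinks:

```python
# Compare the first-order term eta*d*f'(x) with the second-order term
# (eta^2 * d^2 / 2) * f''(x) for the assumed example f(x) = x^2 at x = 3.
x = 3.0
grad = 2 * x        # f'(x)
second_deriv = 2.0  # f''(x)
d = -grad           # negative-gradient direction

for eta in (0.1, 0.01, 0.001):
    first_order = eta * d * grad
    second_order = (eta ** 2) * (d ** 2) / 2 * second_deriv
    print(eta, first_order, second_order)

# Approximate output:
# 0.1    -3.6    0.36
# 0.01   -0.36   0.0036
# 0.001  -0.036  0.000036
```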

f(x+ηd) − f(x) ≈ η*d*f′(x) < 0

η*d*f′(x) < 0

Here η is always positive, because it is the step size, and f′(x) is the derivative at the current point, whose sign depends on where we happen to be on the function.

So, to make the above inequality hold, the only quantity we are free to choose is the direction d. Picking d = −f′(x), the negative gradient, gives η*d*f′(x) = −η*(f′(x))², which is negative whenever f′(x) ≠ 0. This means f(x+ηd) − f(x) is negative, so taking a step in the direction of the negative gradient leads to a smaller value at the next step, which is what drives convergence.
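
Putting it together, here is a minimal gradient-descent sketch (again assuming the toy function f(x) = x², a starting point of 3, and a fixed learning rate of 0.1) showing that stepping along d = −f′(x) decreases f at every iteration:

```python
# Minimal gradient descent on the assumed toy function f(x) = x^2.
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x = 3.0    # starting location
eta = 0.1  # learning rate

for step in range(5):
    d = -f_prime(x)        # direction = negative gradient
    x = x + eta * d        # move a small step in that direction
    print(step, x, f(x))   # f(x) shrinks at every step

# Approximate output:
# 0  2.4      5.76
# 1  1.92     3.6864
# 2  1.536    2.3593
# 3  1.2288   1.5099
# 4  0.98304  0.9664
```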

Conclusion

Decreasing the function at every step does not always lead to the global minimum, because the function may have multiple local minima.

