Gradient Descent Extensions to Your Deep Learning Models

Learn about the different methods available and how to select the one best suited to your problem.



The objective of this article is to explore the different Gradient Descent extensions such as Momentum, Adagrad, RMSprop…

In previous articles, we studied three variants of gradient descent used to train Deep Learning models:

  • Gradient Descent
  • Stochastic Gradient Descent
  • Mini-Batch Stochastic Gradient Descent

Of these, we keep mini-batch because it is faster, since it does not have to calculate gradients and errors for the entire dataset at every update, and because it reduces the high variability of Stochastic Gradient Descent.
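To make the mini-batch idea concrete, here is a minimal sketch in plain Python. It is not taken from the previous articles: the toy linear-regression data, the function name `minibatch_sgd`, and the hyperparameter values are all assumptions chosen for illustration.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent (illustrative sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mean gradient of the squared error over this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noiseless data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w_fit, b_fit = minibatch_sgd(xs, ys)
print(w_fit, b_fit)
```

Each update only looks at `batch_size` samples, which is what makes it cheaper than a full pass over the dataset, while averaging over several samples smooths out the noise of single-sample updates.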

Well, there are improvements on these methods, such as Momentum, as well as more complex algorithms such as Adam, RMSprop or AdaGrad.

Let’s see them!


Momentum

Imagine being a kid again and having the great idea of putting on your skates, climbing up the steepest street and starting to go down it. You are a total beginner and this is only the second time you have worn skates.

I don’t know if any of you have ever really done this, but well, I have, so let me explain what happens:

  • When you start, your speed is low; you even seem to be in control and could stop at any time.
  • But the further down you go, the faster you move: this is called momentum. The more road you cover, the more inertia you carry and the faster you go.
  • Well, for those of you who are curious, the end of the story is that at the end of the steep street there is a fence. The rest you can imagine…

Well, the Momentum technique is precisely this. As we go down our loss curve, calculating the gradients and making the updates, we accumulate the past gradients into a velocity. Updates that keep pointing in the same direction, the one that reduces the loss, build up and gain importance, while those that keep changing direction tend to cancel out.

Figure by the Author

So, the result is to speed up the training of the network.

Also, thanks to the momentum, we may be able to avoid small potholes or holes in the road (flying over them thanks to the speed), which in our case means small local minima of the loss.
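As a rough illustration of the update described above, here is a minimal sketch of gradient descent with momentum in plain Python. The toy loss, the function names and the values of `lr` and `beta` are assumptions made for this example, not something prescribed by the technique.

```python
def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2),
    # assumed here only for illustration.
    return [w[0], 10.0 * w[1]]


def sgd_momentum(w, lr=0.05, beta=0.9, steps=200):
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        # Keep a fraction beta of the previous velocity and add the new gradient step.
        velocity = [beta * v - lr * gi for v, gi in zip(velocity, g)]
        # Move the weights along the accumulated velocity.
        w = [wi + v for wi, v in zip(w, velocity)]
    return w


w_momentum = sgd_momentum([5.0, 5.0])
print(w_momentum)
```

Setting `beta=0.0` turns this back into plain gradient descent, which is a quick way to compare both behaviours on the same loss.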


Nesterov Momentum

Going back to the earlier example: we are going down the road at full speed (because we have built up a lot of momentum) and suddenly we see the end of it. We would like to be able to brake, to slow down and avoid crashing. Well, this is precisely what Nesterov does.

Nesterov calculates the gradient, but instead of doing it at the current point, it does it at the point where we know our momentum is going to take us, and then applies a correction.

Figure by Author

Notice that using standard momentum, we calculate the gradient (small orange vector) and then take a big step in the direction of the gradient (large orange vector).

Using Nesterov, we first make a big jump in the direction of our previously accumulated gradient (green vector), then measure the gradient at that point and make the appropriate correction (red vector).

In practice, it works a little better than momentum alone. It’s like calculating the gradient of the weights in the future (because we have already added the momentum we had accumulated).
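The look-ahead idea can be sketched in a few lines of plain Python. This is one common way of writing Nesterov momentum, shown under the assumption of a toy quadratic loss; the names `grad` and `nesterov_momentum` and the hyperparameter values are illustrative choices.

```python
def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2),
    # assumed here only for illustration.
    return [w[0], 10.0 * w[1]]


def nesterov_momentum(w, lr=0.05, beta=0.9, steps=200):
    velocity = [0.0] * len(w)
    for _ in range(steps):
        # Jump first to where the current velocity would take us.
        lookahead = [wi + beta * v for wi, v in zip(w, velocity)]
        # Measure the gradient at that future point, not at the current one.
        g = grad(lookahead)
        # Correct the velocity with the look-ahead gradient, then move.
        velocity = [beta * v - lr * gi for v, gi in zip(velocity, g)]
        w = [wi + v for wi, v in zip(w, velocity)]
    return w


w_nesterov = nesterov_momentum([5.0, 5.0])
print(w_nesterov)
```

The only difference from standard momentum is the point at which `grad` is evaluated.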


Both Nesterov’s momentum and standard momentum are extensions of SGD.

The methods that we are going to see now are based on adaptive learning rates, allowing us to accelerate or slow down the speed with which we update the weights. For example, we could use a high speed at the beginning, and lower it as we approach the minimum.

Adaptive gradient (AdaGrad)

AdaGrad keeps a history of the calculated gradients (in particular, the sum of the squared gradients of each parameter) and uses it to normalize the “step” of the update.

The intuition behind it is that it identifies the parameters with a very high gradient, whose weight updates would be very abrupt, and assigns them a lower learning rate to mitigate this abruptness.

At the same time, the parameters that have a very low gradient will be assigned a high learning rate.

In this way, we manage to accelerate the convergence of the algorithm.
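A minimal sketch of this per-parameter normalization, in plain Python, could look as follows. The toy loss, the names `grad` and `adagrad`, and the values of `lr` and `eps` are assumptions for the example.

```python
import math


def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2),
    # assumed here only for illustration. The second parameter has a
    # gradient ten times larger than the first one.
    return [w[0], 10.0 * w[1]]


def adagrad(w, lr=1.0, eps=1e-8, steps=500):
    cache = [0.0] * len(w)  # running sum of squared gradients, one per parameter
    for _ in range(steps):
        g = grad(w)
        cache = [c + gi * gi for c, gi in zip(cache, g)]
        # Each parameter's step is divided by the root of its own history,
        # so parameters with large gradients get a smaller effective rate.
        w = [wi - lr * gi / (math.sqrt(c) + eps) for wi, gi, c in zip(w, g, cache)]
    return w


w_adagrad = adagrad([5.0, 5.0])
print(w_adagrad)
```

Note that `cache` only ever grows, which is the property discussed in the next section.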

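You can learn more about the theory behind this technique in its original paper, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization” by Duchi, Hazan and Singer (2011).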


RMSprop

The problem with AdaGrad is that the sum of the squared gradients is a monotonically increasing function. Since the update is divided by the root of that sum, the effective learning rate keeps shrinking until it is practically zero, at which point learning stops.

What RMSprop proposes is to let that sum of squared gradients decay using a decay_rate, so that it behaves like a moving average of recent gradients instead of growing without bound.
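Here is a minimal sketch of that idea in plain Python. The `decay_rate` name follows the text; the toy loss, the function names and the remaining hyperparameter values are assumptions made for illustration.

```python
import math


def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2),
    # assumed here only for illustration.
    return [w[0], 10.0 * w[1]]


def rmsprop(w, lr=0.01, decay_rate=0.9, eps=1e-8, steps=2000):
    cache = [0.0] * len(w)  # moving average of squared gradients
    for _ in range(steps):
        g = grad(w)
        # Old squared gradients are forgotten at rate decay_rate,
        # so the cache no longer grows without bound.
        cache = [decay_rate * c + (1.0 - decay_rate) * gi * gi
                 for c, gi in zip(cache, g)]
        w = [wi - lr * gi / (math.sqrt(c) + eps) for wi, gi, c in zip(w, g, cache)]
    return w


w_rmsprop = rmsprop([5.0, 5.0])
print(w_rmsprop)
```

The only change with respect to an accumulated sum of squared gradients is the line that updates `cache`.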

RMSprop was never published as a formal paper; it was proposed by Geoffrey Hinton in his Coursera course “Neural Networks for Machine Learning”.


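Adam

Finally, Adam is one of the most modern algorithms, and it improves on RMSprop by adding momentum to the update rule. It introduces 2 new parameters, beta1 and beta2, with recommended values of 0.9 and 0.999. The first one controls the moving average of the gradients (the momentum part) and the second one controls the moving average of the squared gradients (the RMSprop part).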
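The combination of both ideas can be sketched in plain Python as follows. The `beta1` and `beta2` defaults are the recommended values mentioned above; the toy loss, the function names and the values of `lr`, `eps` and `steps` are assumptions for this example.

```python
import math


def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2),
    # assumed here only for illustration.
    return [w[0], 10.0 * w[1]]


def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    m = [0.0] * len(w)  # moving average of gradients (momentum part)
    v = [0.0] * len(w)  # moving average of squared gradients (RMSprop part)
    for t in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: both averages start at zero, so they are
        # rescaled to compensate during the first iterations.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w


w_adam = adam([5.0, 5.0])
print(w_adam)
```

In practice you would rarely write this by hand, since deep learning frameworks ship their own Adam implementations, but the sketch shows where `beta1` and `beta2` enter the update.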

You can check out its paper: “Adam: A Method for Stochastic Optimization” by Kingma and Ba.

But then, which one should we use?

Source: original ADAM paper

As a rule of thumb, the recommendation is to start with Adam. If it does not work well, then you can try and tune the rest of the techniques. But most of the time, Adam works great.


Final Words

As always, I hope you enjoyed the post!

If you liked this post, you can take a look at my other posts on Data Science and Machine Learning.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence, follow me on Medium and stay tuned for my next posts!