Gradient descent is by far the most popular optimization strategy, used in machine learning and deep learning at the moment. It is used while training your model, can be combined with every algorithm and is easy to understand and implement. Therefore, everyone who works with Machine Learning should understand it’s concept. After reading this posts you will understand how Gradient Descent works, what types of it are used today and what are their advantages and tradeoffs.
Table of Contents:
- Introduction
- What is a Gradient?
- How it works
- Learning Rate
- How to make sure that it works properly
- Types of Gradient Descent
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini Batch Gradient Descent
- Summary
Introduction
Gradient Descent is used while training a machine learning model. It is an optimization algorithm, based on a convex function, that tweaks it’s parameters iteratively to minimize a given function to its local minimum.
It is simply used to find the values of a functions parameters (coefficients) that minimize a cost function as far as possible.
You start by defining the initial parameters values and from there on Gradient Descent iteratively adjusts the values, using calculus, so that they minimize the given cost-function. But to understand it’s concept fully, you first need to know what a gradient is.
What is a Gradient?
“A gradient measures how much the output of a function changes if you change the inputs a little bit.” – Lex Fridman (MIT)
It simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. Said it more mathematically, a gradient is a partial derivative with respect to its inputs.
Imagine a blindfolded man who wants to climb a hill, with the fewest steps possible. He just starts climbing the hill by taking really big steps in the steepest direction, which he can do, as long as he is not close to the top. As he comes further to the top, he will do smaller and smaller steps, since he doesn’t want to overshoot it. This process can be described mathematically, using the gradient.
Just take a look at the picture above. Imagine it illustrates our hill from a top-down view, where the red arrows show the steps of our climber.
Think of a gradient in this context as a vector that contains the direction of the steepest step the blindfolded man can go and also how long this step should be.
Note that the gradient ranging from X0 to X1 is much longer than the one reaching from X3 to X4. This is because the steepness/slope of the hill is less there, which determines the length of the vector. This perfectly represents the example of the hill, because the hill is getting less steep, the higher you climb it. Therefore a reduced gradient goes along with a reduced slope and a reduced step-size for the hill climber.
How it works
Gradient Descent can be thought of climbing down to the bottom of a valley, instead of climbing up a hill. This is because it is a minimization algorithm that minimizes a given function.
The equation below describes what Gradient Descent does: „b“ describes the next position of our climber, while „a“ represents his current position. The minus sign refers to the minimization part of gradient descent. The „gamma“ in the middle is a waiting factor and the gradient term ( Δf(a) ) is simply the direction of the steepest descent.
So this formula basically tells you the next position where you need to go, which is the direction of the steepest descent.
But to be sure that you fully grasp the concept, we will go through another example.
Imagine you are dealing with a machine learning problem and you want to train an algorithm with gradient descent to minimize your cost-function J(w, b) and reach its local minimum by tweaking its parameters (w and b).
Let’s take a look at the picture below, which is an illustration of Gradient Descent. The horizontal axes represent the parameters (w and b) and the cost function J(w, b) is represented on the vertical axes. You can also see in the image that gradient descent is a convex function.
Like you already know we want to find the values of W and B that correspond to the minimum of the cost function (marked with the red arrow). To start with finding the right values we initialize the values of W and B with some random numbers and Gradient Descent then starts at that point (somewhere around the top of our illustration). Then it takes one step after another in the steepest downside direction (e.g. from the top to the bottom of the illustration) till it reaches the point where the cost function is as small as possible.
Importance of the Learning Rate
How big the steps are that Gradient Descent takes into the direction of the local minimum are determined by the so-called learning rate. It determines how fast or slow we will move towards the optimal weights.
In order for Gradient Descent to reach the local minimum, we have to set the learning rate to an appropriate value, which is neither too low nor too high.
This is because if the steps it takes are too big, it maybe will not reach the local minimum because it just bounces back and forth between the convex function of gradient descent like you can see on the left side of the image below. If you set the learning rate to a very small value, gradient descent will eventually reach the local minimum but it will maybe take too much time like you can see on the right side of the image.
This is the reason why the learning rate should be neither too high nor too low. You can check if you’re learning rate is doing well by plotting the learning rate on a graph, which we will discuss in the section below.
How to make sure that it works properly
A good way to make sure that Gradient Descent runs properly is by plotting the cost function as Gradient Descent runs. You put the number of iterations on the x-axes and the value of the cost-function at the y-axes. This enables you to see the value of your cost function after each iteration of gradient descent. This lets you easily spot how appropriate your learning rate is. You just try different values for it and plot them all together.
You can see such a plot below on the left and the image on the right shows the difference between good and bad learning rates:
If gradient descent is working properly, the cost function should decrease after every iteration.
When Gradient Descent can’t decrease the cost-function anymore and remains more or less on the same level, we say it has converged. Note that the number of iterations that Gradient Descent needs to converge can sometimes vary a lot. It can take 50 iterations, 60,000 or maybe even 3 million. Therefore the number of iterations is hard to estimate in advance.
Out there are also some algorithms that can tell you automatically if Gradient Descent has converged but you need to define a threshold for the convergence beforehand which is also pretty hard to estimate. This is why these simple plots are the preferred convergence test.
Another advantage of monitoring Gradient Descent via plots is that you can easily spot if it doesn’t work properly, like for example if the cost function is increasing. Mostly the reason for an increasing cost-function, while using Gradient Descent, is that you’re learning rate is too high.
If you see in the plot that your learning curve is just going up and down, without really reaching a lower point, you also should try to decrease the learning rate. By the way, when you’re starting out with gradient descent on a given problem, just simply try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1 etc. as it’s learning rates and look which one performs the best.
Types of Gradient Descent
Out there are three popular types of Gradient Descent, that mainly differ in the amount of data they use. We go through them one by one.
Batch Gradient Descent
Batch Gradient Descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but only after all training examples have been evaluated, the model gets updated. This whole process is like a cycle and called a training epoch.
Advantages of it are that it’s computational efficient, it produces a stable error gradient and a stable convergence. Batch Gradient Descent has the disadvantage that the stable error gradient can sometimes result in a state of convergence that isn’t the best the model can achieve. It also requires that the entire training dataset is in memory and available to the algorithm.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) in contrary, does this for each training example within the dataset. This means that it updates the parameters for each training example, one by one. This can make SGD faster than Batch Gradient Descent, depending on the problem. One advantage is that the frequent updates allow us to have a pretty detailed rate of improvement.
The thing is that the frequent updates are more computationally expensive as the approach of Batch Gradient Descent. The frequency of those updates can also result in noisy gradients, which may cause the error rate to jump around, instead of slowly decreasing.
Mini Batch Gradient Descent
Mini-batch Gradient Descent is the go-to method since it’s a combination of the concepts of SGD and Batch Gradient Descent. It simply splits the training dataset into small batches and performs an update for each of these batches. Therefore it creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like for any other machine learning techniques, there is no clear rule, because they can vary for different applications. Note that it is the go-to algorithm when you are training a neural network and it is the most common type of gradient descent within deep learning.
Summary
In this post, you learned a lot about Gradient Descent. You now know the basic terms of it and you have an understanding of how the algorithm works behind the scenes. Furthermore, you learned why the learning rate is it’s most important hyperparameter and how you can check if your algorithm learns properly. Lastly, you learned about the three most common types of Gradient Descent and their advantages and disadvantages. This knowledge enables you to train your models properly.