In this post you will learn about the most common activation functions in deep learning and when you should use them. You will also discover why you almost always need non-linear activation functions.
It is important to know which activation functions to use within your neural network. Be aware that you can use different activation functions at different layers. In my previous posts I only used the sigmoid function, but other functions often work much better.
tanh
An activation function that almost always works better than the sigmoid function is the tanh activation function.
Mathematically, the tanh function is a shifted and scaled version of the sigmoid function. The sigmoid function maps values to the range between 0 and 1, while the tanh function maps them between -1 and 1.
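Here is a minimal NumPy sketch of that relationship (the function names are my own): tanh can be written as 2 * sigmoid(2z) - 1.

```python
import numpy as np

def sigmoid(z):
    # maps any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh_via_sigmoid(z):
    # tanh is a shifted and scaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
    return 2.0 * sigmoid(2.0 * z) - 1.0

z = np.linspace(-5, 5, 11)
print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))  # True
```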
Using it within the units of a neural network almost always works a lot better than using the sigmoid function.
Because the values lie between -1 and +1, the activations coming out of the hidden layer have a mean close to zero, which makes learning for the next layer a little bit easier.
The one exception is the output layer of a binary classification problem: there you still use the sigmoid function (with, for example, the relu function in the hidden layers), because when you want to predict either 0 or 1 it makes sense for y-hat to lie between 0 and 1 rather than between -1 and +1.
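As a rough sketch of that setup (the layer sizes and random weights here are made up purely for illustration): relu in the hidden layer, sigmoid at the output, so y-hat lands between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy shapes: 3 input features, 4 hidden units, 1 output unit
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))   # one example
a1 = relu(W1 @ x + b1)            # hidden layer: relu
y_hat = sigmoid(W2 @ a1 + b2)     # output layer: sigmoid, so 0 < y_hat < 1
print(y_hat)
```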
rectified linear unit (relu)
Another very popular activation function in machine learning is the rectified linear unit function, usually just called relu.
The derivative is 1 as long as z (the input on the x-axis) is positive, and 0 when z is negative.
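A minimal sketch of relu and its derivative in NumPy (at exactly z = 0 the derivative is undefined; treating it as 0 here is just a common convention):

```python
import numpy as np

def relu(z):
    # max(0, z): passes positive values through, clips negatives to 0
    return np.maximum(0.0, z)

def relu_derivative(z):
    # slope is 1 for positive z and 0 for negative z
    # (0 is used at z = 0 by convention)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]
```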
If you're not sure which function to use for your hidden layers, the relu function is a good default. Be aware, though, that there are no perfect guidelines about which function to use, because your data and your problems will always be unique. Choosing the right one is more of an art than a science, so you should try things out if you're not sure.
leaky rectified linear unit
The leaky relu function is a slightly changed version of the relu function. Instead of the slope being zero when z is negative, the function has a slight slope there.
This works a bit better most of the time but isn’t used that much in practice.
An advantage of both functions is that for a lot of the space of z the slope of the activation function is far from zero, which lets your neural network learn much faster.
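A minimal sketch of the leaky relu (the slope 0.01 for negative z is a commonly used value, not a fixed rule):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # like relu, but negative inputs get a small slope alpha instead of 0
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    # slope is 1 for positive z and alpha for negative z, so it is never exactly 0
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -1.0, 2.0])
print(leaky_relu(z))             # [-0.03 -0.01  2.  ]
print(leaky_relu_derivative(z))  # [0.01 0.01 1.  ]
```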
Why do you need non-linear activation functions?
If we use a linear activation function in the hidden layers, our neural network just outputs a linear function of the input, no matter how many layers it has, because a composition of linear functions is itself linear. This makes the neural network no better than logistic regression.
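A small NumPy sketch of why this happens (the shapes and random weights are invented for illustration): two stacked linear layers produce exactly the same outputs as one linear layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# two layers with linear (identity) activations
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

x = rng.standard_normal((3, 1))
two_layer_output = W2 @ (W1 @ x + b1) + b2

# the same mapping as a single linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer_output = W @ x + b

print(np.allclose(two_layer_output, one_layer_output))  # True
```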
The key takeaway is that linear activation functions in hidden layers are more or less useless, except in a few very special cases.
One case where you could use it is when you are working on a regression problem where y is a real number, like predicting house prices. But use it only at the output layer; the hidden layers should still use non-linear functions. You can see an example in the picture below.
Nevertheless, even then you could use a relu instead of a linear function at the output layer and get the same result, since house prices are never negative. This is one of the reasons why the sigmoid function is rarely used nowadays.
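Here is a sketch of that regression setup (again with toy shapes and random weights just for illustration): relu in the hidden layer and a linear unit at the output, so y-hat can be any real number.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(0.0, z)

# toy regression network: 3 input features, 4 hidden units, 1 real-valued output
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))
a1 = relu(W1 @ x + b1)    # hidden layer: non-linear
y_hat = W2 @ a1 + b2      # output layer: linear, so y_hat is unbounded
print(y_hat)
```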