A closer look into the magic of Deep Learning
Imagine being part of the biggest selfie ever taken, with millions of people in it, and imagine being able to pick out a specific face…in less than 5 seconds. The inconceivable part of that sentence is fitting millions of faces into one picture, not recognizing a face among millions in a few seconds. That part is already doable, already in use, and it’s one of the many amazing AI-related abilities provided by Deep Learning.
Deep learning is a technique that, like many other AI-related models, is inspired by our natural brain. In particular, Neural Networks, which are the underlying architecture of Deep Learning, are loosely analogous to biological neurons, albeit greatly simplified, and the connections between nodes can be thought of as in some way reflecting the connections between neurons.
The cell body of a neuron receives electrical signals through numerous tree-like receptive networks of nerve fibers called dendrites. The cell body sums all incoming signals into an integrated potential and, if that potential exceeds a threshold, activates the portion of the cell that carries nerve impulses away from it, called the axon, which ‘fires’ a spike. The axon connects to other neurons at points of contact called synapses. Depending on factors such as its geometry and the type of neurotransmitter, each synapse will increase or decrease the potential it passes along, inhibiting or exciting the generation of pulses in the downstream neuron and starting the whole sequence of events again.
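To make the analogy concrete, here is a tiny, purely illustrative Python sketch of that thresholded behavior; the names and numbers are made up for illustration, not a biophysical model:

```python
import numpy as np

def neuron_fires(incoming_signals, synaptic_weights, threshold=1.0):
    # Each synapse excites (positive weight) or inhibits (negative weight)
    # the signal it passes along; the cell body sums the results into
    # an integrated potential.
    potential = np.dot(incoming_signals, synaptic_weights)
    return potential > threshold  # True -> the axon "fires" a spike

# 0.9*1.2 + 0.7*0.5 = 1.43, which exceeds the threshold of 1.0
print(neuron_fires(np.array([0.9, 0.7]), np.array([1.2, 0.5])))  # True
```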
Similarly, an artificial neural network (ANN) consists of many highly connected artificial neurons arranged in layers. A signal enters the ANN through the input layer and exits from the output layer after passing through a number of so-called hidden layers in between, which progressively process and simplify the original signal in order to achieve the desired output.
Thanks to the growing availability of computing power, neural networks can now handle many more hidden layers – hence the adjective “deep” to describe large artificial neural networks capable of processing an incredible amount of information.
Among the most popular types of neural networks today, if not the most popular, are Convolutional Neural Networks (CNNs), which in 2012 achieved state-of-the-art results on the object recognition challenge known as ImageNet, blowing away all other solutions based on different approaches. The result created a big hype, also thanks to a series of highly influential papers such as Krizhevsky, Sutskever and Hinton’s 2012 “ImageNet Classification with Deep Convolutional Neural Networks”.
The Convolutional Neural Network approach is also inspired by the human brain, and in particular by the visual system, which accounts for approximately half of the entire human neocortex. Our incredible ability to perceive and contextualize information is achieved by processing everything we see in a hierarchical way, starting from the simpler features of an image and moving to more complex ones, all the way to object recognition and the classification of objects into categories. The neurons in the early visual areas extract simple image features over small local regions of visual space. As the information gets transmitted to higher visual areas, neurons respond to increasingly complex features. With higher levels of information processing the representations become more invariant – less sensitive to exact feature size, rotation or position. This hierarchical architecture of the primate visual system has inspired computer scientists to create models of artificial neural networks, like the CNN, that also feature several layers, each of which creates higher generalizations of the input data.
By feeding a CNN a picture that contains one or more objects we want it to recognize, the network starts an identification process whose final result is a value indicating the probability that the particular object we are looking for – be it a cat, a car, a face or simply a letter of the alphabet – exists in the original input image.
Let’s suppose we have a picture of a digit and we want our network to recognize which number it is. The input signal to our neural network could be a matrix of the values of each pixel of the picture, with each neuron of the first input layer holding one of these values.
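In code, that input layer is just a matrix of pixel intensities; a minimal sketch (using a random stand-in for a real scanned digit) could look like this:

```python
import numpy as np

# A 28x28 grayscale digit image becomes a matrix of pixel intensities,
# scaled to [0, 1]; each value feeds one input neuron.
image = np.random.randint(0, 256, size=(28, 28))  # stand-in for a real digit scan
input_layer = image.astype(np.float32) / 255.0    # one value per input neuron
print(input_layer.shape)  # (28, 28) -> 784 input neurons once flattened
```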
In a second layer we could apply different filters, which are also matrices of values, that allow us to isolate and identify specific patterns (features) within the original picture. For example, thanks to these filters, we could identify patterns of straight lines or curved ones, which is already a step forward in classifying the number: 1, 4 and 7 should not contain curved lines; 3, 8 and 0 should not contain straight lines; and 2, 5, 6 and 9 combine the two patterns. This filtering technique, called Convolution, creates a stack of filtered images, or Feature Maps, each containing a specific recognized pattern, which are given as input to the subsequent layer of neurons.
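A minimal, illustrative implementation of this convolution step might look like the following sketch, where both the filter values and the image are stand-ins:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the filter over the image and take a weighted sum at each
    position, producing a feature map (one value per position)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A filter that responds strongly to straight vertical lines -- the kind
# of pattern that helps separate a "1" from an "8".
vertical_line = np.array([[-1.0, 2.0, -1.0],
                          [-1.0, 2.0, -1.0],
                          [-1.0, 2.0, -1.0]])

digit_image = np.random.rand(28, 28)   # stand-in for a real digit picture
feature_map = convolve2d(digit_image, vertical_line)
print(feature_map.shape)               # (26, 26)
```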
We could then use an additional layer to simplify the information arriving from the previous one, as the input images can be very large and contain millions of pixel values. With some degree of simplification, this can be achieved by summarizing the outputs of small neighboring groups of neurons, selecting the highest value contained in each group. The extracted values constitute a new, much smaller and further filtered image, which also makes the features detected in the previous layer a bit more robust and a little less sensitive to position, allowing for some shift of the feature within the original image. This technique is called Max Pooling and it is widely used in Convolutional Networks.
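Here is a small sketch of that Max Pooling step, assuming a 2×2 window:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep only the highest value in each size x size neighbourhood,
    shrinking the map and adding some tolerance to small shifts."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # trim to a multiple of the window
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

fm = np.random.rand(26, 26)   # e.g. a feature map from the convolution step
print(max_pool(fm).shape)     # (13, 13): a quarter of the original values
```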
While the role of the artificial neurons in the first input layer of the network is to hold the value of each pixel of the original image, the role of the neurons in the hidden layers is to hold the result of a function that takes as input the values contained in the neurons of the preceding layer. The role of the connections between neurons, much like the role of our biological synapses, is to carry a weight that represents the connection strength – in other words, the importance of that specific connection and therefore the strength with which those values will be propagated over subsequent layers. Important features or patterns that strongly characterize the object in the original image will be carried along layer after layer, while elements that are irrelevant to determining the object will lose weight along the way.
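In code, each hidden neuron is just such a weighted sum of the previous layer’s values; the sketch below uses illustrative shapes and random weights:

```python
import numpy as np

# One hidden layer's computation: a weighted sum of the previous layer's
# outputs. Large-magnitude weights let a feature propagate strongly;
# weights near zero effectively drop it.
previous_layer = np.random.rand(169)        # e.g. a flattened 13x13 pooled map
weights = np.random.randn(64, 169) * 0.01   # one weight per connection
biases = np.zeros(64)
hidden_layer = weights @ previous_layer + biases  # 64 new neuron values
print(hidden_layer.shape)                   # (64,)
```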
Besides Convolution and Pooling layers, there may be other layers where specific activation functions are applied to further simplify the math of the overall network by filtering out unnecessary values.
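The most common such function in today’s CNNs is ReLU, which simply zeroes out negative values; a one-line sketch:

```python
import numpy as np

# ReLU activation: negative responses are discarded, positive ones pass through.
def relu(x):
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.5, 3.0])))  # [0.  0.5 3. ]
```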
All these layers can be stacked one after the other, in different orders and as many times as needed, to produce the most effective list of identified patterns within the original image.
Before the final output layer, Convolutional Networks implement one or more additional layers called Fully Connected Layers, where every value of each identified pattern is listed sequentially in a long array and every value becomes a vote that determines how strongly that particular element predicts the presence of the object we are looking for in the original image.
The result of a Neural Network computation is in fact a vote for each of the objects we want to detect in the picture, giving the level of confidence that the particular object has been identified. If, for example, we are configuring and training a network to recognize digits, a list of features representing a straight vertical line will vote strongly for the number 1, while a list of features representing two stacked circles will vote strongly for an 8.
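Putting the pieces together, a sketch of such a digit classifier in PyTorch could look like the following; the layer sizes are illustrative, not a tuned architecture:

```python
import torch
import torch.nn as nn

# A minimal CNN for 28x28 grayscale digits, stacking the layers described
# above: convolution -> activation -> pooling, twice, then fully connected
# layers that turn the detected patterns into ten "votes", one per digit.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),    # 8 filters -> 8 feature maps (26x26)
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 13x13
    nn.Conv2d(8, 16, kernel_size=3),   # -> 16 maps (11x11)
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 5x5
    nn.Flatten(),                      # 16*5*5 = 400 values in one long array
    nn.Linear(400, 64),                # fully connected layer
    nn.ReLU(),
    nn.Linear(64, 10),                 # ten votes, one per digit 0-9
)

logits = model(torch.randn(1, 1, 28, 28))   # a fake input image
confidences = torch.softmax(logits, dim=1)  # votes -> probabilities
print(confidences.sum().item())             # 1.0
```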
Of course, when it comes to identifying a human face, the complexity of the image is much higher and the network may need to grow much deeper to handle it, but the basic principles remain the same.
For example, DeepFace, the deep learning facial recognition system created by Facebook, identifies human faces in digital images with 97% accuracy, employs a nine-layer neural net with over 120 million connection weights, and has been trained on four million images uploaded by Facebook users. To achieve that level of recognition accuracy, which closes the gap to human-level performance in face verification, additional techniques have been used. In particular, face alignment and frontalization pre-processing (which rotates all input images to a frontal view by applying a 3D model based on fiducial points) provides a normalized input. This allows further optimization of the network architecture and improves the recognition algorithms by applying different sets of filters at different locations in the feature map, since different regions of an aligned image have different local statistics. For example, areas between the eyes and the eyebrows exhibit a very different appearance and have much higher discrimination ability than areas between the nose and the mouth.
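To illustrate the idea of different filters at different locations, here is a sketch of a “locally connected” layer: unlike a convolution, which reuses one filter everywhere, it dedicates a separate filter to every position. The shapes and names are illustrative, not DeepFace’s actual architecture:

```python
import numpy as np

def locally_connected(image, filters):
    """Like a convolution, but filters[i, j] is a distinct filter learned
    for position (i, j) rather than one filter shared across the image."""
    kh, kw = 3, 3
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * filters[i, j])
    return out

aligned_face = np.random.rand(32, 32)                 # stand-in aligned image
per_location_filters = np.random.randn(30, 30, 3, 3)  # one 3x3 filter per spot
print(locally_connected(aligned_face, per_location_filters).shape)  # (30, 30)
```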
So far, we have surfed on the ‘Deep’ part of the ‘Deep Learning’ ocean…but what about the Learning side?
Deep Learning and the underlying neural network require a long training process during which the system has to learn how to autonomously identify an object. For this to happen, training requires a huge dataset of images of the object, from which the network learns, training cycle after training cycle, the features that discriminate that object. A highly discriminating set of features will then yield a robust, generalized model capable of good predictions on images it has never seen before. Similarly to what happens in the biological brain, the learning process strengthens the connections between neurons that carry the discriminating features by adjusting the weight value associated with each connection.
So, in the simplest terms, the training phase is mostly about adjusting the millions of weights, each associated with a connection in the network, until the image patterns that discriminate the object are correctly and consistently identified.
The majority of deep learning use cases implement an approach known as Supervised Learning. The goal of supervised learning is to find a function that best maps a set of labelled inputs to their correct outputs. An example would be our classification task, where the input is an image of a letter and the correct output is the name of the letter. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
A different approach is Unsupervised Learning, which owes its name to the fact that there are no correct answers and there is no teacher. The types of problems solved with the unsupervised learning approach are more related to clustering the input data to discover inherent groupings, or to finding ‘hidden’ associations within the data. The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it.
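As a small illustration, k-means clustering discovers groupings in unlabelled data with no teacher and no correct answers, only the structure of the data itself; a sketch with synthetic points and scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs of unlabelled 2D points.
points = np.vstack([np.random.randn(50, 2) + [0, 0],
                    np.random.randn(50, 2) + [5, 5]])

# k-means assigns each point to one of two discovered groupings.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print(labels[:5], labels[-5:])  # points from each blob share a cluster label
```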
Often, to solve real-world problems, a combination of supervised and unsupervised learning approaches is used. This combined approach is called Semi-Supervised Learning and works well in situations where you need to train a system to recognize an object but only some of the training images are labelled.
The first step in making a neural network ready to learn to recognize an object is to feed it a big dataset of training images of the object. In supervised learning, the images used for training are labelled with the correct name of the object they represent. The more examples you have, the more reliably you will be able to train your network.
Once you start feeding the training images into the neural network, the initial quality of the predictions about what the object represents will be very poor. This is because the filters used to identify patterns in the input image and all the weights associated with each connection in the network are initially unknown and are therefore initialized with nearly random values. To improve the prediction, the output of the network is then compared with what the correct answer should have been, and the difference, or error, is used to adjust the weight values across the entire network.
The error, which indicates how poorly our initial model performed, is calculated by a Loss Function. Another algorithm, called Gradient Descent, then uses this error to calculate the adjustments, or gradients, needed to move the prediction toward the expected result and therefore minimize the loss, or error.
All the adjustments (gradients) are calculated and applied backward, from the output layer to the first one, for each neuron of each layer – which is why the overall process is called Backpropagation.
The whole process is repeated for each training image, and all the weight values are nudged slightly up or slightly down, as indicated by the Loss Function and the Gradient Descent algorithm, improving the prediction each time. Training stops when the network reaches a solid level of prediction quality and the training images are consistently classified correctly.
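The sketch below compresses that whole loop into a few lines of PyTorch, using fake data and a tiny stand-in model (the CNN sketched earlier would slot in instead):

```python
import torch
import torch.nn as nn

# A deliberately tiny stand-in model: flatten the image, one linear layer.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
images = torch.randn(32, 1, 28, 28)              # a batch of fake training images
labels = torch.randint(0, 10, (32,))             # their "correct" digits

loss_fn = nn.CrossEntropyLoss()                  # measures how wrong we are
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent

for step in range(100):                          # each pass nudges the weights
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)        # how poorly did we predict?
    loss.backward()                              # backpropagation of the error
    optimizer.step()                             # adjust weights down-gradient
```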
The network is then tested on new images of the same object, to validate whether the learned model is solid enough to correctly classify objects in images it has never seen before.
Convolutional Neural Networks are not the only type of artificial network, of course, but I’ve used them for my overview as they are implemented in the majority of today’s Deep Learning use cases. Their limits, besides the ones I mentioned in my previous post, derive from their specific ability to deal with images: Convolutional Networks only capture local “spatial” patterns in data, and if the data cannot be made to look like an image, they are less useful.
But other things can be represented as images. An audio track, for example, can be broken into time steps and fed into a Convolutional Network as a long sequence of frequency bitmaps. Each bitmap could represent the wave form of the audio of a letter which, combined with others in the right sequence, produces a word, which can then be combined into sentences and so on…that is why Convolutional Networks are also heavily used in Speech Recognition and Natural Language Processing use cases.
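As a small illustration, a spectrogram turns a waveform into exactly that kind of 2D “bitmap” of frequencies over time steps; a sketch with a synthetic signal:

```python
import numpy as np
from scipy.signal import spectrogram

# One second of a synthetic 440 Hz tone as a stand-in for real speech audio.
fs = 16000                                  # sample rate (Hz)
t = np.linspace(0, 1, fs, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)

# The spectrogram slices the waveform into time steps and reports which
# frequencies are present in each one -- a 2D "image" a CNN can consume.
freqs, times, Sxx = spectrogram(wave, fs)
print(Sxx.shape)                            # (frequency bins, time steps)
```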
So, even considering their limits, Convolutional Networks are extremely powerful and are implemented across a vast spectrum of use cases…and when we are able to take talking selfies, not only will they be able to recognize the faces, they will also recognize who, among millions, shouted “Arrivederci!”.