What is Deep Learning, Anyway?
Machine learning has gained significant prominence through its widespread use across an immense number of applications, from retailers targeting products and marketing to individual consumers, to high-frequency trading and quantitative models revolutionizing modern finance, not to mention the seemingly constant media attention it receives in debates over privacy, user data, and cybersecurity. As a result, “machine learning” has become part of the mainstream vernacular. We can therefore assume that the reader has some general familiarity with machine learning and the problems it attempts to solve, which the layperson (with some help/prodding) might describe as: using quantitative methods on data to optimize predictions about some future/unknown events or phenomena.
Even that general description is a bit too narrow, as making predictions is a task specific to “supervised” machine learning, which is distinguished by the fact that there is something to predict in the data (one or more labels that algorithms can use to “supervise” the performance of their predictions and optimize them). It would be a gross oversight to exclude “unsupervised” learning from the machine learning umbrella. Unsupervised learning has more to do with describing events/phenomena with quantitative methods on data: there are no clear “labels” to predict or to supervise performance, but we still want to understand and make inferences from features in the data, like clusters or distributional properties. “Reinforcement learning” also falls under the general scope of machine learning. It differs from both supervised and unsupervised learning in that (broadly speaking) the objective has more to do with “how” the machine makes decisions from data than with the features/labels themselves (e.g., procedures for a machine to learn which sequence of decisions will maximize some reward or minimize some cost).
Deep learning is more specific in the AI tasks it is used to solve. Broadly speaking, deep learning performs supervised/unsupervised tasks with very large (“deep”) Artificial Neural Networks (ANNs), ostensibly replicating not only what a human would decide, but also how humans come to those decisions. Standard examples include mimicking the firing of a critical mass of neurons in the human brain with deep Multi-Layer Perceptrons (MLPs), using Convolutional Neural Networks (CNNs) to approximate how groups of neurons are stimulated as we focus/narrow our sight to a visual receptive field, or even mimicking how humans learn language as we experience a vast collection of words, phrases, or documents, using Natural Language Processing (NLP) tools such as word embeddings, encoders, transformers, recurrent models like Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs), and many others. While the human brain doesn’t actually process information and make decisions like these ANNs do, much like machine learning itself, the architecture of these deep-learning models was inspired by the way that humans think and observe processes.
Thus, before diving too deeply into what deep learning is and its technical characteristics, we have already stumbled onto our first point of “distinction” between machine/deep learning.
- Machine Learning is More General than Deep Learning
It would be wrong to think of machine learning and deep learning as “different.” Indeed, deep learning is a subset of machine learning. What we will attempt to do in this article is highlight key features that distinguish deep learning specifically from other machine learning procedures.
Since “deep learning” is synonymous with “deep neural networks,” we might start with what distinguishes deep learning from other, simple neural networks that are used in machine learning, which we will call shallow neural networks / shallow learning. This distinction must go beyond simply the number of layers or trainable parameters in the models, as this would inevitably create an arbitrary cut-off between deep/shallow learning.
Some of the distinction between deep learning and simpler neural networks should include the space of problems/experiments that the network seeks to solve. In particular, deep learning has been wildly successful for tasks in computer vision, listening, and language. That’s not to say that deep learning is restricted to this space of tasks, but there is a huge literature in these areas on the optimal architectures and parameters for deep learning tasks. Much of the literature and many of the publicly available models focus on supervised learning, particularly for classification problems (e.g., the standard pre-trained models in libraries like Keras are used for image classification, and Google’s pre-trained “VGGish” model adapts the VGG architecture from image classification to audio processing, embedding, and sound classification).
What distinguishes the set of problems associated with vision, language, and listening, and why deep learning has been so successful at modeling them, is the complexity of the input features associated with these tasks. This leads to the second property that characterizes deep learning: deep learning not only requires a neural network architecture, but the hidden layers of that ANN sequentially generate new representations of the features inputted to the model.
- Deep Learning Uses Input Tensors to Replace Traditional Feature Selection with Representation Learning
A fundamental property of deep-learning algorithms is the powerful and robust way deep learning automates many traditional feature engineering and feature selection procedures with representation learning. Deep learning provides a wholesale solution to the feature selection process, including approaches to manage problems of overfitting and dependent observations. Unlike traditional machine learning models, deep learning inputs are not limited to a single table, where each row is a vector of features for a particular individual/observation. Rather, the inputs to a deep learning model are tensors, general mathematical constructs that may have their own dependencies, geometry, feature relationships, etc. Thus, instead of a single row of data representing each individual for the model to learn, the individuals may themselves consist of complex, multidimensional tables.
For example, in image processing, representation learning means that the entire image, represented as a tensor of RGB values with dimensions (264, 264, 3), can be directly inputted to the model and learned by the algorithms. The inputs are large, and the model architecture is designed specifically to codify and learn all relevant features, dependencies, and relationships from the raw tensors through a complex and highly specialized sequence of learning nodes tailor-made for that task.
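To make the contrast between a table row and a tensor input concrete, here is a minimal sketch in plain Python. The feature values in `tabular_row` and the helper `shape_of` are hypothetical and exist only for illustration; the (264, 264, 3) dimensions are the ones used in the example above. In practice a numerical library would hold these arrays, but nested lists are enough to show the shapes.

```python
def shape_of(tensor):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)


# Traditional tabular input: one flat feature vector per observation.
# These four feature values are hypothetical.
tabular_row = [34.0, 1.0, 72000.0, 0.0]

# Image input: a single observation is itself a (height, width, channels) tensor.
HEIGHT, WIDTH, CHANNELS = 264, 264, 3
image = [[[0 for _ in range(CHANNELS)] for _ in range(WIDTH)]
         for _ in range(HEIGHT)]

# A mini-batch of images adds a leading "batch" axis.
batch = [image, image]

# Number of raw values the model sees for ONE image.
values_per_image = HEIGHT * WIDTH * CHANNELS

print(shape_of(tabular_row))  # (4,)
print(shape_of(image))        # (264, 264, 3)
print(shape_of(batch))        # (2, 264, 264, 3)
print(values_per_image)       # 209088
```

A single image of this size already carries 209,088 raw values, which is part of why hand-built feature selection becomes impractical for these inputs.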
Early on, CNNs were shown to be highly effective at capturing the local (short-range) spatial dependencies in image data, and they became the standard for deep learning vision tasks. Sliding “windows” scan the images and identify patterns in a manner that is designed to mimic the stimulation of neurons from our visual receptive field when we focus our sight. In CNNs these windows are kernels and filters of different sizes that identify signals/relationships within and across each color channel, respectively, and combinations of them can be customized to learn nearly any set of images. Large, standardized datasets like ImageNet and pre-trained models were created by researchers testing and competing for optimal learning architectures to classify those images, and nearly all of the top performers stack many convolutional layers in some form, like ResNet, VGG, Inception, etc. While other architectures have also been proven effective for these tasks, what distinguishes all of them is their ability to take in large collections of raw images and accurately represent them, detecting key features automatically.
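The sliding-window idea can be sketched in a few lines of plain Python. This is a simplified illustration, not how any library implements it: it handles one channel, uses stride 1 and no padding, and computes what deep learning libraries call a convolution (strictly a cross-correlation). The toy image and the vertical-edge kernel are hypothetical; in a real CNN the kernel values are learned during training.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the pixels currently under the window.
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy single-channel image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, vertical_edge_kernel)
print(feature_map)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The feature map is nonzero only where the window straddles the edge, which is the sense in which a kernel "detects" a local pattern wherever it appears in the image.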
This naturally leads us to our next distinguishing property of deep neural networks, which we have already noted and seems rather obvious: they are neural networks! Neural networks address how deep learning algorithms perform feature learning. The example of CNNs from computer vision will be helpful because representation learning always depends on the specific inputs for the deep learning task.
- Deep Learning Uses Large Neural Networks and Sequential Layers
The “deep” in deep learning refers to the large number of “hidden layers” that are common to these neural network architectures. Accurate representation learning typically requires a very deep neural network, often with dozens of layers, and thus the term “deep neural networks” was coined.
The input to each layer in a neural network is either the raw input of initial features or an output from a previous layer of the model. Each layer of a deep neural network offers a new representation of the initial input features, with the goal of completely automating this feature learning, along with codifying and discriminating all key features and patterns. The input to the network may be a raw image, but the output of the final layer is a set of activated kernels/neurons of lower dimensionality that the ANN uses as a representation of the initial image.
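The way each layer consumes the previous layer's output and emits a new, here smaller, representation can be sketched as follows. This is a toy illustration with fully connected layers and ReLU activations; the layer sizes, the random weights, and the helper names `dense_relu` and `random_layer` are all hypothetical. A trained network would have learned weights, and a vision model would use convolutional layers rather than dense ones.

```python
import random


def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU: max(0, W.x + b) per neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(max(0.0, z))
    return outputs


def random_layer(n_in, n_out, rng):
    """Untrained stand-in weights for a layer mapping n_in features to n_out."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
layer_sizes = [12, 8, 4, 2]  # hypothetical: raw input -> hidden -> hidden -> output
layers = [random_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

x = [rng.random() for _ in range(layer_sizes[0])]  # stand-in for raw input features

# Each layer takes the previous representation and produces a new one.
representations = [x]
for weights, biases in layers:
    representations.append(dense_relu(representations[-1], weights, biases))

representation_sizes = [len(r) for r in representations]
print(representation_sizes)  # [12, 8, 4, 2]
```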
So, how many layers are necessary for this representation to accurately codify all of the key characteristics of pictures, videos, sound files, or human language? It turns out that the answer is usually: quite a lot. Machine learning of any kind is notoriously prone to overfitting and bias, but there are powerful controls that help what is learned to generalize. Additional layers can accomplish this task, like dropout layers, which randomly zero out a fraction of the neurons/kernels on each training step, forcing the network to generalize and not “get stuck” on any one detail, or batch-normalization layers, which rescale a layer’s activations using the statistics of the current “batch” fed to the model and tend to stabilize training. In the end, accurately learning key features of our data, with the ability to generalize those characteristics to understand features of new, similar data, typically requires a very large neural network.
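The two regularizing layers mentioned here can be sketched in a few lines. This is a minimal sketch under stated assumptions: dropout is written in the common "inverted" form, where each activation is zeroed at random with probability `rate` during training and the survivors are rescaled so the expected value is unchanged, and batch normalization is shown for a single feature without the learned scale and shift parameters that real layers add. The function names and sample values are hypothetical.

```python
import math
import random


def dropout(activations, rate, rng):
    """Inverted dropout (training mode): zero each value with probability `rate`,
    and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_probability = 1.0 - rate
    return [a / keep_probability if rng.random() >= rate else 0.0
            for a in activations]


def batch_norm(batch_values, eps=1e-5):
    """Normalize one feature across a mini-batch to roughly zero mean and unit
    variance (learned scale/shift omitted for brevity)."""
    mean = sum(batch_values) / len(batch_values)
    variance = sum((v - mean) ** 2 for v in batch_values) / len(batch_values)
    return [(v - mean) / math.sqrt(variance + eps) for v in batch_values]


rng = random.Random(42)
activations = [1.0] * 10

# Roughly half of the activations are zeroed; the rest are scaled up to 2.0.
dropped = dropout(activations, rate=0.5, rng=rng)

# The same feature observed for four examples in one mini-batch.
normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
```

At inference time dropout is switched off, and batch normalization layers in real libraries use running averages collected during training rather than the statistics of the current batch.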
Indeed, standard models that are publicly available for tasks like computer vision and language processing can be enormous. The VGG16 architecture (shown below, reproduced from Nash, et al., 2018) has 16 convolution/fully connected (dense) layers alone, not including the max-pooling and softmax layers, which bring the total to 22 network layers. Language models get even larger, with the largest BERT architecture having 24 encoder layers alone, each encoder being a block of self-attention and feedforward sublayers (BERT uses no recurrent layers).
VGG16 (left) and BERT (right) Architectures
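The layer counts for VGG16 can be checked with a short calculation. The sketch below assumes the standard ImageNet configuration of VGG16, which is not spelled out in the text: 3x3 convolution kernels, a 224x224x3 input, five 2x2 max-pooling steps, and dense layers of 4096, 4096, and 1000 units. The names `VGG16_CONV_PLAN`, `DENSE_UNITS`, and `summarize_vgg16` are hypothetical.

```python
# Output channels of each 3x3 convolution, with "pool" marking a 2x2 max-pooling step.
VGG16_CONV_PLAN = [64, 64, "pool", 128, 128, "pool", 256, 256, 256, "pool",
                   512, 512, 512, "pool", 512, 512, 512, "pool"]
DENSE_UNITS = [4096, 4096, 1000]  # assumed ImageNet head with 1000 classes


def summarize_vgg16(input_size=224, input_channels=3):
    """Count layers and trainable parameters for the assumed VGG16 layout."""
    channels, spatial = input_channels, input_size
    conv_params = conv_layers = pool_layers = 0
    for step in VGG16_CONV_PLAN:
        if step == "pool":
            spatial //= 2            # 2x2 max-pooling halves height and width
            pool_layers += 1
        else:
            # 3x3 kernel per input channel, plus one bias, for each output channel.
            conv_params += (3 * 3 * channels + 1) * step
            channels = step
            conv_layers += 1

    features = spatial * spatial * channels  # flattened input to the first dense layer
    dense_params = 0
    for units in DENSE_UNITS:
        dense_params += (features + 1) * units   # weights plus one bias per unit
        features = units

    weight_layers = conv_layers + len(DENSE_UNITS)
    return {
        "weight_layers": weight_layers,
        "total_layers": weight_layers + pool_layers + 1,  # +1 for the softmax
        "conv_params": conv_params,
        "dense_params": dense_params,
        "total_params": conv_params + dense_params,
    }


summary = summarize_vgg16()
print(summary["weight_layers"])  # 16
print(summary["total_layers"])   # 22
print(summary["total_params"])   # 138357544
```

Under these assumptions, most of the parameters (about 124 million of the roughly 138 million) sit in the three dense layers rather than in the thirteen convolutional ones.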
As one can imagine, deep learning requires an immense amount of data in order to generate sufficiently robust predictions. Fortunately, if such data is not available, the data scientist is not out of luck, as there are very robust and successful pre-trained models from publicly available packages or APIs that they can start with, and tune to their particular task. This leads right to our next distinguishing feature of deep learning.
- Deep Learning Uses Transfer Learning: Pre-Trained Layers and Fine Tuning
The importance of very large training/test samples for deep learning cannot be overstated. For things like computer vision, a single image may be transformed in dozens of different ways (rotations/reflections, adjusting pixel size/granularity, hue, contrast, color quality, etc.) to greatly expand sample sizes and learn all identifiable versions of that image. For natural language, vast collections of documents from web-crawl/cached websites, news, Wikipedia, etc. are used to give deep learners an ample supply of text and phrases to train large language models.
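The image transformations described above can be illustrated with a few geometric operations on a toy image. This is a minimal sketch covering only reflections and 90-degree rotations; the helper names and the 2x2 "image" are hypothetical, and real pipelines typically also vary scale, hue, contrast, and so on.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original plus five transformed copies; each keeps the original label."""
    variants = [image, flip_horizontal(image), flip_vertical(image)]
    rotated = image
    for _ in range(3):               # 90, 180, and 270 degrees
        rotated = rotate_90_clockwise(rotated)
        variants.append(rotated)
    return variants


image = [[1, 2],
         [3, 4]]

variants = augment(image)
print(len(variants))               # 6
print(rotate_90_clockwise(image))  # [[3, 1], [4, 2]]
```

One labeled image becomes six training examples here, which is the sense in which augmentation "expands" the sample.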
If a researcher or a working data scientist wishes to use deep learning for their own customized vision/language tasks, but lacks the time, the resources, or the extreme computing power required for processing/training a model on such an enormous universe of data, hope is not lost. Deep learning models are so powerfully generalizable that data scientists can reliably turn to others who have already done much of that heavy lifting. Even when we need to classify data for new objects outside of the initial classification groups, pre-trained models and transfer learning are robust. It may seem absurd that a model trained on ImageNet to classify images of things like cats/dogs could be useful for tasks like tumor identification or retinal scans, but such a model gives the data scientist’s algorithms a head start, for example in quickly recognizing what is not a tumor or a retinal image. Consider the classic “bird or the branch” bias problem, where a deep learner is tasked with animal classification, but trained/tested only on images of birds perched on tree branches. This model may seem to perform well at identifying birds when, in fact, it has only learned to detect features of the much simpler background branches rather than the object in question. Starting from a model pre-trained on a large and varied collection of images can help guard against this kind of shortcut.
Pre-trained models can enable the data scientist to employ a very deep architecture that they could not have used otherwise, by fine tuning the pre-trained model to their data. Unlike many traditional statistical models, neural networks can update model parameters incrementally without re-estimating all parameters from the full training sample (thanks to gradient-based training with backpropagation). In this way, researchers can input their own data to a pre-trained model as a new set of epochs and tune the parameters of the representation layers in the deep architecture, and even add new classification categories (like tumors) with additional dense/softmax layers, for example.
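The simplest version of this workflow, keeping the pre-trained layers frozen and training only a newly added output layer, can be sketched in plain Python. Everything here is a toy stand-in: `frozen_weights` plays the role of pre-trained representation layers, the four data points and their labels are hypothetical, and the new "head" is a single sigmoid unit trained by stochastic gradient descent. With a real library one would instead load published weights and mark those layers as non-trainable.

```python
import math
import random


def frozen_features(x, frozen_weights):
    """Stand-in for pre-trained layers: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in frozen_weights]


def predict(features, head_weights, head_bias):
    """New classification head: one sigmoid unit on top of the frozen features."""
    z = sum(w * f for w, f in zip(head_weights, features)) + head_bias
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(data, frozen_weights, head_weights, head_bias):
    total = 0.0
    for x, y in data:
        p = predict(frozen_features(x, frozen_weights), head_weights, head_bias)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def fine_tune_head(data, frozen_weights, epochs=100, learning_rate=0.5, seed=0):
    """Train only the head; `frozen_weights` is read but never updated."""
    rng = random.Random(seed)
    head_weights = [0.0] * len(frozen_weights)
    head_bias = 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            features = frozen_features(x, frozen_weights)
            # Gradient of the log loss with respect to the pre-sigmoid output.
            error = predict(features, head_weights, head_bias) - y
            head_weights = [w - learning_rate * error * f
                            for w, f in zip(head_weights, features)]
            head_bias -= learning_rate * error
    return head_weights, head_bias


frozen_weights = [[1.0, -1.0], [-1.0, 1.0]]   # hypothetical "pre-trained" layer
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1),     # hypothetical new task
        ([0.0, 1.0], 0), ([0.1, 0.8], 0)]

loss_before = mean_log_loss(data, frozen_weights, [0.0, 0.0], 0.0)
head_weights, head_bias = fine_tune_head(data, frozen_weights)
loss_after = mean_log_loss(data, frozen_weights, head_weights, head_bias)
print(loss_after < loss_before)  # True
```

Freezing the representation layers is a design choice that keeps training cheap; fully fine tuning them as well, as the paragraph describes, means letting the gradient updates reach those weights too, usually with a small learning rate.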
It wasn’t until recently, with the advent of ULMFiT and BERT in 2018, that this kind of transfer learning was possible for language models. Here, the task is significantly more complicated, as words are learned from their use in context, which can vary for words that have multiple meanings or for words that are specific to technical/industry documents where they may be used in contexts never seen by the pretrained layers. Now, transfer learning and fine tuning are standard and important practices for deep learning tasks of all kinds.
- Deep Learning is Computationally Intensive and Uses Cloud/Distributed Programming
We couldn’t conclude an article on deep learning without pointing out the (often) extreme computational demands of training/tuning a deep neural network. Fortunately, for the data scientist who can test/deploy with a pre-trained model, the computational requirements are not as prohibitive, even with a few additional layers (dense/softmax) to customize to their own data. The challenge arises when one wishes to develop a very deep architecture from scratch, which can take days to run on some of the largest cloud environments/clusters. The aforementioned VGG16 has over 138 million trainable parameters, and BERT has roughly 110 million in its base configuration and roughly 340 million in its largest, so even simple fine tuning can take hours to run.
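A back-of-the-envelope calculation shows why parameter counts of this size translate into real hardware demands. The sketch below uses the rounded counts quoted above and makes several assumptions that are not from the article: 32-bit floats (4 bytes each), and an Adam-style optimizer that keeps two extra values per parameter in addition to the weights and their gradients. It ignores the memory for activations, which depends on batch size and can be larger still. The function name `training_memory_gb` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def weights_memory_gb(n_parameters):
    """Memory needed just to hold the weights, in gigabytes (10**9 bytes)."""
    return n_parameters * BYTES_PER_FLOAT32 / 1e9


def training_memory_gb(n_parameters, optimizer_slots=2):
    """Rough lower bound while training: weights + gradients + optimizer state.

    `optimizer_slots=2` assumes an Adam-style optimizer (two moment estimates
    per parameter). Activation memory is NOT included.
    """
    copies = 1 + 1 + optimizer_slots
    return n_parameters * BYTES_PER_FLOAT32 * copies / 1e9


vgg16_parameters = 138_000_000   # rounded figure quoted in the text
bert_parameters = 110_000_000    # rounded figure quoted in the text

for name, n in [("VGG16", vgg16_parameters), ("BERT", bert_parameters)]:
    print(name, round(weights_memory_gb(n), 3), round(training_memory_gb(n), 3))
# VGG16 0.552 2.208
# BERT 0.44 1.76
```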
As such, training a deep learning model on any dataset that is sufficiently large/robust will almost surely require cloud infrastructure and a cluster of machines. Fortunately, deep learning APIs like Keras/TensorFlow and PyTorch make training over multiple CPUs/GPUs relatively straightforward, and cloud providers like AWS, GCP, and Azure have their own tools and APIs to optimize training, like Google Cloud’s TPUs and AI Platform, AWS’ SageMaker and Deep Learning AMIs, Azure Machine Learning Notebooks, etc. One can hardly become a skilled practitioner of deep learning methods without also acquiring non-trivial skills as a cloud engineer with one or more of these tools.