After a prolonged winter, artificial intelligence is experiencing a scorching summer mainly thanks to advances in deep learning and artificial neural networks. To be more precise, the renewed interest in deep learning is largely due to the success of convolutional neural networks (CNNs), a neural network structure that is especially good at dealing with visual data.
But what if I told you that CNNs are fundamentally flawed? That was what Geoffrey Hinton, one of the pioneers of deep learning, talked about in his keynote speech at the AAAI conference, one of the main yearly AI conferences.
Hinton, who attended the conference with Yann LeCun and Yoshua Bengio, with whom he constitutes the Turin Award–winning “godfathers of deep learning” trio, spoke about the limits of CNNs as well as capsule networks, his masterplan for the next breakthrough in AI.
As with all his speeches, Hinton went into a lot of technical details about what makes convnets inefficient—or different—compared to the human visual system. Following is some of the key points he raised. But first, as is our habit, some background on how we got here and why CNNs have become such a great deal for the AI community.
Solving computer vision
Since the early days of artificial intelligence, scientists sought to create computers that could see the world like humans. The efforts have led to their own field of research collectively known as computer vision.
Early work in computer vision involved the use of symbolic artificial intelligence, software in which every single rule must be specified by human programmers. The problem is, not every function of the human visual apparatus can be broken down in explicit computer program rules. The approach ended up having very limited success and use.
A different approach was the use of machine learning. Contrary to symbolic AI, machine learning algorithms are given a general structure and unleashed to develop their own behavior by examining training examples. However, most early machine learning algorithms still required a lot of manual effort to engineers the parts that detect relevant features in images.
Classic machine learning approaches involved lots of complicated steps and required the collaboration of dozens of domain experts, mathematicians, and programmers
Convolutional neural networks, on the other hand, are end-to-end AI models that develop their own feature-detection mechanisms. A well-trained CNN with multiple layers automatically recognizes features in a hierarchical way, starting with simple edges and corners down to complex objects such as faces, chairs, cars, dogs, etc.
CNNs were first introduced in 1980s by LeCun, then a postdoc research associate in Hinton’s lab in University of Toronto. But because of their immense compute and data requirements, they fell by the wayside and gained very limited adoption. It took three decades and advances in computation hardware and data storage technology for CNNs to manifest their full potential.
Today, thanks to the availability of large computation clusters, specialized hardware, and vast amounts of data, convnets have found many useful applications in image classification and object recognition.
Each layer of the neural network will extract specific features from the input image.
The difference between CNNs and human vision
“CNNs learn everything end to end. They get a huge win by wiring in the fact that if a feature is good in one place, it’s good somewhere else. This allows them to combine evidence and generalize nicely across position,” Hinton said in his AAAI speech. “But they’re very different from human perception.”
One of the key challenges of computer vision is to deal with the variance of data in the real world. Our visual system can recognize objects from different angles, against different backgrounds, and under different lighting conditions. When objects are partially obscured by other objects or colored in eccentric ways, our vision system uses cues and other pieces of knowledge to fill in the missing information and reason about what we’re seeing.
Creating AI that can replicate the same object recognition capabilities has proven to be very difficult.
“CNNs are designed to cope with translations,” Hinton said. This means that a well-trained convnet can identify an object regardless of where it appears in an image. But they’re not so good at dealing with other effects of changing viewpoints such as rotation and scaling.
One approach to solving this problem, according to Hinton, is to use 4D or 6D maps to train the AI and later perform object detection. “But that just gets hopelessly expensive,” he added.
For the moment, the best solution we have is to gather massive amounts of images that display each object in various positions. Then we train our CNNs on this huge dataset, hoping that it will see enough examples of the object to generalize and be able to detect the object with reliable accuracy in the real world. Datasets such as ImageNet, which contains more than 14 million annotated images, aim to achieve just that.
“That’s not very efficient,” Hinton said. “We’d like neural nets that generalize to new viewpoints effortlessly. If they learned to recognize something, and you make it 10 times as big and you rotate it 60 degrees, it shouldn’t cause them any problem at all. We know computer graphics is like that and we’d like to make neural nets more like that.”
In fact, ImageNet, which is currently the go-to benchmark for evaluating computer vision systems, has proven to be flawed. Despite its huge size, the dataset fails to capture all the possible angles and positions of objects. It is mostly composed of images that have been taken under ideal lighting conditions and from known angles.
This is acceptable for the human vision system, which can easily generalize its knowledge. In fact, after we see a certain object from a few angles, we can usually imagine what it would look like in new positions and under different visual conditions.
But CNNs need detailed examples of the cases they need to handle, and they don’t have the creativity of the human mind. Deep learning developers usually try to solve this problem by applying a process called “data augmentation,” in which they flip the image or rotate it by small amounts before training their neural networks. In effect, the CNN will be trained on multiple copies of every image, each being slightly different. This will help the AI better generalize over variations of the same object. Data augmentation, to some degree, makes the AI model more robust.
But data augmentation won’t cover corner cases that CNNs and other neural networks can’t handle, such as an upturned chair, or a crumpled t-shirt lying on a bed. These are real-life situation can’t be achieved with pixel manipulation.
ImageNet vs reality: In ImageNet (left column) objects are neatly positioned, in ideal background and lighting conditions. In the real world, things are messier (source: objectnet.dev)
There have been efforts to solve this generalization problem by creating computer vision benchmarks and training datasets that better represent the messy reality of the real world. But while they will improve the results of current AI systems, they don’t solve the fundamental problem of generalizing across viewpoints. There will always be new angles, new lighting conditions, new colorings, and poses that these new datasets don’t contain. And those new situations will befuddle even the largest and most advanced AI system.
Differences can be dangerous
From the points raised above, it is obvious that CNNs recognize objects in a way that is very different from humans. But these differences are not limited to weak generalization and the need for many more examples to learn an object. The internal representations that CNNs develop of objects are also very different from that of the biological neural network of the human brain.
How does this manifest itself? “I can take an image and a tiny bit of noise and CNNs will recognize it as something completely different and I can hardly see that it’s changed. That seems really bizarre and I take that as evidence that CNNs are actually using very different information from us to recognize images,” Hinton said in his keynote speech at the AAAI Conference.
These slightly modified images are known as “adversarial examples,” and are a hot area of research in the AI community.
Adversarial examples can cause neural networks to misclassify images while appearing unchanged to the human eye
“It’s not that it’s wrong, they’re just doing it in a very different way, and their very different way has some differences in how it generalizes,” Hinton says.
But many examples show that adversarial perturbations can be extremely dangerous. It’s all cute and funny when your image classifier mistakenly tags a panda as a gibbon. But when it’s the computer vision system of a self-driving car missing a stop sign, an evil hacker bypassing a facial recognition security system, or Google Photos tagging humans as gorillas, then you have a problem.
There have been a lot of studies around detecting adversarial vulnerabilities and creating robust AI systems that are resilient against adversarial perturbations. But adversarial examples also bear a reminder: Our visual system has evolved over generations to process the world around us, and we have also created our world to accommodate our visual system. Therefore, as long as our computer vision systems work in ways that are fundamentally different from human vision, they will be unpredictable and unreliable, unless they’re supported by complementary technologies such as lidar and radar mapping.
Coordinate frames and part-whole relationships are important
Another problem that Geoffrey Hinton pointed to in his AAAI keynote speech is that convolutional neural networks can’t understand images in terms of objects and their parts. They recognize them as blobs of pixels arranged in distinct patterns. They do not have explicit internal representations of entities and their relationships.
“You can think of CNNs as you center of various pixel locations and you get richer and richer descriptions of what is happening at that pixel location that depends on more and more context. And in the end, you get such a rich description that you know what objects are in the image. But they don’t explicitly parse images,” Hinton said.
Our understanding of the composition of objects help us understand the world and make sense of things we haven’t seen before, such as this bizarre teapot.
Decomposing an object into parts helps us understand its nature. Is this a toilet bowl or a teapot? (Source: Smashing lists)
Also missing from CNNs are coordinate frames, a fundamental component of human vision. Basically, when we see an object, we develop a mental model about its orientation, and this helps us to parse its different features. For instance, in the following picture, consider the face on the right. If you turn it upside down, you’ll get the face on the left. But in reality, you don’t need to physically flip the image to see the face on the left. Merely mentally adjusting your coordinate frame will enable you to see both faces, regardless of the picture’s orientation.
“You have a completely different internal percept depending on what coordinate frame you impose. Convolutional neural nets really can’t explain that. You give them an input, they have one percept, and the percept doesn’t depend on imposing coordinate frames. I would like to think that that is linked to adversarial examples and linked to the fact that convolutional nets are doing perception in a very different way from people,” Hinton says.
Taking lessons from computer graphics
One very handy approach to solving computer vision, Hinton argued in his speech at the AAAI Conference, is to do inverse graphics. 3D computer graphics models are composed of hierarchies of objects. Each object has a transformation matrix that defines its translation, rotation, and scale in comparison to its parent. The transformation matrix of the top object in each hierarchy defines its coordinates and orientation relative to the world origin.
For instance, consider the 3D model of a car. The base object has a 4×4 transformation matrix that says the car’s center is located at, say, coordinates (X=10, Y=10, Z=0) with rotation (X=0, Y=0, Z=90). The car itself is composed of many objects, such as wheels, chassis, steering wheel, windshield, gearbox, engine, etc. Each of these objects have their own transformation matrix that define their location and orientation in comparison to the parent matrix (center of the car). For instance, the center of the front-left wheel is located at (X=-1.5, Y=2, Z=-0.3). The world coordinates of the front-left wheel can be obtained by multiplying its transformation matrix by that of its parent.
Some of these objects might have their own set of children. For instance, the wheel is composed of a tire, a rim, a hub, nuts, etc. Each of these children have their own transformation matrices.
Using this hierarchy of coordinate frames makes it very easy to locate and visualize objects regardless of their pose and orientation or viewpoint. When you want to render an object, each triangle in the 3D object is multiplied by its transformation matrix and that of its parents. It is then oriented with the viewpoint (another matrix multiplication) and then transformed to screen coordinates before being rasterized into pixels.
“If you say [to someone working in computer graphics], ‘Could you show me that from another angle,’ they won’t say, ‘Oh, well, I’d like to, but we didn’t train from that angle so we can’t show it to you from that angle.’ They just show it to you from another angle because they have a 3D model and they model a spatial structure as the relations between parts and wholes and those relationships don’t depend on viewpoint at all,” Hinton says. “I think it’s crazy not to make use of that beautiful structure when dealing with images of 3D objects.”
Capsule networks, Hinton’s ambitious new project, try to do inverse computer graphics. While capsules deserve their own separate set of articles, the basic idea behind them is to take an image, extract its objects and their parts, define their coordinate frames, and create a modular structure of the image.
Capsule networks are still in the works, and since their introduction in 2017, they have undergone several iterations. But if Hinton and his colleagues succeed to make them work, we will be much closer to replicating the human vision.