Human-level performance. Human-level accuracy. Those are terms you hear a lot from companies developing artificial intelligence systems, whether it’s facial recognition, object detection, or question answering. And to their credit, recent years have seen many great products powered by AI algorithms, mostly thanks to advances in machine learning and deep learning.
But many of these comparisons only take into account the end result of testing the deep learning algorithms on limited data sets. This approach can create false expectations about AI systems and yield dangerous results when they are entrusted with critical tasks.
In a recent study, a group of researchers from various German organizations and universities has highlighted the challenges of evaluating the performance of deep learning in processing visual data. In their paper, titled “The Notorious Difficulty of Comparing Human and Machine Perception,” the researchers point out the problems in current methods that compare deep neural networks and the human vision system.
In their research, the scientists conducted a series of experiments that dig beneath the surface of deep learning results and compare them to the workings of the human vision system. Their findings are a reminder that we must be cautious when comparing AI to humans, even if it shows equal or better performance on the same task.
The complexity of human and computer vision
In the seemingly endless quest to reconstruct human perception, the field that has become known as computer vision, deep learning has so far yielded the most favorable results. Convolutional neural networks (CNNs), an architecture often used in computer vision deep learning algorithms, are accomplishing tasks that were extremely difficult with traditional software.
However, comparing neural networks to human perception remains a challenge. This is partly because we still have a lot to learn about the human vision system and the human brain in general. The complex workings of deep learning systems also compound the problem. Deep neural networks work in very complicated ways that often confound their own creators.
In recent years, a body of research has tried to evaluate the inner workings of neural networks and their robustness in handling real-world situations. “Despite a multitude of studies, comparing human and machine perception is not straightforward,” the German researchers write in their paper.
In their study, the scientists focused on three areas to gauge how humans and deep neural networks process visual data.
How do neural networks perceive contours?
The first test involves contour detection. In this experiment, both humans and AI participants must say whether an image contains a closed contour or not. The goal here is to understand whether deep learning algorithms can learn the concept of closed and open shapes, and whether they can detect them under various conditions.
“For humans, a closed contour flanked by many open contours perceptually stands out. In contrast, detecting closed contours might be difficult for DNNs as they would presumably require a long-range contour integration,” the researchers write.
For the experiment, the scientists used the ResNet-50, a popular convolutional neural network developed by AI researchers at Microsoft. They used transfer learning to finetune the AI model on 14,000 images of closed and open contours.
They then tested the AI on various examples that resembled the training data and gradually shifted in other directions. The initial findings showed that a well-trained neural network seems to grasp the idea of a closed contour. Even though the network was trained on a dataset that only contained shapes with straight lines, it also performed well on curved lines.
“These results suggest that our model did, in fact, learn the concept of open and closed contours and that it performs a similar contour integration-like process as humans,” the scientists write.
However, further investigation showed that other changes that didn’t affect human performance degraded the accuracy of the AI model’s results. For instance, changing the color and width of the lines caused a sudden drop in the accuracy of the deep learning model. The model also seemed to struggle with detecting shapes when they became larger than a certain size.
The neural network was also very sensitive to adversarial perturbations, carefully crafted changes that are imperceptible to the human eye but cause disruption in the behavior of machine learning systems.
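To make the idea of an adversarial perturbation concrete, here is a minimal sketch on a toy linear classifier, in the spirit of the fast gradient sign method. It is not the attack or the model used in the paper: the eight-pixel "image," the weights, the bias, and the `epsilon` budget are all made-up values chosen for illustration.

```python
def score(weights, bias, pixels):
    """Linear decision score: positive means 'closed', negative means 'open'."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def adversarial_example(weights, pixels, epsilon):
    """Move every pixel by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to the input
    is simply the weight vector, so the step direction is sign(weight).
    """
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


# Hypothetical toy model and input (assumptions, not values from the paper).
weights = [0.9, -0.7, 0.8, -0.6, 0.9, -0.8, 0.7, -0.9]
bias = -0.3
image = [0.5, 0.4, 0.5, 0.4, 0.5, 0.4, 0.5, 0.4]

clean_score = score(weights, bias, image)          # positive: "closed"
perturbed = adversarial_example(weights, image, epsilon=0.05)
perturbed_score = score(weights, bias, perturbed)  # negative: "open"

print("clean:", clean_score, "perturbed:", perturbed_score)
```

No pixel changes by more than 0.05, yet the predicted class flips. Many tiny changes that all push in the same direction add up, which is why such perturbations can stay invisible to a human observer while derailing the model.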
To further investigate the decision-making process of the AI, the scientists used a Bag-of-Feature network, a technique that tries to localize the bits of data that contribute to the decision of a deep learning model. The analysis showed that “there do exist local features such as an endpoint in conjunction with a short edge that can often give away the correct class label,” the researchers write.
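The following toy sketch shows why a purely local feature can be enough to solve the task. It is not the Bag-of-Feature network from the paper. It assumes a simplified, hypothetical representation in which a contour is a list of line segments between grid points, and it labels a shape "closed" whenever it finds no loose ends, without ever tracing the contour as a whole.

```python
from collections import Counter


def endpoint_count(segments):
    """Count points that are touched by exactly one segment (loose ends)."""
    degree = Counter()
    for start, end in segments:
        degree[start] += 1
        degree[end] += 1
    return sum(1 for d in degree.values() if d == 1)


def looks_closed(segments):
    """Local shortcut: call a contour closed if it has no loose ends."""
    return endpoint_count(segments) == 0


# A unit square drawn as four segments, and the same square with one side missing.
square = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1)), ((0, 1), (0, 0))]
open_square = square[:-1]

print(looks_closed(square))       # True
print(looks_closed(open_square))  # False
```

A classifier that relies on this kind of shortcut can score well on the test set and still have no notion of global contour integration, which is the caution the researchers raise.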
Can machine learning reason about images?
The second experiment tested the abilities of deep learning algorithms in abstract visual reasoning. The data used for the experiment is based on the Synthetic Visual Reasoning Test (SVRT), in which the AI must answer questions that require understanding of the relations between different shapes in the picture. The tests include same-different tasks (e.g., are two shapes in a picture identical?) and spatial tasks (e.g., is the smaller shape in the center of the larger shape?). A human observer would easily solve these problems.
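The two task families can be written down as simple ground-truth rules. The sketch below is an illustration only, not the SVRT generator: it assumes shapes are given as sets of `(row, col)` pixel coordinates, treats "identical" as "identical up to translation," and uses bounding-box centers with a hypothetical `tolerance` for the spatial task.

```python
def normalize(shape):
    """Translate a shape so the top-left corner of its bounding box is (0, 0)."""
    min_row = min(r for r, _ in shape)
    min_col = min(c for _, c in shape)
    return frozenset((r - min_row, c - min_col) for r, c in shape)


def same_shape(shape_a, shape_b):
    """Same-different task: are the two shapes identical up to translation?"""
    return normalize(shape_a) == normalize(shape_b)


def bbox_center(shape):
    """Center of the shape's bounding box as (row, col)."""
    rows = [r for r, _ in shape]
    cols = [c for _, c in shape]
    return ((min(rows) + max(rows)) / 2, (min(cols) + max(cols)) / 2)


def is_centered(small, large, tolerance=0.5):
    """Spatial task: does the smaller shape sit at the center of the larger one?"""
    (small_r, small_c), (large_r, large_c) = bbox_center(small), bbox_center(large)
    return abs(small_r - large_r) <= tolerance and abs(small_c - large_c) <= tolerance


# Same-different examples: an L shape, a shifted copy of it, and a horizontal bar.
l_shape = {(0, 0), (1, 0), (2, 0), (2, 1)}
shifted_l = {(5, 7), (6, 7), (7, 7), (7, 8)}
bar = {(0, 0), (0, 1), (0, 2), (0, 3)}

# Spatial examples: the outline of a 7x7 box with a dot inside it.
frame = {(r, c) for r in range(7) for c in range(7) if r in (0, 6) or c in (0, 6)}
center_dot = {(3, 3)}
corner_dot = {(1, 1)}

print(same_shape(l_shape, shifted_l), same_shape(l_shape, bar))         # True False
print(is_centered(center_dot, frame), is_centered(corner_dot, frame))   # True False
```

The rules are trivial to state in code. The hard part for a neural network is discovering them from labeled pixels alone.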
For their experiment, the researchers used the ResNet-50 and tested how it performed with different sizes of training dataset. The results show that a pretrained model finetuned on 28,000 samples performs well on both same-different and spatial tasks. (Previous experiments trained a very small neural network on a million images.) The performance of the AI dropped as the researchers reduced the number of training examples, but the degradation was faster in same-different tasks.
“Same-different tasks require more training samples than spatial reasoning tasks,” the researchers write, adding, “this cannot be taken as evidence for systematic differences between feed-forward neural networks and the human visual system.”
The researchers note that the human visual system is naturally pre-trained on large amounts of abstract visual reasoning tasks. This makes it unfair to test the deep learning model on a low-data regime, and it is almost impossible to draw solid conclusions about differences in the internal information processing of humans and AI.
“It might very well be that the human visual system trained from scratch on the two types of tasks would exhibit a similar difference in sample efficiency as a ResNet-50,” the researchers write.
Measuring the recognition gap of deep learning
The recognition gap is one of the most interesting tests of visual systems. Consider the following image. Can you tell what it is without scrolling further down?
Below is the zoomed-out view of the same image. There’s no question that it’s a cat. If I showed you a close-up of another part of the image (perhaps the ear), you might have a greater chance of predicting what was in the image. We humans need to see a certain amount of overall shapes and patterns to be able to recognize an object in an image. The more you zoom in, the more features you’re removing, and the harder it becomes to distinguish what is in the image.
Deep learning systems also operate on features, but they work in subtler ways. Neural networks sometimes find minuscule features that are imperceptible to the human eye but remain detectable even when you zoom in very closely.
In their final experiment, the researchers tried to measure the recognition gap of deep neural networks by gradually zooming in on images until the accuracy of the AI model started to degrade considerably.
Previous experiments show a large difference between the image recognition gap in humans and deep neural networks. But in their paper, the researchers point out that most previous tests on neural network recognition gaps are based on human-selected image patches. These patches favor the human vision system.
When they tested their deep learning models on “machine-selected” patches, the researchers obtained results that showed a similar gap in humans and AI.
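A machine-selected search of this kind can be sketched as a greedy loop: start from a crop the model recognizes, keep following the sub-crop the model is most confident about, and stop when every sub-crop falls below a recognition threshold. The code below is a rough illustration under stated assumptions, not the procedure from the paper. The four-corner crop rule, the 0.5 threshold, and the toy `classify` function (which simply reports the fraction of "evidence" pixels still visible) are all hypothetical.

```python
def corner_crops(top, left, size, step=1):
    """Four corner sub-crops, each `step` pixels smaller than the parent crop."""
    small = size - step
    return [(top, left, small), (top, left + step, small),
            (top + step, left, small), (top + step, left + step, small)]


def recognition_gap(classify, start, threshold=0.5, min_size=1):
    """Greedy machine-selected search for the smallest recognizable crop.

    Returns (minimal_crop, gap), where gap is the confidence on the minimal
    recognizable crop minus the best confidence among its sub-crops.
    gap is None if the crop never becomes unrecognizable.
    """
    crop, confidence = start, classify(*start)
    if confidence < threshold:
        raise ValueError("the starting crop must be recognizable")
    while crop[2] > min_size:
        scored = [(classify(*child), child) for child in corner_crops(*crop)]
        best_confidence, best_child = max(scored, key=lambda pair: pair[0])
        if best_confidence < threshold:
            return crop, confidence - best_confidence
        crop, confidence = best_child, best_confidence
    return crop, None


def make_classifier(evidence):
    """Toy stand-in for a model: confidence = share of evidence pixels in view."""
    total = sum(map(sum, evidence))

    def classify(top, left, size):
        visible = sum(evidence[r][c]
                      for r in range(top, top + size)
                      for c in range(left, left + size))
        return visible / total

    return classify


# A 6x6 "image" whose only informative pixels form a 2x2 block in the middle.
evidence = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
minimal_crop, gap = recognition_gap(make_classifier(evidence), start=(0, 0, 6))
print(minimal_crop, gap)  # (2, 2, 2) 0.75
```

In this toy case the search homes in on the 2x2 block, and the gap of 0.75 is the drop in confidence between that crop and its best single-pixel sub-crop. Because the model itself chooses which patch to follow at every step, the patches are not biased toward what a human would find informative.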
“These results highlight the importance of testing humans and machines on the exact same footing and of avoiding a human bias in the experiment design,” the researchers write. “All conditions, instructions and procedures should be as close as possible between humans and machines in order to ensure that all observed differences are due to inherently different decision strategies rather than differences in the testing procedure.”
Closing the gap between AI and human intelligence
As our AI systems become more complex, we will have to develop more complex methods to test them. Previous work in the field shows that many of the popular benchmarks used to measure the accuracy of computer vision systems are misleading. The work by the German researchers is one of many efforts that attempt to measure artificial intelligence and better quantify the differences between AI and human intelligence. And they draw conclusions that can provide directions for future AI research.
“The overarching challenge in comparison studies between humans and machines seems to be the strong internal human interpretation bias,” the researchers write. “Appropriate analysis tools and extensive cross checks – such as variations in the network architecture, alignment of experimental procedures, generalization tests, adversarial examples and tests with constrained networks – help rationalizing the interpretation of findings and put this internal bias into perspective. All in all, care has to be taken to not impose our human systematic bias when comparing human and machine perception.”