In the next ten years, computer vision will make huge strides. In this article, we take a look at the trends and breakthroughs of the 2010s and what we can expect as we enter the 2020s.
I. A Short History of Computer Vision
Throughout the 80s, 90s, and 00s, computer vision was notoriously difficult. Even mediocre performance in a research lab was applauded by the community. And rightfully so – in those days the features used to train machine learning systems on visual tasks were designed by hand, in a process known as feature engineering.
What is feature engineering, you ask? It means we used our “expert” human intuition to design special tricks that would pick up on specific patterns within an image and turn them into useful features for a learning machine. Over the years we accumulated many of these tricks, each with its own acronym: HOG, SIFT, ORB and even SURF. The unfortunate reality, however, was that solving real-world problems required a carefully curated blend of these tricks. What you used to detect the divider line on a road was not what you would use to recognize and distinguish faces. The dream of building general-purpose systems remained distant.
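To make this concrete, here is a minimal sketch of what the feature-engineering era looked like in practice, using OpenCV's ORB detector (one of the acronyms above). The image path is a placeholder, and this is purely illustrative rather than a recipe from any particular system:

```python
# Hand-engineered features: ORB keypoints and descriptors via OpenCV.
# Nothing here is learned from data.
# Assumes opencv-python is installed; "road.jpg" is a placeholder image.
import cv2

image = cv2.imread("road.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(image, None)

# Each keypoint is a hand-designed "interesting point"; each descriptor is a
# fixed, human-designed summary of the pixels around it.
print(f"Found {len(keypoints)} keypoints; descriptor matrix shape: {descriptors.shape}")
```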
II. Moving Beyond Feature Engineering
This drastically changed in the early 2010s, when we saw the biggest revolution in computer vision since the invention of computers themselves.
In 2012, a computer vision model known as AlexNet beat its closest competitor at the ImageNet Large Scale Visual Recognition Challenge by more than ten percentage points in top-5 error. The world was shocked. The most amazing thing about it: the model used no hand-engineered features. Instead, it relied on a general purpose learning system known as a neural network. AlexNet’s breakthrough was to use GPUs (Graphics Processing Units) to train the model significantly faster and for longer: AlexNet was trained over roughly six days on two consumer-grade GPUs. For comparison, OpenAI’s GPT-3, released in 2020, was trained on the equivalent of roughly 355 GPU-years of compute, at an estimated cost of about $4,600,000.
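As a sense of how commoditized that breakthrough has since become, a pretrained AlexNet can now be loaded and run in a few lines with torchvision. This is a sketch assuming a recent torch/torchvision install; the image path is a placeholder:

```python
# Running a pretrained AlexNet for ImageNet classification with torchvision.
# Assumes torch, torchvision and Pillow are installed; "cat.jpg" is a placeholder.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

x = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = model(x)
print("Predicted ImageNet class index:", int(logits.argmax(dim=1)))
```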
Since AlexNet, we have continued to add data points showing a clear pattern: the bigger the dataset, the bigger the model, and the longer we train, the better the learned features become. For the first time, we can see a clear path to the general purpose intelligent systems we have always dreamed of.
III. The Roll-Out: Transformers, Mobilize
More recently, in the last couple of years, we have seen a new breakthrough in vision algorithms with the emergence of transformers over convolutional networks.
Transformers are a deep learning architecture built around an encoder and a decoder, and they have been popular in natural language processing (NLP) for some time now. Papers such as DETR, out of Facebook’s AI Research group, made waves when they showed that transformers could match state-of-the-art performance on vision tasks such as object detection.
Transformers are simpler to implement than the currently popular computer vision algorithms (such as Mask R-CNN) and represent yet another step toward less human engineering in computer vision. The less time we spend developing and tuning these algorithms, the more we can tackle increasingly complex tasks, and the more accessible computer vision becomes to more people.
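To back up the claim about simplicity, here is roughly how little code it takes to run a pretrained DETR model via torch.hub. This is a sketch assuming a working torch/torchvision install and internet access; the image path is a placeholder:

```python
# Object detection with DETR (a transformer encoder-decoder) loaded from torch.hub.
# Assumes torch, torchvision and Pillow are installed; "street.jpg" is a placeholder.
import torch
import torchvision.transforms as T
from PIL import Image

# DETR with a ResNet-50 backbone, pretrained on COCO.
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

preprocess = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

x = preprocess(Image.open("street.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    outputs = model(x)

# DETR predicts a fixed set of object "queries": class logits and bounding boxes.
print(outputs["pred_logits"].shape, outputs["pred_boxes"].shape)
```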
A huge ramification of this as we move into the next decade will be the opportunity to create transformer-friendly hardware that works for both vision and NLP tasks. Right now, there is much debate as to whether intelligent devices (IoT cameras, Alexa and Google Home devices, etc.) will perform inference in the cloud or directly on the device itself. Is that little device just a dumb sensor sending signals to a specialized brain in the cloud, or is it a little general purpose learner using its own silicon to recognize your face and listen to your commands? The latter is likely preferable to privacy advocates, since the data never leaves the device. A more homogeneous landscape of model architectures will also have repercussions for whether the edge beats the cloud.
IV. Data Power and Synthetic Data for Computer Vision
We’ve talked about algorithms and hardware. We now turn to the most important piece of the AI puzzle: data.
The historical trends show us two things: one, algorithms are becoming more generic, and two, the guard rails of human engineering are shrinking. The consequence is that the performance of a computer vision system depends more and more on the data used to train it. This should not come as a surprise: we can all see the tech giants amassing huge datasets.
However, amassing huge datasets is not, on its own, the answer to more powerful AI, because these datasets, whether scraped from the internet or painstakingly staged and captured in house, are not the best way to train more generic autonomous algorithms. This “real data” lets all the bias of the real world inevitably creep into computer vision algorithms. Further, real data is not easily fed into training: it needs to be cleaned, labelled, annotated, and corrected.
So, we find ourselves poised at the precipice of a technological turn as significant as the introduction of neural nets and transformers. Data is the big hurdle holding back computer vision, and the solution, we would argue, is synthetic data. A quick definition: synthetic data is data generated by a computer (think video games or the CGI you see in movies). Full control over this virtual world means pixel-perfect labels (think metadata such as which pixels correspond to a face in an image), including labels that would be impossible to obtain in real-world datasets.
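To make “pixel-perfect labels” concrete, here is a toy sketch that renders a simple scene and gets its segmentation mask for free. The scene, class ids, and sizes are invented purely for illustration; it uses only NumPy and Pillow:

```python
# A toy illustration of pixel-perfect labels: because we render the scene
# ourselves, the segmentation mask is known exactly, at no labelling cost.
import numpy as np
from PIL import Image, ImageDraw

WIDTH, HEIGHT = 128, 128

def render_synthetic_sample(rng: np.random.Generator):
    image = Image.new("RGB", (WIDTH, HEIGHT), color=(30, 30, 30))
    mask = Image.new("L", (WIDTH, HEIGHT), color=0)  # 0 = background
    draw_img, draw_mask = ImageDraw.Draw(image), ImageDraw.Draw(mask)

    # Place a circular "object" (class id 1) at a random position.
    x, y, r = rng.integers(20, 108), rng.integers(20, 108), rng.integers(8, 20)
    box = (x - r, y - r, x + r, y + r)
    draw_img.ellipse(box, fill=(200, 60, 60))
    draw_mask.ellipse(box, fill=1)  # the label is exact by construction

    return np.array(image), np.array(mask)

rng = np.random.default_rng(0)
image, mask = render_synthetic_sample(rng)
print(image.shape, mask.shape, "labelled pixels:", int((mask == 1).sum()))
```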
Synthetic data is still in its early days. Much like feature engineering before the 2010s, each synthetic dataset is currently designed by hand using human intuition. But as we speak (or read, as it were), startups (including us!) are building the systems that will allow us to generate infinite streams of synthetic data designed by learning systems themselves.
Automated synthetic data generation, or as we like to think of it, the advent of a generative platform for synthetic data sets, will be a game changer for computer vision. A decade from now, computer vision algorithms will be constantly improving through a process known as lifelong learning. The model will recognize its weaknesses, generate new synthetic data for that weakness, and train on that dataset. The best part: this will all be automated. An invisible process running on hordes of GPUs somewhere in the cloud.
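In deliberately hypothetical Python, the loop described above might look something like this. Every helper here is a stub standing in for an evaluation suite, a generative synthetic-data platform, and a training job; none of them is a real, existing API:

```python
# A hypothetical sketch of the lifelong-learning loop described above.
# The three helpers are stubs, not real APIs.

def evaluate_and_find_weaknesses(model):
    # Stub: in reality this would score the model on a held-out evaluation
    # suite and return the classes or conditions where it performs worst.
    return ["pedestrians at night", "occluded faces"]

def generate_synthetic_dataset(weaknesses):
    # Stub: a generative platform would render targeted scenes, with
    # pixel-perfect labels, for each weakness.
    return [f"synthetic scene for: {w}" for w in weaknesses]

def train(model, dataset):
    # Stub: fine-tune the model on the newly generated data.
    return model

def lifelong_learning_loop(model, iterations: int = 3):
    for _ in range(iterations):
        weaknesses = evaluate_and_find_weaknesses(model)
        dataset = generate_synthetic_dataset(weaknesses)
        model = train(model, dataset)
    return model

model = lifelong_learning_loop(model="vision-model-v1")
```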
That’s what we can expect as we enter the 2020s: it will be about data, and more specifically, synthetic data. This is what will unlock and enable more complex computer vision.