Unless you’ve been living under a rock, you know that artificial intelligence (AI) is everywhere. It’s infiltrated almost every aspect of our private and professional lives. From healthcare to transportation, AI aims to redefine how information is collected, integrated, and analyzed, ultimately leading to more informed insights and better outcomes. But for all its hype, the full promise of AI rarely comes to fruition because of one four-letter word: “data.”
While the AI story is all the rage, the data narrative is not as prominently discussed. Sure, data may not be as sexy as the automated systems that can learn and process information faster than a human, but it is just as important. And don’t get me wrong: we all know that AI requires vast amounts of data to continually learn and identify patterns that humans can’t. After all, it’s the ability to process this information and make instant decisions that has made AI such a game changer for industries that rely on massive volumes of data.
But the real story is not about the algorithms powering the AI revolution; it’s about the quality of the data powering these systems. What enterprises really need as they develop their AI strategy is to integrate, clean, link, and supplement their data so they have an accurate foundation on which to build and train their machine learning algorithms. For many organizations, the effort this preparation requires makes AI difficult, if not impossible.
“Data-related challenges are a top reason (our) clients have halted or canceled artificial-intelligence projects,” said IBM’s senior vice president of cloud and cognitive software, Arvind Krishna, speaking at The Wall Street Journal’s Future of Everything Festival. He’s certainly not alone in his assessment. According to a report by MIT Technology Review, insufficient data quality was one of the biggest challenges to employing AI. What’s more, 85% of AI projects will “not deliver” for organizations, according to research and advisory company Gartner.
Companies need to think of AI and machine learning as the engines that will drive the amazing things they want to accomplish. But like every engine, they need the right fuel to run well.
Enter Data Annotation
Data annotation (also referred to as data labeling) is critical to ensuring your AI and machine learning projects can scale. It provides the initial setup for training a machine learning model: what the model needs to understand, and how to discriminate between various inputs to come up with accurate outputs.
There are many types of data annotation, depending on the form the data takes. They range from image and video annotation to text categorization, semantic annotation, and content categorization. Humans are needed to identify and annotate specific data so machines can learn to identify and classify information. Without these labels, the machine learning algorithm will have a difficult time computing the necessary attributes.
The unfortunate reality is that annotation is still a very manual, labor-intensive process. Tools for annotation are getting better, but the gap between an ill-designed tool and an intuitive one makes a significant difference in annotation productivity. According to some estimates, 80% of AI project time is currently spent on data preparation. And even small errors in the data could prove to be disastrous. In this area, humans actually have a leg up on machines. We are simply better than computers at managing subjectivity, understanding intent, and coping with ambiguity, all of which are important factors in data annotation.
Regardless of modality, the vast majority of problems that AI models are being built to address fit into one (or more) of the annotation tasks below:
- Sequencing: text or time series in which each annotation has a start (left boundary), an end (right boundary), and a label (e.g., recognize the name of a person in a text, identify a paragraph discussing penalties in a contract)
- Categorization: binary classes, multiple classes, one label, multiple labels, flat, hierarchical, or ontological (e.g., categorize a book according to the BISAC ontology, categorize an image as offensive or not offensive)
- Segmentation: find paragraph splits, find an object in an image, find transitions between speakers or between topics, etc. (e.g., spot objects and people in a picture, find the transition between topics in a news broadcast)
- Mapping: language-to-language, full text to summary, question to answer, raw data to normalized data (e.g., translate from French to English, normalize a date from free text to standard format)
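To make the four task types concrete, here is a minimal sketch of how each kind of annotation might be stored. The class names, field names, and sample sentence are all hypothetical and chosen only for illustration; real annotation tools define their own formats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanAnnotation:
    """Sequencing: a labeled span with a left and a right boundary."""
    start: int  # left boundary (inclusive character offset)
    end: int    # right boundary (exclusive character offset)
    label: str


@dataclass
class CategoryAnnotation:
    """Categorization: one or more labels applied to a whole item."""
    labels: List[str]


@dataclass
class SegmentAnnotation:
    """Segmentation: offsets where one segment ends and the next begins."""
    boundaries: List[int]


@dataclass
class MappingAnnotation:
    """Mapping: a source value paired with its target value."""
    source: str
    target: str


# Hypothetical example sentence, annotated by a human.
text = "Jane Doe signed the contract on March 3rd, 2020."

span = SpanAnnotation(start=0, end=8, label="PERSON")
category = CategoryAnnotation(labels=["legal", "contract"])
segments = SegmentAnnotation(boundaries=[0, len(text)])
mapping = MappingAnnotation(source="March 3rd, 2020", target="2020-03-03")

print(text[span.start:span.end])  # prints "Jane Doe"
```

The point of the sketch is that every task reduces to a small, explicit record that a person fills in and a model later learns from.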
Usually, complex problems can be solved as a sequence or a combination of these tasks. For example, when you unlock your phone with face identification, machine learning is used to spot your nose and eyes (segmentation) and categorize the face as you or not-you (categorization). When you talk to Alexa or Siri, machine learning is used to map your voice to words (mapping), recognize sequences such as an instruction or the name of a song (sequencing), and decide whether to play music, tell you the weather, etc. (categorization).
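The voice assistant example can be sketched as a chain of small steps. In the toy code below, simple string rules stand in for trained models, and the speech-to-text (mapping) step is assumed to have already produced the text. The function names, intent labels, and the `SONG_TITLE` label are hypothetical, and this is not how any real assistant is implemented.

```python
from typing import Optional, Tuple


def categorize_intent(utterance: str) -> str:
    """Categorization stand-in: assign one intent label to the whole utterance."""
    lowered = utterance.lower()
    if lowered.startswith("play "):
        return "play_music"
    if "weather" in lowered:
        return "tell_weather"
    return "unknown"


def find_song_span(utterance: str) -> Optional[Tuple[int, int, str]]:
    """Sequencing stand-in: return (start, end, label) for the song title."""
    prefix = "play "
    if utterance.lower().startswith(prefix):
        return (len(prefix), len(utterance), "SONG_TITLE")
    return None


# Assumed output of the speech-to-text (mapping) step.
utterance = "Play Yesterday"

intent = categorize_intent(utterance)
span = find_song_span(utterance)
if span is not None:
    start, end, label = span
    print(intent, label, utterance[start:end])  # prints "play_music SONG_TITLE Yesterday"
```

In a real system, each stand-in would be a model trained on human annotations of the matching type: labeled intents for the categorization step, labeled spans for the sequencing step.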
At the end of the day, even the most technically advanced algorithm cannot solve a problem without the right data. We know having access to data is valuable, but having access to data with a learnable ‘signal’ consistently added at massive scale is the biggest competitive advantage nowadays. That’s the power of data annotation.