Successful machine learning projects need good data, but in many projects obtaining it is already the first major hurdle. For example, image data from a camera may already be available, but nothing machine-readable has been recorded so far about what should be recognized automatically in the future. Especially if you are just getting started with AI, the upfront work of data preparation can be overwhelming. This is particularly discouraging when all you want is a first proof of concept for a narrowly defined area.
So you need a solution that lets you label your own raw data quickly and easily. We took a look at the tooling jungle and collected what the market has to offer and what to look out for. Let’s look at the important questions to ask yourself before you start labeling, and at the tools that are available to you.
Who should label the data?
Labeling is a very tedious task, especially if you want to provide enough data for a deep learning algorithm. If you don’t believe this, imagine marking defects on components in hundreds of images with pixel-perfect accuracy. Unfortunately, a certain amount of labeled data is absolutely necessary. So where do you get this data from? Basically, there are the following scenarios:
- Labeling of the data by the domain expert: When it comes to your own data, you as the expert naturally know best how it should be labeled. At the start, you should decide on a good format for storing the labels and set up a workflow with suitable tooling so that a data set can be labeled as quickly as possible.
- Labeling by the AI service provider: After suitable instruction and explanation, your AI partner can certainly do the labeling as well. The advantage is that the partner gains a good understanding of the underlying data and its classes along the way. However, this approach will be too expensive in most cases, since labeling itself is a comparatively simple task.
- Third-party labeling: It is also possible to outsource the work completely. Cloud service providers such as Google also offer their own services for this purpose (https://cloud.google.com/ai-platform/data-labeling/docs). However, only certain formats (images, video, and text) are supported and very specific instructions must be formulated to achieve the right result. The documentation suggests doing several test runs until it works. The data itself must of course also be transferred to the cloud.
In the following, we will focus more on the first two scenarios, as in our experience they occur most frequently.
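To make the storage-format decision from the first scenario more concrete, here is a minimal sketch that keeps one label record per line in a JSON Lines file. The field names (`file`, `label`, `annotator`), the example records, and the file name are assumptions made for illustration. None of the tools discussed here prescribes this format.

```python
import json
import tempfile
from pathlib import Path


def save_labels(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_labels(path):
    """Read the records back, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records: file names, class names, and annotators are made up.
example_records = [
    {"file": "part_0001.png", "label": "ok", "annotator": "expert_a"},
    {"file": "part_0002.png", "label": "scratch", "annotator": "expert_a"},
]

if __name__ == "__main__":
    # Round trip through a temporary folder so nothing is left behind.
    with tempfile.TemporaryDirectory() as folder:
        target = Path(folder) / "labels.jsonl"
        save_labels(example_records, target)
        print(load_labels(target))
```

A line-based text format like this is easy to diff, append to, and convert later, which can help if the labeling tool changes during the project.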
What kind of data can be labeled?
Object recognition in images is the first thing that comes to mind when thinking about labeled data and AI, but a wide variety of data types from different applications can be enriched with labels. The labels themselves can also be applied at different levels. For example, you can label an entire image, draw boxes in it and label them (bounding boxes), or even assign classes with pixel accuracy (semantic segmentation).
Besides images, other data types are also well suited for labeling. The most relevant ones, images included, are:
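The three levels of image labeling can be sketched as plain data structures. This is an illustration only: the class names, the tiny image size, and the `(x_min, y_min, x_max, y_max)` box convention are assumptions, and real tools each define their own export formats.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates; x_max and y_max are exclusive."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    label: str


def box_to_mask(box, width, height, class_id):
    """Rasterize a box into a per-pixel class mask (0 = background)."""
    mask = [[0] * width for _ in range(height)]
    for y in range(max(0, box.y_min), min(height, box.y_max)):
        for x in range(max(0, box.x_min), min(width, box.x_max)):
            mask[y][x] = class_id
    return mask


# Level 1: one label for the whole image (hypothetical file and class).
image_label = {"file": "part_0001.png", "label": "defective"}

# Level 2: a labeled box around the region of interest.
defect_box = BoundingBox(x_min=1, y_min=1, x_max=3, y_max=2, label="scratch")

# Level 3: one class id per pixel. Here the mask is derived from the box for
# brevity; a real segmentation mask would follow the defect's actual outline.
pixel_mask = box_to_mask(defect_box, width=4, height=3, class_id=1)

if __name__ == "__main__":
    for row in pixel_mask:
        print(row)
```

The effort per image grows from level to level, so it is worth checking which level the planned model actually needs before labeling starts.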
- Time series: This includes, for example, the recordings from a machine control system or the sensor values that are saved during a production process.
- Audio: Data from the audio area can be used to identify speech or to analyze recordings picked up by microphones in any process.
- Text: Interesting for all use cases from the area of natural language processing, such as chatbots or intelligent document management systems.
- Images: Data from cameras or optical sensors in general, which should support e.g. quality control at the end of production processes.
- Videos: Video recordings can stem from surveillance cameras and could for example be used to increase the safety of machine operation.
- 3D data: It is also conceivable that, for example, parts of a manufacturing model need to be provided with labels.
As we will see later, tooling supports the different data types to varying degrees. However, besides the functional requirements, there are also other general conditions to consider.
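For non-image data, the labels often look different. In time series, for instance, a label is typically attached to an interval rather than to a single value. The following sketch expands interval labels into one label per sample; the sensor values, interval boundaries, and class names are invented for illustration.

```python
def expand_interval_labels(n_samples, intervals, default="normal"):
    """Turn (start, end, label) intervals into one label per sample.

    Intervals use sample indices with an exclusive end; later intervals
    overwrite earlier ones where they overlap.
    """
    labels = [default] * n_samples
    for start, end, label in intervals:
        for index in range(max(0, start), min(n_samples, end)):
            labels[index] = label
    return labels


# Hypothetical sensor readings and one annotated anomaly.
sensor_values = [20.1, 20.3, 35.7, 36.2, 20.2, 20.0]
annotated_intervals = [(2, 4, "overheating")]

sample_labels = expand_interval_labels(len(sensor_values), annotated_intervals)

if __name__ == "__main__":
    for value, label in zip(sensor_values, sample_labels):
        print(f"{value:5.1f}  {label}")
```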
What are the further requirements for a good labeling tool?
If you are working with an AI service provider and have sensitive company data, some further considerations are also relevant:
- License compliance: When using an external tool, the license must allow us to pass the tool on to customers for a limited time if they want to do the labeling themselves. Conversely, the same question arises for the customer if the AI service provider is to support the labeling process.
- Data security: If possible, we would like to avoid a cloud-based solution since we often handle sensitive data and do not want it to end up unnecessarily on the servers of the label suppliers.
- Comfort: The tool should be intuitive to use, even for employees with little technical experience. This aspect also includes the time investment: using the selected tool must always be faster than doing the work ‘manually’. The tool must also be easy to set up technically and be usable in as many environments as possible.
- Use case coverage: Ideally, one tool covers what would otherwise take five. It should, for example, support image segmentation but also handle time series classification.
- Costs: The tool should not exceed the available budget or time frame. In most cases, this consideration comes down to weighing free tools against the time saved by paid solutions.
Given those criteria, we are ready to have a look at the offered solutions!
What tools are there?
We looked around to get an impression of the existing solutions on the market. We considered both commercial and freely available software and tried to understand which use cases they cover. The commercial solutions are mostly cloud-based and offer additional functions besides labeling, such as training AI algorithms at the same time or providing an external workforce. Most freely available alternatives require a command-line installation and are not available as out-of-the-box solutions (if you want to host them yourself).
This table is far from complete and is rather intended to give an overview of the current spectrum of solutions. We noticed that there is no real canonical solution for labeling 3D data (except for KNOSSOS, which is a specialized tool for tissue data). So for your technical 3D data, you would have to do the labeling yourself with the tool of your choice (e.g. AutoCAD, Blender, …) and export it to the corresponding files.
All beginnings are difficult, and preparing raw data with labels for an initial AI project is no exception. However, you can rely on an ever-growing landscape of tools and support. For the first steps in this field, Label-Studio convinced us the most, because it is quick to install and easy to use. It also offers very broad support for different data types, as well as advanced workflows if the need arises. We hope this article has given you a little insight into the world of labeling and enables you to take the next step on your personal AI journey. So don’t be shy and let’s go, from collecting data to labeling!
Here is the collection of links to the tools we covered:
- Hasty https://hasty.ai/solution.html
- DataGym https://www.datagym.ai/
- Labelbox https://labelbox.com/
- Google Cloud AI Labeling https://cloud.google.com/ai-platform/data-labeling/docs
- Cloudfactory https://www.cloudfactory.com/data-labeling
- UltimateLabeling https://github.com/alexandre01/UltimateLabeling
- labelme https://github.com/wkentaro/labelme
- labelImg https://github.com/tzutalin/labelImg
- Label-Studio https://github.com/heartexlabs/label-studio
- Curve https://github.com/baidu/Curve
- ELAN https://archive.mpi.nl/tla/elan