It is frustrating trying to learn about machine learning. Do I use YOLO, Keras, TensorFlow, PyTorch, or all of them together somehow? And even if you figure out the PhD stuff, you still have to master about three other disciplines to get it to work in production: devops, programming, and counting disciplines.
Well, I am here for you, brothers and sisters in computers, and I have good news for your peace of mind.
The most important thing in getting machine learning to work properly is training data.
And even more importantly, the skills required to make a good training set have nothing to do with math, computers, or engineering.
What is training data?
The tools to take that data and turn it into a machine learning model that works exist today and are easy to use. You don’t have to go to Berkeley to use them; you just have to be a little familiar with computers.
Is it obvious where the magic is yet?
The Magic.
Where did the 2000 example e-mails come from? YOU! That’s right, anyone can create a dataset with just a bit of elbow grease, some focus, and the ability to follow a few simple conventions that I’ll outline below after I finish making this awesome point.
Sure, you could go online and find an existing dataset of spam, but if you’re actually trying to solve a problem at your company, you ought to create your own dataset that comes from the problem you are trying to solve.
For example, perhaps your company gets inundated with e-mails that you’d consider spam but that don’t fall under the traditional classification that, say, Gmail uses today.
Well, all you need to do is go through your company’s e-mails and put examples of spam into one folder, and examples of not-spam into another.
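To show how little code the two-folder idea needs, here is a minimal sketch in Python. It assumes each e-mail is saved as its own text file and that the folders are named something like `spam` and `not_spam`. Those names, and the `load_labeled_folders` helper itself, are made up for illustration and do not come from any particular tool.

```python
from pathlib import Path


def load_labeled_folders(root):
    """Return a list of (text, label) pairs.

    Hypothetical helper: every subfolder of `root` is treated as one label,
    and every file inside it as one example of that label.
    """
    examples = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # ignore stray files sitting next to the label folders
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # errors="replace" keeps one oddly encoded e-mail from
                # stopping the whole load
                text = path.read_text(encoding="utf-8", errors="replace")
                examples.append((text, folder.name))
    return examples
```

Calling `load_labeled_folders("emails")` on a folder that contains `emails/spam/` and `emails/not_spam/` would give you pairs labeled `"spam"` and `"not_spam"`. The folder name is the label, so the sorting you did by hand is the labeling.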
I agree, it is a bit tedious, but it is worth it in the end because what you are doing is GOLD! YOU ARE BASICALLY MAKING GOLD!!
It is practically a miracle what having a human-curated, problem-specific dataset will do for your machine learning model’s accuracy. Companies would kill* for such a dataset.
To summarize, the reasons for this are two-fold:
- Machine learning models perform best when they’re trained for a specific problem rather than a general one. Using a generalized model on a specific problem is like using a hammer on a screw instead of a screwdriver. A human thinking about the problem they’re trying to solve, and building a curated training set around it, is the best way to have success with your model.
- There is less and less room for generalized machine learning models to make a difference for businesses, and so they need to build models based on their own datasets, which some companies might not have or don’t know how to obtain.
What about visual data?
It is the same thing. There is an explosion in data labeling startups for visual imagery because of everything I excellently mentioned above. It really requires the same work as finding spam e-mails: go through your images and label them.
Now I get it, it sounds incredibly tedious, and it is tedious, which is why you can hire third-party companies to do the labeling for you, and I know plenty of good ones out there that I’ve worked with in the past.
But on smaller scales, you can do this yourself, and become a machine learning master! Here are some simple guidelines to get started:
Guidelines
- For classification, you must have a balanced dataset. That means, if you have 1000 spam e-mails, you also need 1000 not-spam e-mails. The tolerance for this is minimal to none, meaning don’t try to get away with 1300 spam and 700 not-spam. I’ll know.
- You must have a very clean dataset. That means, don’t have spam e-mails labeled as not-spam and vice versa. Incorrect labels will throw your model off by a surprising amount. I’ve seen accuracy increase by 20% or more after a thorough dataset cleaning.
- Be random. If you can, pick a random sample of spam and regular e-mails. Don’t just start at the beginning of your e-mail history and work your way forward. If you do this, you could be training a model on what spam USED to look like, not what it looks like today.
- Context matters. If you’re training a model to detect people in your office lobby from your security camera, use example images from the same camera, some with people and others without. You must train the model on what is NOT people as much as you are training it on what is people. Does that make sense? If not… well… I dunno this is a blog post so I can’t really answer your questions directly. Moving on…
- Consistency matters. When I was building a fake news dataset, I kept having to start over because I kept changing my mind about what constitutes fake news: satire? blog posts? opinion pieces? Draw the line in the sand somewhere and stick to it throughout the whole process.
- Validation. Set aside 20% of your training dataset and don’t train your model with it. Use the 80% to train the model, and the 20% to then test your model to see how well it is performing. And make sure you choose that 20% randomly.
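A few of these guidelines (balance, randomness, and the 80/20 validation split) are mechanical enough to sketch in code. The snippet below is one possible way to do it with nothing but the Python standard library, assuming your dataset is a list of `(text, label)` pairs. The function names `label_counts`, `balance`, and `split_train_test` are invented for this example, and downsampling the bigger class is just one way to get balanced; collecting more examples of the smaller class is another.

```python
import random
from collections import Counter


def label_counts(examples):
    """Count how many examples carry each label."""
    return Counter(label for _, label in examples)


def balance(examples, seed=0):
    """Randomly downsample every label to the size of the rarest one."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        # a random sample, not "the first N", so old examples don't dominate
        balanced.extend(rng.sample(items, smallest))
    return balanced


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle, then hold out `test_fraction` of the examples for testing."""
    rng = random.Random(seed)
    shuffled = list(examples)  # copy so the caller's list is left alone
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With the lopsided 1300 spam and 700 not-spam case from above, `balance` would trim the set to 700 of each, and `split_train_test` would then hand back 1120 examples to train on and 280 to test with. Neither function checks whether your labels are correct or consistent, so that part is still on you.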
Conclusions
Good datasets matter. Practice making them, play around with the tools to see what kind of accuracy you get, or just keep this all in mind when you call a data labeling company to label your data for you.