In the spirit of Halloween, let’s focus on something really scary: over half of enterprise AI projects fail. They’re strategic, often board-driven, expensive and highly visible, and yet most of them flop. AI initiatives that never go into production cost people their jobs and reputations. When something goes wrong after the model is deployed, sometimes there are nasty headlines and the need for crisis management.
AI projects fail for many reasons, but avoiding or correcting these common training-data mistakes can significantly improve the odds of a project's success.
Don’t ask your data scientists to prepare your training data
It’s not unusual for data scientists to collect and annotate the relatively small data sample required to prove an algorithm’s fundamental concept. At this early stage the data set is manageable: the team can control quality, and bias issues in the data are easy to detect. But when the same team of expensive, highly skilled data scientists is expected to produce a full-blown training data set, AI initiatives can go down in flames.
With a difference in scale of two or three orders of magnitude, the task overwhelms a data science team. A data preparation exercise of this size requires tools for managing data items, labeling and annotation tasks, and the people those tasks are assigned to. It requires far more hours of work than data scientists have to spare. And it requires skills that small AI teams rarely have: managing large projects, designing annotation and labeling tasks, auditing work output, and creating processes for verifying label and annotation accuracy.
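To make "verifying label and annotation accuracy" concrete, here is a minimal sketch of one common QA check: measuring how often two annotators agree on the same items, corrected for chance with Cohen's kappa. This is an illustration, not a prescription from the article; the label values and the 0.7 cutoff in the comment are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators who labeled the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Raw agreement: fraction of items both annotators labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical double-annotated batch from a labeling project.
annotator_1 = ["cat", "dog", "dog", "cat", "bird", "dog", "cat", "bird"]
annotator_2 = ["cat", "dog", "cat", "cat", "bird", "dog", "dog", "bird"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # e.g. flag batches below ~0.7 for re-annotation
```

Checks like this are exactly the kind of process a dedicated data operations effort runs continuously, and the kind a small team of data scientists rarely has time to build.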
Don’t buy pre-labeled training data
After piloting their algorithm, some AI teams opt to acquire pre-labeled training data from commercial or open sources. For the fortunate few, pre-labeled data offers a representative sample of their problem universe, is annotated in ways appropriate to the use case, and is available in sufficient volume and at sufficient accuracy to train the algorithm. For everyone else, pre-labeled data is a mirage. It may introduce sample bias into the training data. It may carry measurement bias from whatever instrument generated the data. And it may reflect prejudicial or societal bias on the part of the people who labeled it. It’s not uncommon for an enterprise data science team to get an algorithm to, say, an 80% confidence level using pre-labeled data, and no higher.
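One inexpensive sanity check before committing to a pre-labeled corpus is to compare its class distribution against a small sample your own team has labeled from production data. The sketch below is a rough illustration with hypothetical class names and an arbitrary 10-point threshold; a large gap between the two sources is one symptom of sample bias.

```python
from collections import Counter

def class_shares(labels):
    """Fraction of items per class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def flag_distribution_gaps(purchased_labels, in_house_labels, threshold=0.10):
    """Report classes whose share differs by more than `threshold` between sources."""
    bought = class_shares(purchased_labels)
    ours = class_shares(in_house_labels)
    gaps = {}
    for cls in set(bought) | set(ours):
        diff = abs(bought.get(cls, 0.0) - ours.get(cls, 0.0))
        if diff > threshold:
            gaps[cls] = diff
    return gaps

# Hypothetical labels: a purchased pre-labeled set vs. a small hand-labeled production sample.
purchased = ["invoice"] * 70 + ["receipt"] * 25 + ["contract"] * 5
in_house = ["invoice"] * 40 + ["receipt"] * 35 + ["contract"] * 25

for cls, gap in flag_distribution_gaps(purchased, in_house).items():
    print(f"Possible sample bias in class '{cls}': share differs by {gap:.0%}")
```

A check this simple won't catch measurement or labeling bias, but it can reveal early that a purchased data set doesn't look like the data your model will actually see.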
Don’t go over the waterfall
Most software development teams learned a long time ago that the traditional waterfall methodology is a scary ride. Waterfall adherents regard applications as monoliths moving down an assembly line: first the entire application is mapped out, then it is architected, then coded. Finally, the entire system enters testing and, ultimately, deployment.
The waterfall method is vulnerable to complexity. Complexity makes applications bigger, with more interdependencies. In waterfall, these interdependencies aren’t fully explored until the testing stage, at which point any oversights or miscalculations that are discovered drive the entire project back to the architecture or coding stage.
Complexity also makes project lifecycles longer. And with long lifecycles, development projects are at the mercy of business and technology changes, which can force an application do-over or result in a system that is irrelevant by the time it is deployed.
No one writes software this way anymore. But this is how most enterprises still manage AI projects.
Software developers long ago migrated to a more iterative, agile style of development. Data science teams should do the same.
AI initiatives that evolve in an agile fashion get broken into manageable, bite-sized pieces. This approach lets teams learn continuously and re-prioritize where necessary, which ultimately means delivering value sooner and building the right model more efficiently. It also gives teams flexibility and emphasizes constant development and testing.
An agile data science team would train an algorithm on a specific part of its larger role, get that part to an acceptable level of confidence, and then deploy it, even if other parts of the algorithm aren’t production ready. In doing so, the team gets value from its AI investment sooner.
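What "deploy the parts that are ready" might look like in practice: the sketch below routes a prediction to production only when it belongs to a class whose validation accuracy has already cleared the bar, and sends everything else to the existing fallback, such as human review or a rules engine. The class names, accuracy numbers, and threshold are all hypothetical.

```python
# Per-class validation accuracy from the latest evaluation run (hypothetical numbers).
validation_accuracy = {
    "billing_question": 0.93,
    "password_reset": 0.91,
    "cancel_account": 0.72,   # not production-ready yet
    "other": 0.55,            # not production-ready yet
}

ACCEPTANCE_THRESHOLD = 0.90
deployed_classes = {cls for cls, acc in validation_accuracy.items()
                    if acc >= ACCEPTANCE_THRESHOLD}

def route(prediction, confidence):
    """Serve the model's answer only for classes that have cleared the bar;
    everything else goes to the existing fallback (humans or rules)."""
    if prediction in deployed_classes and confidence >= ACCEPTANCE_THRESHOLD:
        return ("model", prediction)
    return ("fallback", prediction)

# The model may be confident about "cancel_account", but that class hasn't
# reached acceptable accuracy yet, so the fallback handles it.
print(route("password_reset", 0.97))  # ('model', 'password_reset')
print(route("cancel_account", 0.95))  # ('fallback', 'cancel_account')
```

Routing logic like this lets the ready parts of the model start earning their keep while the weaker classes keep improving through further training iterations.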
It can be scary to think about the risk surrounding AI initiatives today. But data science teams can reduce the fright if their training data is prepared right and their development process is sound.
Originally posted on insideBIGDATA