Getting Your Machine Learning Model Out to the Real World

Paul Tune
August 4, 2020 AI & Machine Learning

An exciting development in the past few years is the proliferation of machine learning in products. We now see state-of-the-art computer vision models deployed on mobile phones, and state-of-the-art natural language processing models used to improve search.

While there have been many articles on new, exciting machine learning algorithms, there aren’t as many on productionizing machine learning models. Interest in the engineering of these systems is growing, as seen in the ScaledML conference and the birth of MLOps. However, productionizing machine learning models is a skill that is still in short supply.

In my career so far, I’ve gotten a few hard-earned lessons in productionizing machine learning models. Here are some of them.

Build or Buy?

Twitter has recently been all abuzz about GPT-3. Primed with the right seed, it turns out that GPT-3 can do many amazing things, such as generating buttons on webpages and React code. Spurred by these exciting developments, your stakeholder has an idea: what if you built a GPT-3 equivalent to help speed up frontend development at the company you work for?

GPT-3 generating code for a layout from a statement (source: here)

Should you train your own GPT-3 model?

Well, no! GPT-3 is a 175-billion-parameter model that requires resources most companies outside the likes of Google and Microsoft would be hard-pressed to muster. Since OpenAI is trialling a beta of its GPT-3 API, it would be more prudent to try it out and see if the model solves your problem and the API scales to your use case.
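
If you do want to evaluate it, the beta exposes a text completion endpoint through OpenAI’s Python client. Below is a minimal sketch of what a quick trial might look like; the prompt, engine choice and parameters are illustrative placeholders, not recommendations.

```python
# Sketch: trying the GPT-3 beta API before committing to building in-house.
# Assumes the beta-era `openai` Python client and an API key from the beta
# programme; the prompt and parameters are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    engine="davinci",  # the largest GPT-3 model available in the beta
    prompt="Generate the HTML for a red 'Subscribe' button:",
    max_tokens=64,
    temperature=0.3,
)

print(response.choices[0].text)
```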

So then, to buy (more like rent) or build? It depends.

It may look like a cop-out answer, but it is an age-old question in the software industry. It is also such a large and complicated topic that it deserves a post of its own.

However, I can offer some guidelines here that can help you discuss the buy-vs.-build question with a stakeholder. The first is:

Is this use case not a core competence? Is it generic enough for an existing/potential AWS¹ offering?

The underlying principle of this question is to ask yourself whether what you’re about to work on is part of the company’s core competence. If it isn’t, and your stakeholders can afford it, it may be worth buying an off-the-shelf solution.

Are resources besides data available? E.g., talent and infrastructure?

At its heart, building and deploying a reliable machine learning system is still mostly software engineering, with an added twist: data management. If the team doesn’t have experience building models as service APIs, or with collecting, cleaning, labelling and storing data, then the amount of time needed can be prohibitive, say 1 to 2 years (or even more). Deploying a third-party solution, by contrast, could take 3 to 6 months, since it’s mostly about integrating the solution into the product. Business owners find this saving of time and cost extremely attractive.

On the other hand, if it’s related to the company’s core competence, then investing in building up a team that can design and deploy models based on the company’s proprietary data is the better way to go. It allows much more control and understanding of the process, though it does come with a high initial investment overhead.

Just remember

All that matters to a business owner is: how will the output of your model be useful in the context of the business?

Broadly speaking, machine learning is the next step up in automation. Viewed this way, your model should either help increase revenue or improve business efficiency by reducing costs.

For instance, integrating a better recommendations system should lead to higher user satisfaction and engagement. Higher satisfaction leads to more subscriptions and organic promotion, which leads to higher revenue and lower churn.

As a data scientist, you have to help the business owner make the best decision on whether to build or buy, and that means assessing the specific pros and cons of either approach in the context of the business.

The points in this article are very useful in approaching the “build vs. buy” question, and they also apply to productionizing machine learning. There’s a case for building it yourself, as argued here, if it is a core business competence.

It all starts from the source

Consider the search results for a user who searched for photos of puppies. Here we have an example from Unsplash, one from Shutterstock, and finally one from EyeEm, this time on a mobile device.

Source: Unsplash
Source: Shutterstock
Source: EyeEm on mobile

We can see that each user interface (UI) displays a wide selection of images, but the layouts differ. Suppose we want to use clicks on images to train a model that gives better search results.

Which image would you be drawn to first if you were a user?

In the Shutterstock example, you may first be drawn to the puppy on the rightmost side of the window, but in the EyeEm example, it could be the image in the center.

It is clear that differences between the UIs cause differences in the distribution of the data collected.

The information retrieval literature has demonstrated that positional bias² and presentation bias³ exist when displaying results. It is entirely possible for users to click on the first-ranked result, even if it’s not relevant to their search query, simply because it’s the first thing they see and they might be curious about it!
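
The post only asks you to be aware of these biases, but if you do need to correct for them, one common mitigation in the learning-to-rank literature is inverse propensity weighting: down-weight clicks at positions users are predisposed to examine anyway. A minimal sketch, assuming per-rank examination propensities have already been estimated (the numbers below are illustrative):

```python
# Sketch: weighting click labels by inverse examination propensity to reduce
# positional bias in training data. Propensities here are placeholders; in
# practice they are estimated, e.g. via result randomization experiments.
propensity = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.45, 5: 0.35}

def ipw_weight(clicked: bool, rank: int) -> float:
    """Weight a clicked (query, image) example by 1 / P(examined at rank)."""
    if not clicked:
        return 0.0  # unclicked results are usually handled separately
    return 1.0 / propensity.get(rank, min(propensity.values()))

# A click at rank 4 counts for more than a click at rank 1, because the
# rank-4 result was less likely to be seen in the first place.
print(ipw_weight(True, 1), ipw_weight(True, 4))
```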

We can see that the click data used to train search ranking models will have these biases in them, driven by how the UI was designed. So,

Know where and how the model will ultimately be used in the product

A data scientist or machine learning engineer cannot build models in isolation from where the model will ultimately be used. A consequence of this is that a data scientist has to work with all manner of specialists involved in the product: the product manager, the designer, the engineers (frontend and backend) and quality assurance engineers.

This cross-functional teamwork provides benefits, including briefing all team members on the limitations of the model, how those limitations affect UI design, and vice versa. Another is finding simple, reliable ways to solve problems, for instance not using machine learning to patch over an inherent UI (and user experience) issue. Designers can also come up with ways to make the presentation of the model’s results look amazing, as well as constrain user actions in the UI to limit the space of inputs that goes into the model.

A data scientist must build up domain knowledge about the product, including why certain UI decisions were made and how product changes can impact your data downstream. The latter point is especially important, since it can mean your model becomes outdated very quickly when product changes are rolled out.

Moreover, how the results from the model are presented is important so as to minimize bias in future training data, as we shall see next.

Data collection is still challenging

Analytics collection typically depends on engineers setting up analytics events in the services they are responsible for across various parts of the product. Sometimes analytics are not collected, or are collected with noise and bias in them. Analytics events can be dropped, resulting in missing data. Clicks on a button may not be debounced, resulting in multiple duplicate events.
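
As an illustration of that last failure mode, duplicate clicks from a non-debounced button can often be collapsed downstream by treating clicks on the same element within a short window as a single event. A rough sketch; the event schema and the 300 ms window are assumptions, not a standard:

```python
# Sketch: collapsing duplicate click events that arrive because the UI did not
# debounce the button. Event fields and the 300 ms window are illustrative.
from typing import Iterable

def dedupe_clicks(events: Iterable[dict], window_ms: int = 300) -> list[dict]:
    """Keep one click per (user, element) within `window_ms` milliseconds."""
    last_seen: dict[tuple, int] = {}
    kept = []
    for event in sorted(events, key=lambda e: e["timestamp_ms"]):
        key = (event["user_id"], event["element_id"])
        previous = last_seen.get(key)
        if previous is None or event["timestamp_ms"] - previous > window_ms:
            kept.append(event)
        last_seen[key] = event["timestamp_ms"]
    return kept
```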

Machine learning models are, unfortunately, typically downstream consumers of this analytics data, which means problems with measurement and data collection have a huge impact on the models.

If there’s one thing to remember about data, it’s that

Data is merely a byproduct of measurement

This means that it is important to understand where the data is coming from and how it is being collected. It is then paramount that measurements are made correctly, with tests on the measurement system conducted periodically to ensure good data quality. It would also be prudent to work with an engineer and a data analyst to understand the issues first-hand, because there’s usually a “gotcha” in collected data.
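
What those periodic tests look like depends entirely on your pipeline. As a hypothetical illustration, a scheduled job might compare the day’s event volume and null rates against expectations and flag large deviations; the field names and thresholds below are assumptions:

```python
# Hypothetical daily data-quality check on collected analytics events.
# Thresholds, field names and the DataFrame schema are assumptions.
import pandas as pd

def check_events(events: pd.DataFrame, expected_daily_volume: float) -> list[str]:
    problems = []
    # Volume check: a big drop often means events are being dropped upstream.
    if len(events) < 0.5 * expected_daily_volume:
        problems.append(f"volume {len(events)} is <50% of expected")
    # Completeness check: key fields should rarely be missing.
    for column in ("user_id", "query", "clicked_result_id"):
        null_rate = events[column].isna().mean()
        if null_rate > 0.05:
            problems.append(f"{column} null rate {null_rate:.1%} exceeds 5%")
    return problems
```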

Stored data needs to be encoded to suit the business domain. Take image storage: in most domains, a lossy JPEG may be fine for encoding stored images, but not in domains where very high fidelity is needed, such as using machine learning to detect anomalies in MRI scans. It’s important that these issues be communicated up front to the engineers who are helping you collect the data.
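
To make the trade-off concrete, with an imaging library such as Pillow the choice is literally one argument at save time, which is why it’s worth agreeing on it before data starts accumulating. The formats and quality setting below are purely illustrative:

```python
# Illustrative only: lossy vs. lossless storage of an image with Pillow.
from PIL import Image

img = Image.open("example.png")

# Fine for most product imagery: small files, some information discarded.
img.convert("RGB").save("example.jpg", quality=85)

# For domains such as medical imaging, a lossless format preserves every pixel.
img.save("example_lossless.png")
```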

Machine learning in research differs from machine learning in the product in that your biggest wins generally come from better data, not better algorithms. There are exceptions, such as deep learning improving performance metrics in many areas. Even then, these algorithms are still sensitive to the training data, and you need a whole lot more of it.

It is always worth convincing your stakeholder to continually invest in improving the data collection infrastructure. Sometimes even a new source of data for constructing new features can boost model performance metrics. Not only does it help with better models, it helps with better tracking of the company’s key metrics and product features overall. The data analysts you work with will be very thankful for that.

As an ex-Dropbox data scientist told me recently

Your biggest improvements usually come from sitting down with an engineer to improve analytics

Check your assumptions on the data

Here is one story I experienced first-hand from the trenches.

During the early days of incorporating learning-to-rank models to improve image search at Canva, we planned to use a set of relevance judgements sourced from Mechanical Turk. Canva has a graphic design editor that allows a user to search over 50 million images to include in their design.

What this meant was collecting sets of pairwise judgements: for each search term, raters are given a pair of images and give a thumbs up to the one they prefer. An example pairwise judgement for the term firefly is shown below. Although both images are relevant, the rater preferred the image on the left over the image on the right.

A pairwise judgement for the search term “firefly”: the left image is preferred by a rater over the right image (source: author)
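
The post doesn’t detail the model itself, but the standard way to consume such pairwise judgements is a RankNet-style loss: score both images with the same model and penalise it when the preferred image does not come out on top. A minimal sketch with a linear scorer; the feature vectors and the plain-NumPy gradient step are assumptions for illustration only:

```python
# Sketch: a RankNet-style pairwise logistic loss over judgements of the form
# (features of preferred image, features of less-preferred image).
import numpy as np

def pairwise_loss_and_grad(w, x_preferred, x_other):
    """Logistic loss on the score margin s(preferred) - s(other)."""
    margin = (x_preferred - x_other) @ w
    loss = np.log1p(np.exp(-margin))
    grad = -(x_preferred - x_other) / (1.0 + np.exp(margin))  # d loss / d w
    return loss, grad

# Placeholder feature vectors standing in for real (query, image) features.
rng = np.random.default_rng(0)
w = np.zeros(16)
x_pref, x_other = rng.normal(size=16), rng.normal(size=16)

loss, grad = pairwise_loss_and_grad(w, x_pref, x_other)
w -= 0.01 * grad  # one gradient step; a real trainer loops over all pairs
```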

We then trained a model using these pairwise judgements, and another model that relied solely on clicks from search logs. The latter has more data, but of lower quality due to positional bias and errant clicks. Both models looked great on offline ranking metrics and on a visual check of the results (i.e., manually, using our eyes).

In online experiments, however, the model that relied on the sourced relevance judgements performed very poorly, tanking business metrics compared to the control and the other model.

What happened?

As it turned out, we had sourced the judgements on pure search relevance alone, without the context of users searching for images to fit their design. This missing context produced a dataset with a different distribution from the data distribution in the product.

Needless to say, we were wrong in our assumption.

Data distribution skew is a very real problem: even the team that implemented the Quick Access functionality in Google Drive faced it⁴. In their case, the distribution of the data collected in the development environment did not match the data distribution of the final deployed environment, as the training data was not collected from a production service.

The data used to train your models should match the final environment where your model will be deployed

Some other assumptions to check for are

  1. concept drift, where a significant shift in the data distribution happens (a very recent example is the COVID-19 outbreak messing with prediction models),
  2. the predictive power of features going into the model as some features decay in predictive power over time, and
  3. product deprecations, resulting in features disappearing.

These are partially solvable by having monitoring systems in place to check data quality and model performance metrics, as well as to flag anomalies. Google has provided an excellent checklist for scoring your current production machine learning system here, including data issues to watch out for.
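
As a lightweight starting point for such monitoring, you can compare the live distribution of each feature against a training-time reference, for example with a two-sample Kolmogorov–Smirnov test. The threshold and per-feature approach below are illustrative simplifications:

```python
# Sketch: flag features whose live distribution has drifted away from the
# training reference using a two-sample KS test. Threshold is illustrative.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(reference: dict, live: dict, p_threshold: float = 0.01):
    """Return names of features whose live sample looks unlike the reference."""
    flagged = []
    for name, ref_values in reference.items():
        result = ks_2samp(ref_values, live[name])
        if result.pvalue < p_threshold:
            flagged.append((name, result.statistic))
    return flagged

# Example: the live "price" feature has shifted upwards relative to training.
rng = np.random.default_rng(0)
reference = {"price": rng.normal(10, 2, 5000)}
live = {"price": rng.normal(13, 2, 5000)}
print(drifted_features(reference, live))  # flags "price"
```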

Remember, if a team at Google, whose maturity in deploying machine learning systems is one of the best in the world, had data assumption problems, you should certainly double-check your assumptions about your training data.

Wrapping it all up

Machine learning systems are very powerful. They provide companies with new options for creating new product features, improving existing ones, and increasing automation.

However, these systems are still fragile, as there are many considerations in how they are designed, and in how data is collected and used to train models.

I still have more to say on the topic, but these will do for now.


[1] Substitute with your favorite cloud provider.

[2] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay, Accurately Interpreting Clickthrough Data As Implicit Feedback, In Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 154–161 (2005).

[3] Yisong Yue, Rajan Patel, and Hein Roehrig, Beyond Position Bias: Examining Result Attractiveness As a Source of Presentation Bias in Clickthrough Data, In Proc. of the 19th International Conference on World Wide Web (WWW), pp. 1011–1018 (2010).

[4] Sandeep Tata, Alexandrin Popescul, Marc Najork, Mike Colagrosso, Julian Gibbons, Alan Green, Alexandre Mah, Michael Smith, Divanshu Garg, Cayden Meyer, and Reuben Kan, Quick Access: Building a Smart Experience for Google Drive, In Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1643–1651 (2017).
