Good pipelines, bad data

Barr Moses
February 14, 2020 Big Data, Cloud & DevOps

How to start trusting data in your company.

It’s 2020, and we’re still using “photos and a paper trail” to validate data. In the recent Iowa caucuses, data inconsistencies eroded trust in the results. This is just one of many recent and prominent examples of pervasive “data downtime.”

Data downtime refers to periods of time when your data is partial, erroneous, missing or otherwise inaccurate — and almost every data organization I know struggles with it. In fact, this HBR article cites a study that found companies lose an average of $15M per year due to bad data. In this blog post, I will cover an approach to managing data downtime that has been adopted by some of the best teams in the industry.

So, what does it mean to measure data downtime?

To begin unpacking that, let’s look into what counts as “downtime”. Data downtime refers to any time data is “down”, i.e. when data teams find themselves answering “no” to common questions such as:

  • Is the data in this report up-to-date?
  • Is the data complete?
  • Are fields within reasonable ranges?
  • Do my assumptions about upstream sources still hold true?
  • … and more

Or in other words… Can I trust my data?
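To make these questions concrete, here is a minimal Python sketch that answers three of them for a single batch of records. The function name `check_batch`, the field names `user_id` and `amount`, and the thresholds are all hypothetical choices for illustration, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone

def check_batch(rows, loaded_at, now, max_age=timedelta(hours=24),
                required=("user_id", "amount"), amount_range=(0, 10_000)):
    """Return a yes/no answer for three of the questions above.

    All parameter names and defaults are illustrative assumptions.
    """
    low, high = amount_range
    return {
        # Is the data in this report up-to-date?
        "up_to_date": now - loaded_at <= max_age,
        # Is the data complete? (no required field is missing or None)
        "complete": all(row.get(f) is not None
                        for row in rows for f in required),
        # Are fields within reasonable ranges? (missing values are skipped)
        "in_range": all(low <= row["amount"] <= high
                        for row in rows if row.get("amount") is not None),
    }

now = datetime(2020, 2, 14, 12, 0, tzinfo=timezone.utc)
rows = [
    {"user_id": 1, "amount": 25.0},
    {"user_id": 2, "amount": None},      # incomplete record
    {"user_id": 3, "amount": 50_000.0},  # outside the assumed range
]
report = check_batch(rows, loaded_at=now - timedelta(hours=30), now=now)
print(report)
```

Any `False` in the report means the answer to that question is “no” for this batch, which is exactly the situation described above as data being “down”.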

Answering these questions in real time is hard.

Data organizations large and small struggle with these questions because (1) consistently tracking this information across data pipelines requires substantial resources; (2) at best, the information is limited to the small subset of the data that has been laboriously instrumented; and (3) even when it is available, sifting through this information is tedious enough that teams often find out about data issues only in hindsight.

In fact, it is typical that data consumers — product managers, marketing experts, executives, data scientists, or even customers — identify data downtime right at the moment when they need to use the data. And somehow, that always happens late on a Friday afternoon…

How come we know everything about how well our data infrastructure is performing, but so little about whether the data is right?

A helpful analogy here comes from the world of infrastructure observability. Almost every engineering team has tools to monitor and track infrastructure and to verify that it is performing as expected. This is often referred to as observability: the ability to determine a system’s health based on its outputs.

Great data teams make investments in data observability — the ability to determine whether the data flowing in the system is healthy. With observability comes the opportunity to detect issues before they impact data consumers, and to then pinpoint and fix problems in minutes instead of days and weeks.

So what makes for great data observability? Based on learnings from over 100 data teams, we’ve identified five pillars.

Each pillar encapsulates a series of questions which, in aggregate, provide a holistic view of data health:

  • Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
  • Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
  • Volume: has all the data arrived?
  • Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
  • Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision making?
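As a rough illustration of how two of these pillars might be tracked, the Python sketch below compares two schema snapshots and flags an unusual row count. The helper names `schema_changes` and `volume_is_unusual`, the column names, and the three-standard-deviation threshold are assumptions made for this example; real systems may use quite different heuristics.

```python
from statistics import mean, stdev

def schema_changes(previous, current):
    """Compare two {column: type} mappings and report what changed."""
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "retyped": sorted(c for c in shared if previous[c] != current[c]),
    }

def volume_is_unusual(history, todays_count, threshold=3.0):
    """Flag a row count far from the historical mean.

    `threshold` is the number of standard deviations tolerated;
    3.0 is an arbitrary illustrative default.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return todays_count != mu
    return abs(todays_count - mu) / sigma > threshold

# Hypothetical schema snapshots taken on two consecutive days.
old = {"id": "int", "email": "str", "amount": "float"}
new = {"id": "int", "email": "str", "amount": "str", "country": "str"}
changes = schema_changes(old, new)
print(changes)

# Hypothetical daily row counts for the same table.
history = [1000, 1020, 980, 1010, 990]
print(volume_is_unusual(history, 400))
print(volume_is_unusual(history, 1005))
```

Keeping the schema snapshot and the row-count history is the part that takes discipline; once they exist, the comparisons themselves are cheap to run on every load.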

Admittedly, data can break in endless ways and for a wide range of reasons. Surprisingly, we have found time and time again that these pillars — if tracked and monitored — will surface almost any meaningful data downtime event.
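Lineage is what lets a team move from “something broke” to “here is what it affects”. The sketch below, written in Python with an invented dependency map and made-up asset names, walks downstream from a broken source to list every asset that may be impacted. It is one simple way to model the idea, not a description of how any specific product does it.

```python
from collections import deque

def downstream_assets(edges, source):
    """Return every asset reachable from `source`, in breadth-first order.

    `edges` maps an asset to the assets that read from it directly.
    """
    seen, order, queue = {source}, [], deque([source])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Hypothetical lineage: each key feeds the assets in its list.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["analytics.revenue", "analytics.churn"],
    "analytics.revenue": ["dashboard.exec_kpis"],
}
impacted = downstream_assets(lineage, "raw.orders")
print(impacted)
```

Pairing a list like this with the owners and consumers of each asset answers the second lineage question above: who needs to know when the data is down.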
