The quarterly report showed a 40% drop in sales over the last quarter. The executive team panicked. The entire company is collapsing! Or is it? After the investigation concluded, they discovered that the sales were okay. The report was incorrect because an automated system had failed. The infrastructure that copied files with orders didn’t copy 6 out of 10 files to the data lake — false alarm. The problem was not sales, but rather the quality of the data that was the basis of the dashboards.
Data quality is essential for trustworthy decisions
Data quality is critical because data is used for decision making and powering AI models. Models and decisions are only as good as the data behind them, so any lack of confidence in the data means they are less useful in predicting and providing insights, slowing down, and undermining fast decision making. Trust in data is hard to get and easy to lose, so data quality must be maintained for models and dashboards to be useful at all times. In the last 10 years, volume, variety, and velocity of business data have increased dramatically, making it impossible to control the quality of data with static testing. Therefore, To ensure confidence in analytics and predictive models at all times, data-driven firms need to be able to monitor data quality in production.
In this blog post, we review a typical data flow, causes of data corruption how to set confidence levels and data quality goals. Lastly, we go under the hood and explain our solution to monitoring data quality in production in real-time.
First, let’s look at a typical data flow
Below is a basic diagram of a typical data flow. Data enters the flow diagram from internal databases and other sources. It enters a central repository consisting mainly of the enterprise data warehouse and data lake. Data is then distributed to various outputs as required.
Anywhere along this data flow, corruption may occur. We look at protections at each step.
The leading causes of data corruption
Data corruption takes on many forms and has many causes. Three of the most prevalent causes of data corruption are code, data sources, and infrastructure. Catching data corruption requires data monitoring in production in realtime.
The application itself may cause some data corruption. Traditional application code debugging techniques may find some bugs, but there is a high probability it may not find all. Static testing of code in data pipelines is not enough because it is expensive to have 100% coverage, so not all defects are identified.
Data sources and extraction corruption
Data may be corrupt directly out of the data source. Sensors may fail; human error during data input could occur.
Sources out of the direct control of the application can cause data corruption. Networks can go down; power glitches can affect data integrity; even sunspots have been known to cause network mishaps.
Using confidence levels to fine-tune data quality
Data quality is not a binary measurement. Some data may be difficult to resolve as valid or corrupt. For example, an expected value of a column may be between 5 and 50, but 0.3% of values in the dataset are out of range. Does this indicate value corruption or simply outliers? To better understand these marginal values, a “confidence level” can be attached along with the data quality flag.
The goals of data quality monitoring
Let’s focus on the data lake. With no data quality monitoring, the data lake would look like this:
On the left are jobs supplying data to the lake. Data is stored and manipulated with no quality checking. Any corrupted data would not be detected, and it can spread the corruption to adjoining modules.
In the figure below, data quality jobs get inserted between regular data processing jobs. Each data processing job is isolated from the others with the data quality jobs:
When the data quality job detects an anomaly, we immediately instigate action before the corruption spreads. Depending on the severity of the corruption, the remedy may be to flag the error and continue or, for more severe issues, the pipeline may be stopped to prevent further corruption.
Let’s look at these steps in more detail
Detect data corruption
This blog deals mainly with this section. The first step in controlling data quality is always detecting the offending data point.
Prevent it from spreading
Once we find an issue, it should be isolated from causing further corruption. Depending on the degree of severity, this could mean tagging the offending data point, disabling a sub-system, or even shutting down the entire system.
Once we identify an issue and have it isolated, the next step is to fix it. The support team responsible for the system is alerted. They may take one of several actions depending on the source of the issue. A system or subsystem may need rebooting. A sensor may need repair. The offending datapoint may need removal. These are a sampling of possible remedies.
Methods to implement data quality in real-time
The proper way to enforce data quality in real-time is to embed data quality assurance logic in the data processing pipeline, after each step of the pipeline.
Control divergence from SoR at the input
- Validate correctness of the imported data
- Prevent stale data from entering the system
- Prevent corruption accumulation in stream processing use cases
- Check data before it gets in the lake
Every system of record (SoR) shares the following characteristics: it provides the most complete, most accurate and most timely data, it has the best structural conformance to the data model, it is nearest to the point of operational entry, and it can be used to feed other systems.
The data quality job should compare the system of records and datasets in the analytical data platform to ensure completeness and control the staleness of data in the platform. Comparing the SoR and datasets is especially useful if the import is done by streaming, since errors may accumulate.
Validate business rules of stored data
- Enforce schema
- Check for nulls
- Validate data ranges
- Specify and enforce data invariants
Business rules are manually set up for data validation purposes. The system looks for issues such as nulls, boundaries for numerical fields, or other business validation rules for specific data fields. A simple business rule looks something like this:
-40 < x < 152
If the actual value falls outside of this predetermined range, it needs further investigation as possible data corruption.
A null is a lack of data. In short, if you expect data and do not get it, that is an error. While null detection sounds trivial, it may not get flagged as a data quality error.
Detect anomalies at the output
- Fully automatic data quality enforcement
- Collect data profile, metrics, and statics
- Train ML models
“Something does not look quite right.” Humans are naturally inclined to seek patterns and can notice when those disrupted patterns. Computers can be trained to do the same thing to a much finer degree.
Below is an example of a data graph bounded by a rolling average. An anomaly well outside of the normal range causes the rolling average to take a substantial departure from the norm and becomes immediately suspect:
Data quality reference demo
Grid Dynamics created a demonstration anomaly detection application using open-source modules Griffin, Grafana, ElasticSearch, along with custom code and ML models. The goal is to minimize the manual definition of rules and automatically generate anomaly detection logic and automatically embed it into the data pipelines. The figure below shows the demo reference architecture and technology stack:
Apache Griffin – An open-source Data Quality solution for Big Data, which supports both batch and streaming mode. It offers a unified process to measure your data quality from different perspectives, helping you build trusted data assets, therefore boost your confidence for your business.
Grafana – Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data-driven culture.
ElasticSearch – Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Data quality jobs can be embedded in any data processing pipeline, implemented in Apache Spark, Apache Flink, Apache Beam, traditional Hadoop MapReduce jobs, or commercial data engineering tools. It can be integrated into on-premise data platforms or cloud API data platforms such as Google Dataproc, Google Dataflow, Google BigQuery, Amazon EMR, Azure Databricks Spark, Azure HDInsight.
Without trust, you have nothing…
In this blog, we covered the causes of data correction as well as methods to implement data quality in real-time. Trusting the quality of your data is as important as the data itself; any decision based on mistrusted data becomes mistrusted itself. The cascade effect caused by the lack of data quality monitoring can become severe enough to impact the entire enterprise negatively. This issue is not isolated to any one industry but can affect virtually any company. For a smooth-running organization, data quality is essential.
This article originally appeared at Grid Dynamics Blog.