In the previous articles in this series I dove into the first two of the four characteristics of a data library. This article explains how the last two characteristics come together in the “operationalization” of your data library.
What is a data library?
* A set of principles for data management, not a technology stack.
* An informal, loosely but adequately connected data architecture consisting of data ponds, analytics datasets, and reporting datasets.
* A balance of speed of development, agility, usability, and cost.
* Prioritizes inclusion of data based on potential business value, difficulty, and data privacy concerns for a particular data source.
At a previous company I reported to a former engineer turned data-analytics leader who often urged us not only to create, but to “operationalize,” our data products. After being confused for some time, I asked him to explain what he meant by bringing that engineering term into the data context. “Operationalizing” means making your report, analysis, dashboard, model, etc. into a mature product. Like any external product, it should have a defined purpose and audience and be launched into the “market” in a state that is ready to bring value to a customer. A data library is a great foundation to ensure that the work your team does can be effectively and efficiently operationalized.
Data Libraries balance speed of development, agility, usability, and cost
Data Libraries support fast development
Data wrangling, the collection, cleaning, and preparation of data for analysis, can easily take more time than any other aspect of the data analytics workflow, as much as 80-90% of it. But in a data library, the combination of data ponds, analytics datasets, and reporting datasets means that your data should be ready for most new analyses, models, and reports with little wrangling.
The data ponds are updated automatically and stored in a location accessible to the tools needed for the project. Adequate documentation is in place so that any analyst can take advantage of the data ponds built by others on the team. The data library principles provide the foundation for faster development time.
Data libraries are agile and accommodate many use cases
When a new project demands adding a column to a table, the project scope can balloon to include editing a database and the data collection process. Data ponds with my preferred long table structure can easily accommodate new fields without needing to add a column to a table, as the sketch below shows.
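To make this concrete, here is a minimal sketch in R of a long-format data pond, using a hypothetical survey source; the field names and values are illustrative, not taken from my actual ponds.

```r
# Long table sketch: each row is one (id, field, value) triple, so adding a
# new field means adding rows, not altering the table's schema.
library(tibble)
library(tidyr)

pond <- tribble(
  ~respondent_id, ~field,      ~value,
  1,              "nps_score", "9",
  1,              "plan",      "pro",
  2,              "nps_score", "6",
  2,              "plan",      "free",
  2,              "referrer",  "search"   # field added later; no new column required
)

# Pivot to a wide shape only when a specific analysis calls for it
wide <- pivot_wider(pond, names_from = field, values_from = value)
print(wide)
```

The trade-off is that values sit in a generic column and are widened on demand, which suits the pond's role as a flexible staging layer.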
By using data ponds, you should already have the data you need for most projects anyway. The data pond construction process requires long-term planning about data that might be needed for a variety of use cases, rather than focusing merely on the data needed for a particular project.
Data libraries are low cost
As I described previously, data libraries do not require a particular tech stack and can utilize what your company or team already has available, with little to no additional investment in BI tools or servers. Using Power BI instead of Excel is preferable, as is storing data in a database rather than flat files, but these investments are not absolutely necessary if cost is an issue. The data library principles are tech stack agnostic.
When building a data library, prioritize data based on business value, difficulty, and data privacy concerns
Just as a public library’s collection is not acquired all at once, data libraries should be built incrementally, iterating through a standard cataloguing process for each data source. This allows you to get value from the library quickly. If you want to build a data library while also keeping up with regular projects, reporting, and ad hoc requests, set a goal for the pace at which you will build the library. A simple goal is to add one to three data sources per quarter. Spending 25% of the team’s time on cataloguing new data sources should be enough to make significant progress while the team keeps up with its most important priorities.
If you set a goal to add two data sources per quarter, you may need anywhere from six months to multiple years to build out the initial library. Given those time constraints it is paramount that you only catalogue data sources with significant business value, and that you rank them. I rank data sources according to three elements (a small scoring sketch follows the list):
* Business Value
* Difficulty
* Privacy concerns
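To make the ranking concrete, here is a minimal sketch in R. The sources, scores, and the priority formula are illustrative assumptions, not my actual rubric.

```r
# Hypothetical scoring sketch: sources, scales, and the priority formula
# are illustrative assumptions, not the rubric from an actual library.
library(tibble)
library(dplyr)

sources <- tribble(
  ~source,           ~business_value, ~difficulty, ~privacy,
  "web_analytics",   8,               4,           "small",
  "crm",             9,               7,           "large",
  "support_tickets", 6,               5,           "medium"
)

# Rank by value relative to difficulty; privacy stays a qualitative flag
# rather than a score, per the t-shirt sizing described below.
ranked <- sources |>
  mutate(priority = business_value / difficulty) |>
  arrange(desc(priority))
print(ranked)
```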
Business Value
For many problems, business value is difficult to quantify. In my experience the value of data products has usually been to better understand the business and to support future decision-making. Straightforward, direct cost savings or revenue increases may be more common for certain types of analytics teams and industries, but simply getting better at descriptive analytics, the most common type, is a worthy goal for most.
While quantifying the value is difficult, it is possible, especially in relative terms when comparing data sources. Use a consistent scoring rubric with questions that reflect the value your company receives from analytics. This may include things like saving a busy person time, increasing the frequency at which reports can be updated, and making new analyses possible that were previously too difficult or time-consuming to pursue. A toy version of such a rubric is sketched below. In my next article I will share the rubric that I use at TechSmith to evaluate and compare the business value of more than 25 data sources.
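Here is that toy rubric in R. The questions mirror the examples above, but the 0-3 scale and the equal weighting are assumptions for illustration only.

```r
# Toy business value rubric: score each question 0-3 and sum.
# Questions, scale, and weighting are illustrative assumptions.
rubric_scores <- c(
  saves_a_busy_person_time         = 3,  # e.g. replaces a manual weekly report
  increases_reporting_frequency    = 2,  # e.g. monthly -> daily refresh
  enables_previously_hard_analyses = 3   # e.g. joins two siloed sources
)
business_value <- sum(rubric_scores)
business_value  # 8 out of a possible 9 for this hypothetical source
```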
Difficulty
Difficulty is even more subjective than business value because it depends on the skills and technology available to an analytics team. What is easy for one team may be very difficult for another. Similarly, one data source may be very easy to collect, store, document, and clean but near-impossible to automate. As with business value, these measures should be developed for your company and each data source scored. The next article will show what we use at TechSmith.
Privacy concerns
Data privacy is often overlooked or intentionally disregarded. Until recently there was little regulatory incentive to take it seriously, though the ethics of data privacy should be considered regardless of where your company operates. An analytics team that works with personal information should have standards governing things like how long that data is retained, who has access, and how it is secured.
It is not necessary to score data sources on privacy concerns, but the concerns should be identified and given a “t-shirt size” estimate (small, medium, or large). This will help to identify the investments in data privacy policy and/or technology that are needed in order to ethically build a data library (and comply with applicable regulations).
Visualize your prioritization in a matrix
I recommend visualizing the results of this prioritization work in a matrix that shows where data sources land on the dimensions of value and difficulty, as well as their level of privacy concern. In the next article I will share the design of the visual I use as well as the R code to create it.
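In the meantime, here is a minimal sketch of such a matrix in R with ggplot2, reusing the hypothetical `sources` tibble from the scoring sketch above; it is a stand-in, not the visual from the next article.

```r
# Minimal prioritization matrix: value vs. difficulty, with the t-shirt
# privacy estimate mapped to point size. Uses the hypothetical `sources`
# tibble defined earlier; all values are illustrative.
library(ggplot2)

sources$privacy <- factor(sources$privacy, levels = c("small", "medium", "large"))

ggplot(sources, aes(x = difficulty, y = business_value, label = source)) +
  geom_point(aes(size = privacy), alpha = 0.6) +
  geom_text(vjust = -1, size = 3) +
  scale_size_discrete(name = "Privacy concerns") +
  labs(x = "Difficulty", y = "Business value",
       title = "Data source prioritization matrix")
```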