At smaller companies, access to and control of data is one of the biggest challenges faced by data analysts and data scientists. The same is true at larger companies when an analytics team is forced to navigate bureaucracy, cybersecurity, and an overtaxed IT department, rather than benefit from a team of data engineers dedicated to collecting good data and making it available.
Creative, persistent analysts find ways to get access to at least some of this data. Through a combination of daily processes (saving email attachments, running database queries, and copying and pasting from internal web pages), one might build up a mighty collection of data sets on a personal computer, in a team shared drive, or even in a database.
But this solution does not scale well, and it is rarely documented or understood well enough for others to take it over if a particular analyst moves on to a different role or company. In addition, it is a nightmare to maintain. One may spend a significant part of each day executing these processes and troubleshooting failures, leaving little time to actually use the data!
I lived this for years at different companies. We found ways to be effective but data management took up way too much of our time and energy. Often, we did not have the data we needed to answer a question. I continued to learn from the ingenuity of others and my own trial and error, which led me to the theoretical framework that I will present in this blog series: building a self-managed data library.
A data library is not a data warehouse, data lake, or any other formal BI architecture. It does not require any particular technology or skill set (coding will not be required but it will greatly increase the speed at which you can build and the degree of automation possible). So what is a data library and how can a small data analytics team use it to overcome the challenges I’ve described?
What is a data library?
* A set of principles for data management, not a technology stack.
* An informal, loosely but adequately connected data architecture consisting of data ponds, analytics datasets, and reporting datasets.
* A balance of speed of development, agility, usability, and cost.
* A way of prioritizing which data to include, based on the potential business value, difficulty, and data privacy concerns of a particular data source.
The data library approach is useful for the most common types of data that businesses create and use, but not for everything. It will not accommodate unstructured data (unless that data is being stored in a structured way, like a database table). Structured data should only be added if there is business value that exceeds the cost of setup, storage, and administration.
In this series, I will write an article on each of these four points, explaining the theory and providing practical examples of how it can be implemented.
Using this framework, I have front-loaded much of the data acquisition and data cleaning time that used to be a part of every new project. Bringing new data sources into our data library is a continuous priority, and once the data is there, it can fuel new analyses, models, and reports with little need for data munging.
On a daily basis our team spends very little time collecting data. On most days we review some basic data health metrics, find no problems, and go about our business.
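To make "basic data health metrics" a little more concrete, here is a minimal sketch in Python of what such a daily check could look like. It is an illustration only, not our actual process: the function names (`health_metrics`, `find_problems`), the field names, and the thresholds are all hypothetical, and it assumes a dataset small enough to load as a list of dictionaries.

```python
from datetime import date


def health_metrics(rows, date_field, as_of):
    """Summarize a dataset given as a list of dicts (hypothetical layout)."""
    row_count = len(rows)

    # Freshness: how many days old is the most recent record?
    dates = [r[date_field] for r in rows if r.get(date_field) is not None]
    latest = max(dates) if dates else None
    days_stale = (as_of - latest).days if latest is not None else None

    # Completeness: share of missing (None) values in each field.
    fields = sorted({key for r in rows for key in r})
    null_share = {}
    if row_count:
        for field in fields:
            missing = sum(1 for r in rows if r.get(field) is None)
            null_share[field] = missing / row_count

    return {
        "row_count": row_count,
        "days_stale": days_stale,
        "null_share": null_share,
    }


def find_problems(metrics, min_rows=1, max_days_stale=1, max_null_share=0.2):
    """Compare metrics to thresholds (assumed values) and list any problems."""
    problems = []
    if metrics["row_count"] < min_rows:
        problems.append("too few rows")
    if metrics["days_stale"] is None or metrics["days_stale"] > max_days_stale:
        problems.append("data is stale")
    for field, share in metrics["null_share"].items():
        if share > max_null_share:
            problems.append(f"too many missing values in {field}")
    return problems


if __name__ == "__main__":
    # Made-up sample records, purely for illustration.
    sample = [
        {"order_date": date(2024, 1, 2), "amount": 10.0},
        {"order_date": date(2024, 1, 3), "amount": None},
    ]
    metrics = health_metrics(sample, "order_date", as_of=date(2024, 1, 3))
    print(find_problems(metrics))
```

An empty list from `find_problems` corresponds to the "find no problems" case. The same three ideas (volume, freshness, completeness) can be tracked just as well in a spreadsheet or a scheduled database query if you are not coding.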
We regularly check in with stakeholders for the data we are collecting. We have good (not great) documentation on each process and data element.
Yes, it has taken time to do this. But in less than a year most of the technical work has been done by two people who spend less than half their time on it, along with much consultation with IT, other data analysts, and our internal customers.
You may be a small data analytics team (or you may even be a one-person band), but that doesn’t mean you have to settle for inefficient and incomplete data management.