In the previous article I explained the technology-agnostic principles behind a good data library. This article gives specific examples of how these principles may be implemented.
Let’s dive into some examples of how to implement data library principles.
There are several components to successful automation. The most obvious is the ability to schedule. Windows Task Scheduler is a simple, free option for Windows users to trigger scripts in desktop applications such as R or Excel (via Visual Basic for Applications). Free options exist for Mac users too.
It is better to set up the automation from a persistent server rather than someone’s personal machine. Personal machines may be shut down, go idle, get replaced, or go away when someone leaves the company. Persistent servers can be accessed by multiple team members and are more likely to be running when your tasks need to trigger. When my teams have been able to utilize a remote Windows server or Azure virtual machine there have been far fewer disruptions to automated processes compared to scheduling on personal machines.
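Whichever scheduler you use, it helps if every scheduled script records its own success or failure, so a missed run on the server is noticed rather than silently skipped. A minimal sketch in Python, where the wrapper function, the job, and the log path are all hypothetical illustrations rather than anything from our actual setup:

```python
import datetime
import traceback
from pathlib import Path

LOG_PATH = Path("job_status.log")  # hypothetical log location

def run_logged(job_name, job_fn):
    """Run a scheduled job and append a SUCCESS/FAILURE line to a log.

    A data health report (or a human) can scan this log to confirm
    that every scheduled task actually ran.
    """
    timestamp = datetime.datetime.now().isoformat(timespec="seconds")
    try:
        job_fn()
        status = "SUCCESS"
        detail = ""
    except Exception:
        status = "FAILURE"
        # Keep only the final line of the traceback as a short error summary.
        detail = " " + traceback.format_exc().splitlines()[-1]
    with LOG_PATH.open("a") as f:
        f.write(f"{timestamp} {job_name} {status}{detail}\n")
    return status

# Example: a job that refreshes a table (stubbed out here).
def refresh_sales_table():
    pass  # real work would go here

run_logged("refresh_sales_table", refresh_sales_table)
```

The same pattern works whether the scheduler is Windows Task Scheduler on a remote server or anything else: the scheduler only starts the script, and the script reports on itself.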
Tools such as Power BI, RStudio Connect, and SQL Server Management Studio (SSMS) also offer scheduling functions. When considering investing in a new tool or service, automation should be a chief concern: the tool should either support scheduling natively or work with a general-purpose scheduler like Windows Task Scheduler.
I strongly recommend that your team retain control over automation. An enterprise IT scheduling tool may be tempting if one is available, but if you need to submit a ticket to make a change, that solution is antithetical to the data library characteristics of agility and speed.
Basic Software Development Principles
A “test environment” can be as simple as having two folder structures, one with “PROD” as the parent folder and the other with “DEV” as the parent. With the change of a single parameter you can make changes in a safe test environment, and with a save-as you “promote” a script or file to production.
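The single parameter that flips between environments can be as small as one line. A Python sketch, with illustrative folder names (the author's own scripts could do the same in R):

```python
from pathlib import Path

ENV = "DEV"  # change to "PROD" when promoting; the only line that differs

BASE = Path("C:/DataLibrary") / ENV  # hypothetical root folder
RAW_DIR = BASE / "raw"
OUTPUT_DIR = BASE / "reports"

def output_path(filename):
    """Build an output path inside the current environment's folder tree."""
    return OUTPUT_DIR / filename
```

Because every path is derived from `ENV`, a script developed against the DEV tree cannot accidentally overwrite production files, and promotion is a save-as plus a one-character change.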
GitHub is a widely used, cloud-based, free option for version control, though it may not be allowed by an organization’s cybersecurity policies. SharePoint is not designed for development but can be used to maintain versions of the most important files and scripts. I do not recommend “manual” versioning, with many files following some kind of naming syntax, if it can be avoided.
Ideally, shared code would be embedded in the version control platform and be accessible to the processes that need to call it. Shared files or tables should have a dedicated space in your storage location; I currently put all non-system-specific reference tables in a SQL Server database, under the ReferenceTables schema.
In the previous article I described the attributes that the storage location should possess. At TechSmith our data library is all contained within a single Microsoft Azure SQL Server instance. Six databases are primarily utilized: three for the production environment and three for development. The three databases are for data ponds, analytics tables, and reporting tables, a topic on which I will expand in the next article in this series.
With this setup, permissions can be granted at the database or schema level. Our analytics tools can all connect to it securely, and SSMS is available for free to work directly with the database. Most importantly, this location was easy to adopt because our IT team is highly experienced with the Microsoft stack; in fact, the instance already existed and had available space, so there was no marginal cost.
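Keeping shared reference data behind one schema also makes it easy to standardize how scripts reach it. A sketch of a small helper for querying the ReferenceTables schema; the table name and the commented-out connection details are hypothetical, and only the schema name comes from the setup described above:

```python
def reference_table_query(table_name):
    """Build a SELECT for a table in the shared ReferenceTables schema.

    Bracketing the identifiers guards against unusual characters in names;
    real code should also validate table_name against a known list rather
    than interpolating arbitrary input.
    """
    return f"SELECT * FROM [ReferenceTables].[{table_name}]"

# With a driver such as pyodbc, usage would look roughly like:
#   import pyodbc
#   conn = pyodbc.connect(CONNECTION_STRING)  # hypothetical credentials
#   rows = conn.cursor().execute(reference_table_query("CountryCodes")).fetchall()
```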
The previous article described my preferred method of structuring raw data tables. For analytics and reporting tables I prefer to do pre-processing in R or SQL, but others may prefer to do manipulation in Power BI with Power Query or DAX. The goal is to structure the raw data, and utilize reporting and analytics tables, in a way that minimizes the level of effort to produce new reports, analyses and models in the future. Ultimately this will work best by leveraging the strengths of the team members, whether it be in SQL, a language like R, or the BI tool.
Readiness is a product of the outcomes of implementing the other data library principles. Automation ensures that new data is present and does not need to be collected for a report. Centrally stored code saves time when writing a new script. The architectural design should keep your team’s development effort relatively low.
Data library documentation should help someone who is unfamiliar with a data source or table know enough to begin exploring and using it, or know who to ask for help. It should answer questions like:
- Why did we add this table to the library?
- How is the table updated?
- Where is the code or process stored that builds the table?
- What does a row represent? How is a record defined?
- What data is stored in a particular column?
- Are there data privacy concerns?
This could be managed in text, a spreadsheet, or a more specialized tool like a data catalog. I have used all of the above. Currently at TechSmith we use Microsoft Azure Data Catalog because it was included as part of our Microsoft stack. Regardless of the method you should develop best practices to ensure consistent usage across users.
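One lightweight way to keep usage consistent, whatever the tool, is to treat each table’s documentation as a record with a fixed set of fields that mirror the questions above. The field names below are illustrative, not from any particular catalog product:

```python
# Illustrative documentation record; fields mirror the questions above.
TABLE_DOC_TEMPLATE = {
    "table": "",            # e.g. "ReferenceTables.CountryCodes"
    "purpose": "",          # why the table was added to the library
    "update_process": "",   # how and when the table is refreshed
    "source_code": "",      # where the build script or process lives
    "grain": "",            # what one row represents / how a record is defined
    "columns": {},          # column name -> description of stored data
    "privacy_notes": "",    # any data privacy concerns
    "owner": "",            # who to ask for help
}

def missing_fields(doc):
    """Return the names of fields left blank, as a simple consistency check."""
    return [key for key, value in doc.items() if value in ("", {})]
```

A check like `missing_fields` can run over every record to flag documentation that answers only some of the questions.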
The standards one uses may draw from an established standard for a particular language, an internal company standard, or both. At TechSmith, database objects are named in PascalCase, which I adopted for schema and table names. Within the R data science community, snake_case is the standard, which I adopted for R objects. This created a conflict for column names as data passed between the database and R. I chose snake_case because it is my personal preference and because far more of the code I write is in R than in SQL.
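When data crosses that boundary, a small helper can translate between conventions so neither side has to compromise. A sketch in Python (real column names may need extra rules for acronyms such as "ID" or "URL"):

```python
import re

def pascal_to_snake(name):
    """Convert a PascalCase database name to snake_case for analysis code."""
    # Insert an underscore before every uppercase letter except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def snake_to_pascal(name):
    """Convert a snake_case name back to PascalCase for database objects."""
    return "".join(part.capitalize() for part in name.split("_"))
```

Applying the converter once, at the point where data enters or leaves the database, keeps each environment internally consistent.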
Monitor Data Health
The biggest potential downside of an informal, bootstrapped data library is that there are far fewer controls on data quality. As with a car, it is necessary both to do preventive maintenance and to stay on guard for irregular problems that need to be fixed.
Analytics teams should have data health reports within the data library that catch basic and common problems early and regularly. If a process doesn’t run, or has an error, it should show up in the data health report. The report should be checked every day, or an alert should be pushed in the event of an automation error.
A data health report also identifies potential anomalies and outliers. Checking the data going into a model is just as important as, if not more important than, reviewing metrics on model performance.
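A basic freshness-and-volume check catches a large share of the common failures. A sketch, where the thresholds, table names, and timestamps are all made up for illustration:

```python
import datetime

def health_issues(table_stats, now, max_age_hours=26, min_rows=1):
    """Flag tables that look stale or suspiciously empty.

    table_stats maps table name -> (last_loaded: datetime, row_count: int).
    Returns a list of human-readable issue descriptions for the report.
    """
    issues = []
    for table, (last_loaded, row_count) in table_stats.items():
        age = (now - last_loaded).total_seconds() / 3600
        if age > max_age_hours:
            issues.append(f"{table}: last load {age:.0f}h ago")
        if row_count < min_rows:
            issues.append(f"{table}: only {row_count} rows loaded")
    return issues

# Hypothetical daily check: Sales refreshed this morning, WebTraffic stuck.
now = datetime.datetime(2024, 1, 2, 8, 0)
stats = {
    "Sales": (datetime.datetime(2024, 1, 2, 6, 0), 1250),
    "WebTraffic": (datetime.datetime(2023, 12, 30, 6, 0), 0),
}
```

Anomaly checks on metric values (sudden jumps, impossible negatives) can be layered onto the same report once the basics are covered.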
The data health report changes over time: as particular problems stabilize, those health checks are removed, while other checks are added in response to new, recurring issues. A small up-front investment in proactive data health monitoring saves much time and energy later on costly retroactive troubleshooting and on explaining to customers why the numbers were inaccurate.
After reading this article you should know the decisions that need to be made in order to get started on building a data library. The next article will go more in-depth on the data management strategy that I recommend.