Today, there are new and changing uses of data in the digital economy. The big questions, however, are: Who is winning with data? Where is this data being kept? What makes new data different? When should data be kept, moved, deleted, or transformed? How should data be valued? And why is data so much more important than it used to be?
Data is one of the most important assets any company has, but it’s surprising that we don’t put the same rigor into understanding and measuring the value of our data that we put into more traditional physical assets. Furthermore, should data be depreciated as an asset? Or does it appreciate, like art? Counter-intuitively, the answer is “yes” to both.
To measure the value of data, we can apply some simple economic principles. For example, we know from IT forecasts that the amount of digital data being produced is much higher than the amount of data storage being purchased.
Why is this the case? Simply put: Some data is produced, but not stored. Applying a simple supply and demand curve helps to explain why. (Source: IDC Digital Universe, Recode.)
The “supply” curve is storage capacity. This capacity is available at a range of prices, but there is a minimum price threshold for storage capacity. The “demand” curve is data produced. This data creates a demand for storage capacity. That data is stored as long as the value of the data exceeds the cost to store and access it over time. A simple equation to represent this is:
Data Value(t) >= (Sc + Mc + Ac) per GB per year * Retention Period
Basically, the value of the data as a function of time must be greater than the sum of the cost of networking, compute, and storage infrastructure (Sc), the cost to maintain the data and infrastructure (Mc), and the cost to access the data (Ac), multiplied by the retention period. To normalize between cloud and non-cloud IaaS, all of the measures are per GB per year. The cost is therefore also a function of time, as measured by the retention period for the data.
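As a minimal sketch of this rule in Python (the function name and all cost figures below are illustrative assumptions, not real prices):

```python
# Minimal sketch of the retention rule above. Cost figures in the example
# call are illustrative assumptions, not real market prices.

def worth_keeping(data_value_per_gb: float,
                  sc: float,   # Sc: infrastructure cost, $/GB/yr
                  mc: float,   # Mc: maintenance cost, $/GB/yr
                  ac: float,   # Ac: access cost, $/GB/yr
                  retention_years: float) -> bool:
    """True when the value of a GB of data over its lifetime exceeds the
    cost to store, maintain, and access it for the retention period."""
    return data_value_per_gb >= (sc + mc + ac) * retention_years

# Example: data worth $0.30/GB, kept 7 years at $0.035/GB/yr total cost.
print(worth_keeping(0.30, sc=0.02, mc=0.01, ac=0.005, retention_years=7))  # True
```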
Of course, in real life, the equation is not quite so simple. The cost to store and maintain data depends on how long it must be kept, whether it must be protected and secured, and so on. There is also a penalty for keeping data: the longer it is retained, the greater the risk of security violations, data loss, and data corruption.
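One way to fold those risks into the cost side of the equation (my own assumption; the simple equation above has no risk term) is to add an expected-loss term to the annual cost:

```python
# Hypothetical extension of the cost side: treat retention risk as an
# expected annual loss. risk_rate and loss_per_gb are assumed inputs.
def risk_adjusted_annual_cost(sc: float, mc: float, ac: float,
                              risk_rate: float,     # P(incident) per year
                              loss_per_gb: float) -> float:
    """Annual cost per GB including expected losses from security
    violations, data loss, or corruption."""
    return sc + mc + ac + risk_rate * loss_per_gb

# Example: a 1% annual incident probability with a $2/GB impact adds
# $0.02/GB/yr, raising the bar for which data is worth keeping.
print(risk_adjusted_annual_cost(0.02, 0.01, 0.005, 0.01, 2.0))  # 0.055
```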
On the value side of the equation, there is no well-defined measure for data value. The value of data is really a measure of the business value that results from using or analyzing that data in some way. In addition, there is a correlation between the amount of data kept, how accessible that data is, and its value. For example, having more data makes all of the data more valuable if the use of the data depends on a historical trend. Machine learning is already changing the value of larger data sets in exactly this way, because most machine learning algorithms work better when trained with large amounts of data.
The area under the curve represents the amount of data that is created but not stored because its value is perceived to be lower than the cost to keep it. If we start with the hypothesis that people would keep all of the digital data that they produce if they could, then we want to eliminate the area under the curve. In order to do that, we need to either lower the cost of storing and maintaining data or increase the value of the data – or both.
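To make that trade-off concrete, here is a small worked example (all figures are assumed, for illustration only) showing how lowering cost moves the break-even value and shrinks the pool of unstored data:

```python
# Worked example with assumed numbers: lowering cost or raising value
# shrinks the pool of data that is created but not stored.
annual_cost = 0.02 + 0.01 + 0.005        # Sc + Mc + Ac, $/GB/yr (assumed)
retention_years = 10
print(annual_cost * retention_years)     # 0.35 -> break-even value, $/GB

# Halving the infrastructure cost Sc lowers the threshold, so data that
# was previously discarded now clears the bar and gets stored.
cheaper_annual_cost = 0.01 + 0.01 + 0.005
print(cheaper_annual_cost * retention_years)  # 0.25
```

The same shift happens from the other direction: anything that raises the value extracted per GB, such as better analytics, moves data above the threshold without touching infrastructure costs at all.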
Once we accept the premise that data value should be measured, what would we do with this measure? On the infrastructure side, I propose that we focus solutions on maximizing the data value measure, rather than a more infrastructure-specific measure like price per GB, power usage efficiency, or monthly cross-connect cost. That will drive innovation in infrastructure solutions in two distinct directions: lower total cost of ownership on the one hand, and greater ability to increase data value on the other. This might encompass features for index and search, built-in analytics interfaces, data movers, etc.
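For instance, a buyer comparing infrastructure options might rank them by the data value they enable net of total cost, rather than by price per GB alone. The sketch below uses hypothetical option names and figures:

```python
# Illustrative sketch: rank infrastructure options by the data value they
# enable net of total cost of ownership, not by price per GB alone.
options = {
    # name: (TCO $/GB/yr, value enabled $/GB/yr)
    "cheap_cold_storage":   (0.010, 0.020),
    "searchable_warm_tier": (0.040, 0.090),  # index/search, analytics APIs
}
best = max(options, key=lambda name: options[name][1] - options[name][0])
print(best)  # searchable_warm_tier: higher $/GB, but higher net data value
```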
I believe that enterprises need to start applying their own business metrics to establish a value for their data. Historically, enterprises have done this primarily by measuring changes in data value over time, under the assumption that older data is less valuable because it is accessed less. That assumption is no longer valid: the most relevant measure of data's value is not how often it is accessed but the analytical output it produces. Future issues of this blog will discuss many of these topics in more detail.