New data center architectures present new data challenges: how data capture is driving edge-to-core data center architectures.
Data is clearly not what it used to be! Organizations of all types are finding new uses for data as part of their digital transformations. Examples abound in every industry, from jet engines to grocery stores, of data becoming key to competitive advantage. I call this new data because it is very different from the financial and ERP data that we are most familiar with. That old data was mostly transactional and privately captured from internal sources, and it drove the client/server revolution.
New data is both transactional and unstructured, publicly available and privately collected, and its value is derived from the ability to aggregate and analyze it. Loosely speaking, we can divide this new data into two categories: big data – large aggregated data sets used for batch analytics – and fast data – data collected from many sources that is used to drive immediate decision making. The big data–fast data paradigm is driving a completely new architecture for data centers (both public and private).
Over the next series of blogs, I will cover each of the top five data challenges presented by new data center architectures:
- Data capture is driving edge-to-core data center architectures: New data is captured at the source. That source might be beneath the ocean in the case of oil and gas exploration, satellites in orbit in the case of weather applications, your phone in the case of pictures, video, and tweets, or the set of a movie. The volume of data collected at the source will be several orders of magnitude higher than we are familiar with today.
- Data scale is driving data center automation: The scale of large cloud providers is already such that they must invest heavily in automation and intelligence for managing their infrastructures. Manual management is simply cost-prohibitive at the scale at which they operate.
- Data mobility is changing global networks: If data is everywhere, then it must be moved in order to be aggregated and analyzed. Just when we thought (hoped) that networks, at 40 to 100 Gbps, were finally outpacing internet bandwidth requirements, data movement is likely to increase 100x to 1,000x (see the rough transfer-time arithmetic after this list).
- Data value is revolutionizing storage: In a previous blog, “Measuring the economic value of data,” I introduced a way of thinking about and measuring data value. There is no question that data is becoming more valuable to organizations, and that the usefulness of data over longer periods of time is growing as a result of machine learning and artificial intelligence (AI)-based analytics. This means that more data needs to be stored for longer periods of time, and that the data must be addressable in aggregate in order for analytics to be effective.
- Data analytics is the driver for compute-intensive architectures in the future: The nature of analytics, and machine learning in particular, drives organizations to keep more data and aggregate it into big data repositories. These types of analytics provide better answers when applied to multiple, larger data sources. Analytics and machine learning are compute-intensive operations, so analytics on large datasets drives enormous amounts of high-speed processing. At the same time, the compute-intensive nature of analytics is driving many new ways to store and access data, from in-memory databases to 100-petabyte-scale object stores.
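To put the data mobility challenge in perspective, here is a rough back-of-the-envelope sketch of how long it takes to move a large dataset over a fast link. The dataset size and the link utilization factor are illustrative assumptions, not figures from this post:

```python
# Rough transfer-time arithmetic for moving a large dataset over a fast link.
# The dataset size and utilization factor below are illustrative assumptions.

def transfer_time_hours(dataset_tb: float, link_gbps: float, utilization: float = 0.7) -> float:
    """Estimate hours needed to move `dataset_tb` terabytes over a link of
    `link_gbps` gigabits per second at a sustained utilization factor."""
    dataset_bits = dataset_tb * 1e12 * 8           # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization  # usable bits per second
    return dataset_bits / effective_bps / 3600

# Example: 1 petabyte (1,000 TB) over a 100 Gbps link at 70% utilization.
print(f"{transfer_time_hours(1000, 100):.1f} hours")  # roughly 31.7 hours
```

Even a single petabyte ties up a dedicated 100 Gbps link for more than a day; multiply that by 100x to 1,000x more data movement and the network problem is obvious.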
Challenge No. 1: Data capture is driving edge-to-core data center architectures
New data is captured at the source. The volume of data collected at the source will be several orders of magnitude higher than we are familiar with today. For example, an autonomous car will generate up to 4 terabytes of data per day. Scale that for millions – or even billions – of cars, and we must prepare for a new data onslaught.
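As a rough illustration of that onslaught, here is a quick sketch using the 4 TB-per-car figure above; the fleet sizes are assumptions used only for the arithmetic:

```python
# Back-of-the-envelope: aggregate daily data generated by a fleet of
# autonomous cars at ~4 TB per car per day (fleet sizes are assumptions).

TB_PER_CAR_PER_DAY = 4

def fleet_daily_volume_pb(num_cars: int) -> float:
    """Daily data volume in petabytes for a fleet of `num_cars` vehicles."""
    return num_cars * TB_PER_CAR_PER_DAY / 1000  # 1 PB = 1,000 TB

for fleet in (1_000_000, 10_000_000):
    print(f"{fleet:,} cars -> {fleet_daily_volume_pb(fleet):,.0f} PB/day")
# 1,000,000 cars alone generate roughly 4,000 PB (4 exabytes) per day;
# clearly not all of it can cross the network to a central data center.
```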
It is clear that we cannot capture all of that data at the source and then try to transmit it over today’s networks to centralized locations for processing and storage. This is driving the development of completely new data centers, with different environments for different types of data, characterized by a new “edge computing” environment that is optimized for capturing, storing, and partially analyzing large amounts of data before transmitting it to a separate core data center environment.
The new edge computing environments are going to drive fundamental changes in all aspects of computing infrastructure: from CPUs, GPUs, and even MPUs (mini-processing units), to low-power, small-scale flash storage, to Internet of Things (IoT) networks and protocols that don’t require what will become precious IP addresses.
Let’s consider a different example of data capture. In the bioinformatics space, data is exploding at the source. In the case of mammography, the systems that capture those images are moving from two-dimensional images to three-dimensional images. The 2-D images require about 20 MB of storage capacity, while the 3-D images require as much as 3 GB, representing a 150x increase in the capacity required to store these images. Unfortunately, most of the digital storage systems in place to store 2-D images are simply not capable of cost-effectively storing 3-D images. They need to be replaced by big data repositories in order for that data to thrive.
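A quick calculation makes the jump concrete; the annual study count below is an illustrative assumption, while the 20 MB and 3 GB figures come from the example above:

```python
# Storage impact of moving from 2-D (~20 MB) to 3-D (~3 GB) mammography images.
# The number of studies per year is an illustrative assumption.

MB_2D = 20
GB_3D = 3

print(f"Per-image growth: {GB_3D * 1000 / MB_2D:.0f}x")  # ~150x

studies_per_year = 50_000               # hypothetical imaging center volume
tb_2d = studies_per_year * MB_2D / 1e6  # MB -> TB
tb_3d = studies_per_year * GB_3D / 1e3  # GB -> TB
print(f"2-D archive: {tb_2d:.1f} TB/year, 3-D archive: {tb_3d:.0f} TB/year")
# About 1 TB/year becomes 150 TB/year: the same workflow now needs a
# big data repository rather than a conventional image archive.
```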
In addition, the type of processing that organizations are hoping to perform on these images is machine learning-based, and far more compute-intensive than any type of image processing in the past. Most importantly, for machine learning to be effective, researchers must assemble a large number of images for processing. Assembling these images means moving or sharing images across organizations, which requires the data to be captured at the source, kept in an accessible form (not on tape), aggregated into large repositories of images, and then made available for large-scale machine learning analytics.
Images may be stored in their raw form, but metadata is often added at the source. In addition, some processing may be done at the source to maximize “signal-to-noise” ratios. The resulting architecture that can support these images is characterized by: (1) data storage at the source, (2) replication of data to a shared repository (often in a public cloud), (3) processing resources to analyze and process the data from the shared repository, and (4) connectivity so that results can be returned to the individual researchers. This new workflow is driving a data architecture that encompasses multiple storage locations, with data movement as required, and processing in multiple locations.
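The four-part workflow above can be sketched as a simple pipeline. Everything below (the class, the stage functions, the in-memory “repository”) is a hypothetical illustration of the pattern, not an actual system described in this post:

```python
# Minimal sketch of the edge-to-core imaging workflow: capture and annotate
# at the source, replicate to a shared repository, analyze centrally, and
# return results to the originating site. All names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class ImageRecord:
    image_id: str
    raw_bytes: bytes
    metadata: dict = field(default_factory=dict)  # metadata added at the source

def capture_at_edge(image_id: str, raw: bytes, site: str) -> ImageRecord:
    """Step 1: store locally and attach source metadata (e.g., acquisition site)."""
    return ImageRecord(image_id, raw, metadata={"site": site})

def replicate_to_core(record: ImageRecord, repository: dict) -> None:
    """Step 2: replicate the record into a shared (often public cloud) repository."""
    repository[record.image_id] = record

def analyze_repository(repository: dict) -> dict:
    """Step 3: run aggregate analytics over the shared repository."""
    return {"images_analyzed": len(repository)}

def return_results(results: dict, site: str) -> None:
    """Step 4: send results back to researchers at the originating site."""
    print(f"Results for {site}: {results}")

shared_repo: dict = {}
rec = capture_at_edge("scan-001", b"\x00" * 16, site="clinic-a")
replicate_to_core(rec, shared_repo)
return_results(analyze_repository(shared_repo), site="clinic-a")
```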
For manufacturing IoT use cases, this change in data architecture is even more dramatic. For example, at Western Digital, we collect data from all of our manufacturing sites worldwide, and from individual manufacturing machines. That data is sent to a central big data repository that is replicated across three locations, and a subset of the data is pushed into an Apache Hadoop environment in Amazon Web Services for fast data analytics. The results are made available to engineers all over the company for visualization and post-processing. Processing is performed on the data at the source, to improve the signal-to-noise ratio and to normalize the data. Additional processing is performed on the data as it is collected into an object storage repository in a logically central location.
Since that data must be protected for the long term, it is erasure-coded and spread across three separate locations. Finally, the data is again processed using analytics once it is pushed into Amazon. The architecture that has evolved to support our manufacturing use case is an edge-to-core architecture with both big data and fast data processing in many locations and components that are purpose-built for the type of processing required at each step in the process.
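A minimal sketch of the edge-side step described above: smoothing noisy machine readings and normalizing them before a compact payload is shipped to the central repository. The smoothing window, field names, and sample values are assumptions for illustration only:

```python
# Edge-side preprocessing sketch: improve signal-to-noise ratio and normalize
# sensor readings before shipping a compact summary to the core repository.
# Window size, field names, and sample values are illustrative assumptions.

from statistics import mean

def smooth(readings: list[float], window: int = 5) -> list[float]:
    """Simple moving average to improve the signal-to-noise ratio."""
    return [mean(readings[max(0, i - window + 1): i + 1]) for i in range(len(readings))]

def normalize(readings: list[float]) -> list[float]:
    """Scale readings to 0..1 so data from different tools is comparable."""
    lo, hi = min(readings), max(readings)
    return [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in readings]

def edge_preprocess(raw: list[float]) -> dict:
    """What gets pushed upstream: the cleaned series plus a small summary."""
    cleaned = normalize(smooth(raw))
    return {"series": cleaned, "count": len(cleaned), "mean": mean(cleaned)}

payload = edge_preprocess([10.2, 10.9, 55.0, 11.1, 10.7, 10.4])  # 55.0 is a noise spike
print(payload["count"], round(payload["mean"], 3))
```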
These use cases require a new approach to data architectures, as the concept of centralized data no longer applies. We need a logically centralized view of data, while having the flexibility to process data at multiple steps in any workflow. The volume of data is going to be so large that it will be cost- and time-prohibitive to blindly push 100 percent of data into a central repository. We need to develop intelligent architectures that understand how to incrementally process data while taking into account the tradeoffs among data size, transmission costs, and processing requirements.
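One way to think about that tradeoff is as a simple cost comparison between shipping the raw data to the core and reducing it at the edge first. The cost figures below are purely hypothetical placeholders, not real pricing:

```python
# Sketch of a placement decision: ship raw data to the core, or reduce it at
# the edge and ship only the result? All cost figures are hypothetical.

def cheaper_to_reduce_at_edge(raw_gb: float,
                              reduced_gb: float,
                              transfer_cost_per_gb: float = 0.05,
                              edge_compute_cost: float = 1.00) -> bool:
    """Return True when edge reduction plus transmitting the smaller dataset
    is cheaper than transmitting the raw data as-is."""
    ship_raw = raw_gb * transfer_cost_per_gb
    reduce_then_ship = edge_compute_cost + reduced_gb * transfer_cost_per_gb
    return reduce_then_ship < ship_raw

# 500 GB of raw sensor data that reduces to 5 GB of cleaned, normalized data:
print(cheaper_to_reduce_at_edge(500, 5))  # True under these assumed costs
```

A real system would also weigh processing time and which analytics genuinely need the raw data, but the same size-versus-transmission-versus-processing comparison sits at the heart of the decision.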
At Western Digital, we have evolved our internal IoT data architecture to have one authoritative source for data that is “clean.” Data is cleansed and normalized before it reaches that authoritative source, and once it has arrived, it can be pushed to multiple destinations for the appropriate analytics and visualization. The authoritative source is responsible for the long-term preservation of that data, so to meet our security requirements it must be on our premises (actually, across three of our hosted internal data centers). While the majority of cleansing is performed at the source, most of the analytics are performed in the cloud, giving us maximum agility.
For more information about our internal manufacturing IoT use case, see this short video by our CIO, Steve Philpott.
The bottom line is that organizations need to stop thinking about large datasets as being centrally stored and accessed. Data needs to be stored in environments that are appropriate to its intended use. We call this “environments for data to thrive.” Big data sets need to be shared for collaborative processing, aggregated for machine learning, and broken up and moved between clouds for compute and analytics. A data center-centric architecture is not a good approach to the big data storage problem. An edge-to-core architecture, combined with a hybrid cloud architecture, is required for getting the most value from big data sets in the future.
The next blog in this series will discuss data center automation to address the challenge of data scale.