Data scale is one of the top challenges that are changing the face of data centers
In the previous blog, 5 top data challenges that are changing the face of data centers, I introduced a set of challenges presented by “New Data” – data that is both transactional and unstructured, publicly available and privately collected, and whose value derives from the ability to aggregate and analyze it. Loosely speaking, we can divide this new data into two categories: big data – large aggregated data sets used for batch analytics – and fast data – data collected from many sources and used to drive immediate decision making. The big data–fast data paradigm is driving a completely new architecture for data centers (both public and private).
In this blog, I would like to focus on Challenge #2: Data scale is driving data center automation.
Public cloud providers already operate at massive scale. As smaller organizations move to public cloud, the remaining private datacenters are also getting much larger. A big driver of this scale is data – both the storage of the data itself and the compute capacity required to analyze it.
Unfortunately, at the scale of petabytes of storage and thousands of compute nodes, any manual management is simply cost-prohibitive. This is leading to a completely new set of storage architectures that can operate at large scale and require very little management of the data. That management includes moving the data, protecting the data, and making it available at the right performance level for whatever analysis is needed at any point in time – for example, big data needs different performance characteristics than fast data.
A new class of storage vendor has emerged, whose solutions accomplish this goal through a combination of 1) software defined storage, 2) commodity building-block hardware, 3) distributed scalable storage architectures, and 4) application awareness. This combination ensures that many of the storage management needs of a datacenter are essentially automated as part of the data solution itself and do not require outside manual intervention. Let’s look at each of these solution characteristics and how they make large scale datacenter operations cost effective.
Software defined storage on commodity building block hardware
By separating data management functions completely from hardware, new software defined storage solutions make it possible to build storage systems from a small set of common hardware building blocks in a datacenter. This dramatically reduces the cost of scaling: no pre-planning is required – just pop a new compute and/or storage node into a rack and turn it on. The software defined storage layer starts using the new capacity and performance immediately and, in most cases, makes that capacity available to all users. If new computing capabilities become available, there is no “lift and shift” of the existing nodes, because the software does not care about the underlying hardware. Storage is provisioned and managed in the software layer, so configuration is done once rather than every time hardware is added. Eliminating hardware planning and management, and separating storage configuration from hardware, dramatically cuts the management costs associated with large scale storage.
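To make the idea concrete, here is a minimal sketch of how a software layer might absorb new commodity nodes into a single pool. The class and method names are hypothetical illustrations, not any vendor’s API:

```python
# Sketch: a software defined storage pool that simply grows as commodity
# nodes are added. Names and structure are illustrative assumptions.

class StorageNode:
    def __init__(self, node_id: str, capacity_tb: float, media: str):
        self.node_id = node_id
        self.capacity_tb = capacity_tb
        self.media = media  # e.g. "flash" or "hdd"

class StoragePool:
    """Aggregates whatever nodes are present; no per-node provisioning."""
    def __init__(self):
        self.nodes = []

    def add_node(self, node: StorageNode):
        # "Pop a node into a rack and turn it on": the pool just gets bigger.
        self.nodes.append(node)

    def total_capacity_tb(self) -> float:
        return sum(n.capacity_tb for n in self.nodes)

pool = StoragePool()
pool.add_node(StorageNode("rack1-n01", 96.0, "hdd"))
pool.add_node(StorageNode("rack1-n02", 32.0, "flash"))  # new node, no lift and shift
print(pool.total_capacity_tb())  # new capacity is immediately usable
```

The point of the sketch is that capacity planning happens in software, once, rather than each time a node is racked.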
Distributed scalable storage architectures
Managing and provisioning individual storage units at large scale is essentially impossible. This has led to two compute and storage architectures that address the issue: hyperconverged and hyperscale. Both provide small building-block units of compute and storage that can be scaled out as the datacenter grows, and software defined storage on top of them uses those building blocks to grow capacity and performance. A critical component of these solutions is a distributed storage software architecture: storage is aggregated from across the entire hyperconverged or hyperscale hardware system and presented to users and applications as logical units of storage that can be any size. Most of these software defined architectures are also smart enough to take advantage of different types of storage that operate at different performance levels (such as flash, hard drives, and cold storage). The software layer provides data management functions such as placing data on the right tier, caching data for performance, replicating data for reliability, and deduplicating data for efficiency. Smart storage software that can exploit large scale, non-uniform hardware eliminates another big swath of management cost.
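The following sketch illustrates two of those ideas in simplified form: hash-based placement of data replicas across all nodes in the cluster, and a simple rule for choosing a storage tier. The placement scheme and tier thresholds are assumptions for illustration, not a description of any particular product:

```python
# Sketch: distributed placement and tiering in a software defined storage layer.
import hashlib

NODES = ["node-01", "node-02", "node-03", "node-04"]
REPLICAS = 3  # keep multiple copies for reliability

def place_block(block_id: str, nodes=NODES, replicas=REPLICAS):
    """Deterministically spread a block's replicas across the cluster."""
    start = int(hashlib.sha256(block_id.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

def choose_tier(accesses_per_day: float) -> str:
    """Hot data goes to flash, warm data to hard drives, cold data to archive."""
    if accesses_per_day > 100:
        return "flash"
    if accesses_per_day > 1:
        return "hdd"
    return "cold"

print(place_block("volume-7/block-001"))  # e.g. ['node-02', 'node-03', 'node-04']
print(choose_tier(500))                   # 'flash'
```

Because placement and tiering decisions like these are made automatically by the software, no administrator has to decide where any individual piece of data lives.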
Application Awareness
At the end of the day, data is accessed and used by applications, which means most storage is provisioned to individual applications. Historically, this was a manual operation performed by expert administrators such as DBAs (Database Administrators). But in a world of large scale analytics frameworks like Hadoop and memory-centric data stores like Spark and MongoDB, manually provisioning storage to applications in units like LUNs is simply not feasible. Many of these applications also run in some kind of virtual environment, such as VMs or containers. New storage architectures often present logical storage from a large aggregated pool directly to the application, which eliminates a couple of big manual provisioning steps that storage historically required. In some cases, software defined storage architectures can give each application logical storage that is specifically created to match its unique performance requirements. For example, a MongoDB installation will typically want all-flash storage, while a large Hadoop cluster can operate quite effectively on hard drives. The software defined storage solution can manage these choices automatically, with some guidance from the application.
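One way to picture application awareness is as a policy lookup: the application declares a profile, and the storage layer maps that profile to media type, replication, and other settings. The profile names and settings below are illustrative assumptions only:

```python
# Sketch: application-aware provisioning via a simple policy table,
# instead of an administrator carving out LUNs by hand.

POLICIES = {
    "mongodb": {"media": "flash", "replicas": 3, "dedup": False},
    "hadoop":  {"media": "hdd",   "replicas": 3, "dedup": False},
    "archive": {"media": "cold",  "replicas": 2, "dedup": True},
}

DEFAULT_POLICY = {"media": "hdd", "replicas": 3, "dedup": False}

def provision_volume(app_profile: str, size_tb: float) -> dict:
    """Return a logical volume description matched to the application."""
    policy = POLICIES.get(app_profile, DEFAULT_POLICY)
    return {"size_tb": size_tb, **policy}

# A container running MongoDB asks for storage and gets flash-backed capacity;
# a Hadoop cluster gets hard-drive-backed capacity.
print(provision_volume("mongodb", 2.0))
print(provision_volume("hadoop", 500.0))
```

In a real system the "guidance from the application" might come from a VM or container orchestration layer rather than a hard-coded table, but the principle is the same: the mapping is automated, not hand-managed.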
The bottom line for large scale datacenters is a requirement to move away from storage “in the box” and toward hyperconverged- and hyperscale-friendly software defined storage architectures that operate at scale with a minimum of manual intervention and management.
Keep an eye out for the next blog in this series on challenges that are changing the face of the datacenter, coming soon.