Big Data-Enabled Predictive Analytics to Tackle Client Incidents at Intel

Intel’s extensive PC infrastructure, with more than 95,000 machines, generates about 80 percent of all incidents, such as system or application problems, reported across the enterprise. Enterprise PCs therefore account for the highest volume of incidents, which makes them a high-priority target for potential cost savings.

A problem detected in one client may soon spread to thousands of similar machines, leading to acute disruption of user productivity, loss of valuable work, and loss of IT assets. One of Intel’s long-standing goals has been to solve client issues proactively to reduce the probability of recurrence. In 2013, it was estimated that reducing or eliminating client PC incidents could lead to a 40 percent cost savings for enterprise IT.

The client incident prediction proof of concept (PoC) described here also has the potential to unlock new value in other data logs, such as those captured in Intel’s manufacturing, marketing, supply chain, or other functions.

The goal

In early 2013, Intel’s enterprise IT organization set a goal to reduce the volume of reported IT incidents requiring human intervention by 40 percent by the end of the year. To meet this goal, Intel IT designed a client incident prediction PoC using Intel® Distribution for Apache Hadoop with Hadoop version 2.2. Text analytics was used to detect correlations between millions of client event log entries and thousands of client incident reports, with the aim of anticipating and solving client problems before they spread across the enterprise.

Using incident prediction for problem management brings predictive analytics to the forefront (the service desk) and calls for a proactive rather than a reactive approach. For Intel IT, this essentially meant approximately 20 percent fewer incidents per year.

The solution methodology

Windows Event Log meticulously records all issues related to a client machine, from an application startup failure to an incomplete action. These events are categorized as critical, error, information, or audit, based on their nature and severity. Event logs can easily amount to about 2,000 entries per day per machine, with an average of 40 critical events. When this volume of data is scaled up to 95,000+ clients across Intel, the total volume of data may be in the area of 19 million events per day, generating up to 300 gigabytes of data across the enterprise in a quarter.

Applying predictive analytics tools to event logs and incident reports, which combine structured and unstructured data, requires at least a year’s worth of data, amounting to 1 terabyte or more. Before the client incident prediction PoC, most of the event log data was left unused by Intel IT. The tremendous workload and time requirements associated with this kind of analytics have forced many IT organizations to give up the long-term goal of predicting incidents before they happen.

In 2009, Intel IT made a major move towards a proactive approach to problem management. The team engineered a tool for collecting blue screen system crash data from thousands of client PCs to identify the root cause of incidents.

When the team began to provide solutions for the top-priority issues, the number of blue screens fell from 5,500 a week to fewer than 2,500 a week. This exercise also helped identify client machines with repeated occurrences of the same problems. The client incident PoC aimed to address a given incident only once. The ultimate goal was to match client incidents across similar machines and then diagnose the problem’s source (root cause). Once a solution was found for a particular incident, it could be implemented across enterprise PCs, preventing a recurrence of the problem.

To enable this solution, Intel IT used a big data platform based on Intel® Distribution for Apache Hadoop software with Hadoop 2.2. The results of comprehensive text analytics of client event logs and client incident reports were compared via a data visualization solution to trace client problems back to their first appearance in the environment. This new-found capability helped the IT team anticipate PC problems before they happened and, on many occasions, enabled solutions before the problem appeared.

For a full view of the solution architecture, read the 10-page white paper Reducing Client Incidents through Big Data Predictive Analytics. The paper relies throughout on a key concept known as a symptom: a group of client events that are essentially identical to each other except that they occurred at different times and on different machines.
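To make the symptom concept concrete, here is a minimal Python sketch that groups events into symptoms and finds where each symptom first appeared. It is an illustration only, not the implementation from the white paper: the ClientEvent fields, the function names, and the sample data are all assumptions, and a real pipeline would probably also normalize message text before comparing events.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ClientEvent:
    # Hypothetical, simplified view of one Windows event log entry.
    machine: str
    timestamp: datetime
    source: str
    event_id: int
    level: str
    message: str


def symptom_key(event):
    """Identify a symptom by everything except the machine and the time."""
    return (event.source, event.event_id, event.level, event.message)


def group_into_symptoms(events):
    """Map each symptom key to the list of events that share it."""
    symptoms = defaultdict(list)
    for event in events:
        symptoms[symptom_key(event)].append(event)
    return dict(symptoms)


def first_appearance(symptom_events):
    """Return the earliest event of a symptom, i.e. where it first showed up."""
    return min(symptom_events, key=lambda e: e.timestamp)


# Made-up sample data: two machines share one symptom, a third has another.
events = [
    ClientEvent("pc-001", datetime(2013, 3, 4, 9, 15), "ExampleApp", 1001,
                "error", "ExampleApp failed to start"),
    ClientEvent("pc-002", datetime(2013, 3, 2, 14, 0), "ExampleApp", 1001,
                "error", "ExampleApp failed to start"),
    ClientEvent("pc-003", datetime(2013, 3, 5, 8, 30), "ExampleDisk", 2002,
                "critical", "Storage device reported a fault"),
]

symptoms = group_into_symptoms(events)
for key, members in symptoms.items():
    origin = first_appearance(members)
    print(key[0], len(members), "event(s), first seen on", origin.machine)
```

Grouping on everything except machine and time mirrors the definition above, and taking the earliest event in each group is one simple way to trace a problem back to its first appearance in the environment.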

The achievements

The client incident PoC helped Intel’s enterprise IT organization achieve the following:

Create a solution architecture based on predictive analytics to derive value from millions of Windows event logs and customer incident reports generated by 95,000+ client systems.
Reach 78 percent accuracy in predicting the occurrence of incidents by scanning millions of events and thousands of incidents.
Apply advanced natural language processing and information retrieval techniques to correlate event log data with incident reports.
Use data visualization tools to determine the likelihood, severity, and distribution of a problem and provide solutions.
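The article does not say which natural language processing or information retrieval techniques were used to correlate event logs with incident reports. As a rough illustration of the general idea, the sketch below ranks event messages against a free-text incident report using TF-IDF weighting and cosine similarity, a standard information retrieval technique chosen here as an assumption. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed inverse document frequency, always positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_messages(incident_text, event_messages):
    """Return (score, message) pairs, most similar to the incident first."""
    vectors = tfidf_vectors(event_messages + [incident_text])
    incident_vector = vectors[-1]
    scored = [(cosine(incident_vector, vec), msg)
              for vec, msg in zip(vectors, event_messages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up sample texts for illustration only.
event_messages = [
    "Application outlook.exe stopped responding",
    "Disk controller error on device harddisk0",
    "Wireless adapter driver failed to start",
]
incident = "Outlook stopped responding and froze when I opened my email"

ranked = rank_messages(incident, event_messages)
for score, message in ranked:
    print(round(score, 3), message)
```

In this toy example only the first message shares vocabulary with the incident report, so it ranks highest. A production system working at the scale described above would need far more than this, including distributed processing and better text normalization.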

You can read the complete white paper here: Reducing Client Incidents through Big Data Predictive Analytics
