My 10+ years professional track record within the European Institutions have given me the possibility to participate in major projects as the eParliament Program, MEMMOIRE, and a number of inter-institutional projects. These experiences helped me grow my skills in business process analysis, technical writing, and agile prototyping.
Driven by my passion for Business Intelligence, Machine Learning, and Data Mining, I was mostly involved in projects where I was able to implement predictive analytics and build a number of custom-developed Business Intelligence solutions that are widely used within the European Institutions.
Through these blog writings I hope to share my passion for Data Mining as well as some business and technical insights I have gathered throughout my career.
In this first article I will discuss the role of predictive analytics in Human Resources while explaining some of the basics of analytics and data mining. Human Resources is a field where predictive analytics can add enormous value.
Why HR Needs Analytics
All organisations hire employees and subcontractors. These employees work and evolve within a company, passing through different phases of their work and personal life. Employees have relationships and communication patterns and they engage in different activities that we can use to learn about them.
During the recruitment stage the candidate takes a psychoanalytic test. This test will try to describe the person’s profile and how he or she will perform within the organisation.
Once the candidate is hired, his/her manager will prepare yearly or quarterly reports on the progress of that individual. This would show how well the employee performs in executing the tasks assigned to him. Ideally these reports would be compared to the initial profile to see if expectations are met.
As businesses evolve it is a challenge to keep employees motivated and prevent them from leaving. Employees need to be motivated and challenged as they go through different phases of their careers and personal life. The Hackett Group says the cost to replace someone is between 80 and 120 percent of an employee’s fully loaded costs (in other words, total pay plus benefits and overheads). It is clear that talent is the biggest asset of a company and career guidance and talent retention are key values to business success.
An Overview of the Methodology
Our approach is to pinpoint the profiles and the points in time in which employees have a tendency to lose motivation. We will try to find patterns in our data and draft a description of employees that need attention.
Let’s imagine we have a career guidance unit within our organization that helps employees during these phases of their career. Every meeting between the employee and the career guidance unit is well documented and encoded in an internal system. This gives us the possibility to craft a 360° profile of a ‘customer’ of the unit. Once this 360° profile is defined we can use it to find potential ‘customers’ and take necessary actions early on.
Let’s first start by a short description of the different types of statistics.
Descriptive statistics will show a snapshot at a given time period and ideally we can compare them with a past situation to describe evolution.
In diagnostic statistics we can search for relations in the data and try to explain tendencies. Using obtained knowledge and existing data we can build a model and make predictions. We can predict the success rate of a candidate, the evolution of a customer, the amount of demand, etc.
Statistical reports are usually based on a static model created upon business request. Reports are generated on a regular basis to be presented to the parties concerned. New report creation is usually time and labor consuming.
Description of the Tools
One way of solving the static reports issue is using OLAP (online analytical processing) cubes. There are a number of open source projects implementing OLAP cubes, such as OLAP4J and Mondrian.
Mondrian is an engine for analytics. It accepts analytical queries and converts them into relational queries, returning the data in a form that supports analytics. But for Mondrian to be useful to business users, it needs some sort of interface and application to run it. There are a number of common ways that Mondrian can be deployed. It can be set up in an application server to run on its own and provide services. This approach tends to not be very user-friendly because it doesn’t have a nice graphical user interface. Mondrian can also be embedded in a standalone application.
Mondrian works with multidimensional cubes. A cube is a collection of dimensions and measures, all centered on a fact table. A cube is the context for a report or interactive analysis. A fact table holds the transactions and the relations to the other tables. In our case a fact table represents the appointments and the employee attributes such as age, gender, position, education, etc.
Cubes are composed of a Schema, Measures and Dimensions. Measures are the values used for the analysis, such as quantity ordered, sales, inventory, number of website visits, etc. Dimensions are attributes about the data for analysis, such as time (Yearly, Quarterly, and Monthly), geography (Branches, Location), employee demographics (Age, Gender) and so on.
Cubes tend to be a complete set of measures and attributes for doing an analysis on the set of events. For instance, if you’re interested in meetings by employee attributes, you might want to look at the number of meetings (events or measures) by employee’s age (attributes). A cube collects these things into one place, ready for analyzing and querying.
The illustration bellow shows the use of OLAP system.
In order to create a 360° profile of our “employee customers” we will identify the measures and the dimensions that correspond to our employees’ attributes. For that matter we need to find as much information as we can from every employee who visited the career guidance unit. These tables will usually hold a large amount of data and tend to become computationally expensive. Furthermore, you risk encountering performance issues once working with real-life data.
It is recommended that you create Materialized Views and refresh those using jobs on a regular basis. The Materialized View will represent the data as a flat table that we will use as a fact table.
For our example we will create a fact table called employee. Attributes such as marital status, gender, management role will be used to create the dimensions of the cube. In the real world you will need to identify the respective tables that hold the relative information and join them.
Now that we have our fact table, let’s go ahead and create the cube holding the dimensions and measures. Our cube named HR will be constructed from the Dimension Employee with the attributes : Manager Id, Employee Id, Store Id, Marital Status, Gender, Salary, Education Level, etc.
The cube is created using XML mark-up and looks like the image below. The attributes of the dimension reference to the respective columns of the table employee:
The attribute Manager Id maps to the employee table with the supervisor_id to find the Manager of the given Employee. As you can see, the name appearing on the report can be different from the column name. This gives you the liberty to provide names that are easily understood by the business users.
Within the dimensions, data can be viewed by level, such as department, store, and time. Once deployed on the OLAP, application users can drag the dimensions and measures to the canvas. There’s no need to understand the structure of the database or use a query language to do analysis.
For now we will introduce the measure Count which represents the number of employees who have been in touch with our career guidance unit. The Count measure is created by an aggregator that will count the number of employee_id in our table. Since employee_id is unique, we are confident that the measure will provide the correct number of employees.
Now that we have our first Dimension and Measure let’s go ahead and see what we can achieve using OLAP.
We would like to structure the data so that we can have an overview of the number of employees grouped by Gender and Positon. Thus, we can answer questions like:
- How many male Store Managers have been in touch with the Carreer guidance unit?
To achieve this we drag the Measure, Count, into the canvas. We do the same for the Dimensions: Positions Title and Gender.
The data is retrieved from the table Employee and grouped by the Attributes Gender and Position Title. Now we have the number of Men and Women and their respective roles within the organisation, all of whom have visited our career guidance unit.
As you can see it is really easy to handle your data and structure it as you find fit. Let’s go ahead and add more attributes from the Dimension Employee to the grid. We are going to drag the Education Level and the Marital Status (M=Married, S=Single) to the grid. The grid and chart generated gives a nice summary of the profile of the employees, so we can start to build a basic idea of the profile of our potential “customers.” The result can be seen in the image below.
We can easely swap the axes to represent the same grid from another perspective.
After adding the Pay Type and Management Role to the grid we can start to see some tendencies in our data. For instance there are twice as many female, married VP Country Managers as their male counterparts who visited our career guidance unit. Thus, it’s obvious that before making conclusions, one would need to account for this finding.
At this point it would be interesting to know how tendencies evolve over time. We will need to create the Dimension Time, which requires us to translate the Date field into Years, Quarters, Months and Days. We could create separate columns based on the meeting date.
TO_CHAR (meeting.date, ‘YYYY’) AS year,
TO_CHAR (meeting.date, ‘MM’) AS month,
TO_CHAR (meeting.date, ‘DD’) AS day,
TO_CHAR (meeting.date, ‘W’) AS week,
Another way is to create a separate table holding the fields time dimension attributes (Year, Quarter, Month) and reference them using the “time_id“.
CREATE CACHED TABLE PUBLIC.”time_by_day“(
“time_id” INT NOT NULL,
“the_date” DATETIME DEFAULT NULL,
“the_day” VARCHAR(30) DEFAULT NULL,
“the_month” VARCHAR(30) DEFAULT NULL,
“the_year” SMALLINT DEFAULT NULL,
“day_of_month” SMALLINT DEFAULT NULL,
“week_of_year” INT DEFAULT NULL,
“month_of_year” SMALLINT DEFAULT NULL,
“quarter” VARCHAR(30) DEFAULT NULL,
“fiscal_period” VARCHAR(30) DEFAULT NULL);
We can now add the Dimension Time in our HR Cube. We reference to the time_by_day table with type=’TIME’ and attributes Year, Quarter and so on.
The Dimension Time is represented on the panel as such:
Let’s group the grid by year and quarters. For the sake of simplicity the image shows only Q1, Q2, and year 1998.
What have we achieved so far?
We created the OLAP HR Cube that holds the dimensions and measures needed to create a 360° profile of employees needing career guidance. We are now able to create reports and dashboards using drag and drop. We are able to structure the data and discover tendencies. We can generate reports on the fly that are easily readable by the business. In the next blog post, I will explain how to make predictions based on the 360° profile we just created.