As data scientists, one of the key ways we help our collaborators and clients is by sifting through large, complex collections of data to help identify the underlying factors associated with a particular outcome (e.g. click-through rate, conversion rate, predicted revenue). In practical terms, the identification of these predictive variables provides insight into how certain components of ad campaigns, customer demographics, etc. are related to outcomes. By having a better understanding of what these factors are, we can more accurately address and predict user needs.
In statistical/machine learning parlance this is commonly referred to as variable or model selection. If you have taken a stats course on linear models or searched for variable selection methods for regression or classification, you have probably come across some of the staples, such as forward, backward and stepwise selection. While these approaches work well when there are a limited number of variables (on the order of ten or less), as we move to tens, hundreds, thousands and more (as in many web-scale applications) they are simply no longer practical or reliable. From a practical standpoint, filtering out irrelevant variables allows us to focus time and resources on the factors (e.g. ad campaigns, user demographics, the device our site was accessed from, etc.) that are driving sales, conversions, clicks, and other important metrics.
I’ll give a quick overview of these procedures, highlight some of their issues and introduce a more “modern” framework that leverages so-called sparsity constraints that many companies, including the likes of Yahoo and Google, use to wrangle data with many variables. For this post I’ll focus on linear regression and show how to implement these various methods in R.
Classical Model Selection (this is by no means exhaustive!)
First, recall that given a set of input measurements x1, x2, …, xp (e.g. customer demographics, time of day a user logged on, etc.), where xj = (x1j, …, xNj) collects the N observations of the j-th variable, the standard linear model is expressed as:
y = β0 + β1x1 + β2x2 + … + βpxp
where y is the outcome of interest (e.g. total sales for a particular user), β0 is the intercept term (representing, for example, average sales) and the βj's are the variable weights (i.e. coefficients representing each variable's importance/contribution to the observed outcome). The βj's are estimated by minimizing the following loss function with respect to β:

RSS(β) = Σi (yi − β0 − β1xi1 − β2xi2 − … − βpxip)²

where the yi's are the observed outcomes, the subscript i denotes a single observation (e.g. one user's visit to the site) and RSS stands for "residual sum of squares".
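As a quick sanity check of this criterion (using R's built-in mtcars data purely for illustration), the coefficients returned by lm() match those obtained by minimizing the RSS numerically:

```r
# A minimal sketch: lm() returns the coefficients that minimize the RSS.
# mtcars is used purely for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)

# The same estimates recovered by minimizing the RSS directly.
rss <- function(beta) {
  y_hat <- beta[1] + beta[2] * mtcars$wt + beta[3] * mtcars$hp
  sum((mtcars$mpg - y_hat)^2)
}
optim(c(0, 0, 0), rss, method = "BFGS")$par   # ~ coef(fit)
```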
Approaches to Variable Selection
Here are some of the classical approaches used to assess which variables to keep:
Forward selection: Starting with no variables in the model, add one at a time based on which is the most statistically significant. Significance is typically assessed using one of several criteria: an individual variable's p-value; a comparison of the model with and without that variable (a "full" versus "reduced" model comparison that also yields a p-value); or an "information criterion" such as AIC or BIC. The algorithm continues adding variables and terminates when none of the remaining variables has a significant p-value (e.g. less than 0.05).
Backward selection: Similar to forward selection, except that now we start with all of the variables in the model and remove one at a time; which variable is removed is, once again, most commonly determined by a p-value. The algorithm terminates when all remaining variables are significant and the non-significant ones have been removed.
Stepwise selection: A combination of the forward and backward procedures. There are a number of variants of this algorithm; the most common one follows a forward procedure, but after a new variable has been added, one backward selection step is taken to check whether any variables should be removed (see the R sketch after this list).
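As a minimal sketch of these three procedures, R's built-in step() function (which scores candidate models by AIC rather than raw p-values) can run all of them; the mtcars data is used here purely for illustration:

```r
# A sketch of forward, backward and stepwise selection via step(),
# which compares models by AIC; mtcars is used purely for illustration.
full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

# Forward: start empty, add one variable at a time.
fwd  <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Backward: start full, remove one variable at a time.
bwd  <- step(full, direction = "backward", trace = 0)

# Stepwise: allow both additions and removals at each step.
both <- step(null, scope = formula(full), direction = "both", trace = 0)

lapply(list(forward = fwd, backward = bwd, stepwise = both), formula)
```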
As previously mentioned, while the methods above work sufficiently well for a handful of variables, things start getting messy as we move to larger sets. For example, even at 20 variables, the stepwise approach is not only comparatively slow but also highly unreliable. Consider the fact that we're only exploring a small subspace of the more than 1 million (2^20) possible models (i.e. all models built from subsets of 1, 2, 3, etc. of the 20 variables). This issue is rapidly compounded as we move to larger numbers of variables. What we'd ideally like is an approach that searches this enormous space of models more efficiently.
Illustration: Restaurant Revenue Prediction
To make this a little more concrete, consider the following example: a recently organized competition asked participants to predict restaurant revenue based on a collection of variables like location, demographics, etc. The data provided contained 137 restaurants and 42 variables, and was to be used to build a model to predict revenue for 100,000 restaurants whose revenue information was "masked." Since the actual revenue for those 100,000 restaurants is not publicly available, we'll focus our analysis on the data provided.
The first takeaway is the large number of variables compared to the number of restaurants (this imbalance is not uncommon in web-scale data, where there may be billions of observations but also millions of variables). Generally speaking, these types of scenarios, if not carefully accounted for, tend to produce models that fit the noise in the available data and therefore do not predict accurately outside of it (commonly referred to as "overfitting").
The appeal of model selection procedures like those mentioned above is that they help us avoid overfitting by eliminating variables that are not predictive and end up introducing noise. This also allows for clearer insights into what's driving revenue. For someone looking to invest in a restaurant, this information might be valuable, as it would tell them which factors are most strongly associated with success and growth.
Below we include an R (http://cran.r-project.org/) code snippet (complete code will be made available on GitHub) showing the results for a model where all the variables are left in, and another where variables have been selected using the stepwise procedure (we leave it to the reader to try the "forward" and "backward" approaches).
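A minimal sketch of that analysis, assuming the competition's training data has been read into a data frame called rest with a numeric revenue column and already-cleaned predictors (the names and the 70/30 train/test split here are illustrative):

```r
# A sketch comparing the full 42-variable model with a stepwise-selected
# model (assumes a hypothetical, pre-cleaned data frame `rest` with a
# numeric `revenue` outcome; the 70/30 split is illustrative).
set.seed(1)
idx   <- sample(nrow(rest), size = round(0.7 * nrow(rest)))
train <- rest[idx, ]
test  <- rest[-idx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Model with all of the variables left in.
full_fit <- lm(revenue ~ ., data = train)
rmse(test$revenue, predict(full_fit, newdata = test))

# Stepwise selection starting from the full model (AIC-based).
step_fit <- step(full_fit, direction = "both", trace = 0)
rmse(test$revenue, predict(step_fit, newdata = test))
length(coef(step_fit)) - 1   # number of variables retained
```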
From the results above we can see a small bump in performance, with the number of variables left in the model dropping from 42 to 29.
Sparsity-Constrained Model Selection
A more modern approach that's become a mainstay for analyzing data with many variables is the LASSO (Least Absolute Shrinkage and Selection Operator), also referred to as "sparse" regression. The idea behind the LASSO is to leverage a class of mathematical constraints on the variable weights (i.e. coefficients) to more efficiently search through the space of models. From a practical standpoint, this gives us the ability to take a large collection of variables and home in on those that are most predictive of/relevant to the outcome of interest, e.g. revenue. This constraint takes the form:

|β1| + |β2| + … + |βp| ≤ t
where the left-hand side of this inequality is the l1-norm of the coefficient vector β = (β1, β2, …, βp). At first glance it's not exactly intuitive how or why this helps with model selection. To gain some insight into what this constraint is doing, it helps to look at a picture of the constraint region compared to the standard RSS criterion.
What's being shown here is the following: the point at the center of the red contours is the unconstrained least squares estimate, and the contours surrounding it (imagine that these are coming "at you," as in a topographic map) correspond to other values of β with progressively larger RSS. The diamond-shaped region centered at the origin is the region satisfying the l1 constraint. The constrained estimate is found where the expanding RSS contours first make contact with this diamond. The main takeaway is that the "sharp corners" of the diamond lie directly on the coordinate axes, so this contact often happens at a corner, where some of the βj's are exactly 0. This effectively "removes" those variables from our model.
By decreasing the value of t in the above constraint we force more of the coefficients to be exactly 0 (and by increasing it, fewer). Various procedures exist for selecting a specific t, e.g. Mallows' Cp, a criterion that looks to balance model complexity (i.e. the number of non-zero coefficients) with how well the included coefficients predict the outcome.
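As a concrete sketch of this behavior, the glmnet package fits the LASSO over an entire grid of tuning-parameter values; note that glmnet parameterizes the constraint through a penalty weight lambda rather than t itself (larger lambda corresponds to a smaller t), and below cross-validation, rather than Mallows' Cp, is used to pick that value. The mtcars data stands in for a real dataset purely for illustration:

```r
# A sketch of how the sparsity constraint removes variables, using the
# glmnet package and R's built-in mtcars data purely for illustration.
# glmnet uses a penalty weight lambda; larger lambda corresponds to a
# smaller t, i.e. more coefficients set exactly to 0.
library(glmnet)

x <- as.matrix(mtcars[, -1])   # predictors
y <- mtcars$mpg                # outcome

fit <- glmnet(x, y, alpha = 1)                      # alpha = 1 gives the LASSO
data.frame(lambda = fit$lambda, nonzero = fit$df)   # sparsity along the path

# Cross-validation (rather than Mallows' Cp) to pick the tuning parameter.
cvfit <- cv.glmnet(x, y, alpha = 1)
coef(cvfit, s = "lambda.min")                       # dots mark removed variables
```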
There are a number of algorithms that allow us to solve this problem quickly, efficiently and scalably. Considerable theory has also been developed to show that under certain conditions the LASSO model (and some variants of it we’ll discuss in future posts) can recover the “true” set of predictive variables.
The big takeaway here is that for large, complex problems there are powerful modeling tools to help us gain insight into what exactly is going on.

Restaurant Revenue Prediction Revisited

Returning to the restaurant revenue prediction problem and applying the LASSO model, we have the following:
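What follows is a sketch of how this might look using cv.glmnet, continuing with the hypothetical rest/train/test objects and the rmse() helper from the earlier snippet; model.matrix() is used to expand any categorical variables into indicator columns:

```r
# A sketch of the LASSO applied to the same (hypothetical) train/test
# split used above; variable names and preprocessing are illustrative.
library(glmnet)

x_train <- model.matrix(revenue ~ ., data = train)[, -1]  # drop intercept column
x_test  <- model.matrix(revenue ~ ., data = test)[, -1]

cvfit <- cv.glmnet(x_train, train$revenue, alpha = 1)

pred <- predict(cvfit, newx = x_test, s = "lambda.min")
rmse(test$revenue, pred)

# Variables with non-zero coefficients at the selected tuning parameter.
b <- as.matrix(coef(cvfit, s = "lambda.min"))
rownames(b)[b != 0]
```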
This is a considerable improvement (~4x) in performance over the standard linear model and the stepwise procedure. Additionally, the total number of variables remaining in the model is down to 13, allowing us to focus our efforts and dig deeper into the underlying factors driving revenue.
The standard LASSO just described is pretty powerful in and of itself, but a variety of extensions allow us to look for other, more complex sparsity patterns. In the next post I'll discuss the group LASSO, which allows us to capture sparsity patterns in things like multi-level categorical variables (e.g. the type of device a user accesses a web page from).
NB: Typically such an analysis would include resampling the data into "training" and "testing" sets many times to more reliably estimate model prediction performance. What's presented here is for illustrative purposes only.
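For completeness, a sketch of such a resampling scheme, again using the hypothetical rest data frame and rmse() helper from the snippets above:

```r
# A sketch of repeated random train/test splits to get a more stable
# estimate of out-of-sample RMSE (hypothetical `rest` and `rmse()` as above).
set.seed(1)
rmses <- replicate(100, {
  idx <- sample(nrow(rest), size = round(0.7 * nrow(rest)))
  fit <- lm(revenue ~ ., data = rest[idx, ])
  rmse(rest$revenue[-idx], predict(fit, newdata = rest[-idx, ]))
})
mean(rmses)
```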