Tim Hesterberg relates a story about asking renowned statistician Brad Efron for the most important problem in the field. Instead of something involving the bootstrap, Efron made a clear and perhaps surprising choice: variable selection in regression. When a pre-eminent statistician highlights a problem such as this, it’s advisable to take notice.
The problem is becoming even more important with each passing year. In nearly every discipline of science and engineering, modern advances in technology have made it easier to collect more and more data. For example, in a genome-wide association analysis, there are typically several hundred subjects, and on each you measure a trait of interest along with several thousand or even a million genetic markers. The goal is to model and predict the trait as a function of the markers, but with so many markers to choose from, how do you select an appropriate subset?
Furthermore, particularly when there are more predictors than observations, an omnipresent danger is that of overfitting. This happens when you include too many variables in order to achieve what appears to be an excellent-fitting model. When new observations become available, the predictions are poor because the model is tuned too tightly to the original data and does not generalize. So how do you guard against overfitting?
Consider a prototypical data matrix, with n rows representing individuals or experimental units, one column y representing a trait or characteristic of interest, and p additional columns (denoted x1 through xp) representing variables, attributes or features measured on those units. The matrix becomes wide when p > n, and often p is an order of magnitude or more larger than n. In many cases when p > n we can even build a model that predicts the fitted data perfectly, but it will perform terribly for new observations. The goal of a variable selection technique is to model y as a function of some subset of x1 through xp that avoids overfitting.
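To make the p > n danger concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than part of the original discussion: the sizes (n = 10, p = 30), the simulated data in which y is pure noise unrelated to any predictor, and the choice of a minimum-norm least squares fit without an intercept. The fitted model reproduces the training rows almost exactly, yet it has nothing real to say about new rows.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def mse(X, y, beta):
    """Mean squared error of the linear predictions X @ beta against y."""
    preds = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((p_hat - t) ** 2 for p_hat, t in zip(preds, y)) / len(y)

rng = random.Random(0)
n, p = 10, 30  # assumed sizes: a wide matrix with p > n

def make_data(rows):
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(rows)]
    y = [rng.gauss(0, 1) for _ in range(rows)]  # y is unrelated to every column of X
    return X, y

X_train, y_train = make_data(n)
X_test, y_test = make_data(200)

# Minimum-norm least squares for p > n: beta = X^T (X X^T)^{-1} y
gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in X_train] for ri in X_train]
alpha = solve(gram, y_train)
beta = [sum(alpha[i] * X_train[i][j] for i in range(n)) for j in range(p)]

train_mse = mse(X_train, y_train, beta)
test_mse = mse(X_test, y_test, beta)
print(f"train MSE: {train_mse:.2e}")
print(f"test MSE:  {test_mse:.2f}")
```

The training error is zero up to rounding because any n generic rows can be interpolated exactly once p is at least n. The test error stays large because the fit has memorized noise.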
Three key objectives behind variable selection
- Understanding relationships among variables: Gaining an understanding of the true underlying relationships among variables in a stochastic system is perhaps the loftiest and most important goal of any statistical modeling exercise. Knowledge of these relationships in the right context advances scientific know-how and enables generalizable predictions. In its fullest form, this involves the often difficult transition from association to causality.
- Weeding out unnecessary variables: A good initial step in gaining understanding of variable relationships is to identify and eliminate variables that have no bearing on the trait of interest. Imagine driving to a new place with only a printed set of directions. In this case, a list of turns and road names is sufficient, and a detailed description of landmarks may be too much to handle. Similarly, when doing variable selection, you’re weeding out unnecessary variables that get in the way and hurt your model’s performance.
- Building a generalizable predictive model: In some cases you may not be as concerned about understanding a model as you are about it performing well. You mainly want the predictions to be as good as they can be, even if the model is somewhat of a black box. Variable selection can be an important part of building such a model.
Main methods for variable selection
Popular methods for variable selection can typically be mixed and matched with different kinds of statistical models and with each other. For example, we may want to use a simple statistical filtering method to reduce the number of predictors to a manageable size before using a computationally intensive method like a genetic algorithm.
The intuition behind many of the methods is similar to selecting a good sports team. Taking basketball as an example, if we only chose a team of tall centers, we would likely get a lot of rebounds and blocked shots but would have trouble bringing the ball up the court, defending against a fast break and shooting three-pointers. A good basketball team has excellent players in each role that complement each other in ball handling, defending, shooting and rebounding. A good variable selection technique balances a variety of qualities, including predictive power, interpretability and computational burden.
Choosing a variable selection method
How do you go about choosing a variable selection method along with a statistical model? An invaluable tool is cross-validation.
The idea is to intentionally set aside a random fraction of your data rows (typically 10 or 20 percent), use the remaining rows to select variables and fit a model, then test the predictions on the rows you set aside initially. While doing this once can be quite informative, it’s often advisable to repeat this process with different random subsets (e.g., k-fold cross-validation) a sufficient number of times to get an estimate of cross-validation performance variability as well. Compare this performance across a variety of variable selection and statistical modeling methods and choose the ones that perform best. Cross-validation is also a crucial piece of many variable selection techniques (like the lasso or trees), where it is used to tune the complexity of the final model.
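The procedure just described can be sketched in plain Python. The specifics here are assumptions chosen for brevity, not a recommended method: the selection step keeps the single predictor most correlated with y, the model is a one-variable least squares line, and the data are simulated so that only column 0 matters. The helper names (fit_simple, abs_corr, k_fold_cv) are hypothetical. The point to notice is that variable selection is repeated inside every fold using the training rows only, so the held-out rows never influence which variable is chosen.

```python
import random
import statistics

def fit_simple(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((v - mx) ** 2 for v in x)
    syy = sum((w - my) ** 2 for w in y)
    sxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    return abs(sxy) / (sxx * syy) ** 0.5

def k_fold_cv(X, y, k, rng):
    """Return (mean fold MSE, std. dev. of fold MSEs, variable chosen per fold)."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint sets of held-out rows
    fold_mse, chosen = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        y_tr = [y[i] for i in train]
        # Selection step: uses the training rows only
        best = max(range(len(X[0])),
                   key=lambda j: abs_corr([X[i][j] for i in train], y_tr))
        a, b = fit_simple([X[i][best] for i in train], y_tr)
        # Evaluation step: uses the held-out rows only
        errs = [(y[i] - (a + b * X[i][best])) ** 2 for i in held_out]
        fold_mse.append(statistics.fmean(errs))
        chosen.append(best)
    return statistics.fmean(fold_mse), statistics.stdev(fold_mse), chosen

# Simulated data (an assumption): 60 rows, 20 predictors, only column 0 drives y
rng = random.Random(1)
n, p = 60, 20
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] + rng.gauss(0, 0.5) for row in X]

cv_mean, cv_sd, chosen = k_fold_cv(X, y, k=5, rng=rng)
print(f"CV MSE: {cv_mean:.3f} (sd across folds {cv_sd:.3f}); chosen columns: {chosen}")
```

To compare methods, you would run the same folds with a different selection rule or model in place of the two marked steps and compare the resulting means and fold-to-fold spreads.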
One difficulty is that the number of possible combinations of variable selection and statistical modeling methods extends well into the millions, and it is not feasible to compute and compare them all. To navigate this large model space, it makes sense to utilize any domain expertise or prior experience you or a colleague may have with the data to select appropriate models. Besides that, it is usually a good idea to try a few combinations from each of the main variable selection and statistical modeling methods you have available, see which ones do best, then refine your search using variations with only those methods. You might even be able to exploit some design of experiments principles to help you explore the combinations more efficiently.
As you encounter larger and larger data sets, the importance of variable selection grows. It’s worth spending some time to investigate the various methods to determine which works the best in your domain of expertise.