What is multicollinearity?
The term multicollinearity refers to the condition in which two or more predictors are highly correlated with one another.
Why is multicollinearity a problem?
In a regression context, multicollinearity can make it difficult to determine the effect of each predictor on the response, and can make it challenging to determine which variables to include in the model. Multicollinearity can also cause other problems:
- The coefficients might be poorly estimated, or inflated.
- The coefficients might have signs that don’t make sense.
- The standard errors for these coefficients might be inflated.
For illustration, we take a look at a new example, Bodyfat. This data set includes measurements of 252 men. The goal of the study was to develop a model, based on physical measurements, to predict percent body fat. We focus on a subset of the potential predictors: Weight (in pounds), Height (in inches), and BMI (Body Mass Index).
Weight is highly correlated with BMI, and is moderately correlated with Height.
We fit a model to predict Fat% as a function of these three variables. BMI and Weight are significant, and Height is borderline significant.
But BMI is a function of both Weight and Height. So, there is some redundant information in these predictors.
What happens if we remove BMI from the model?
Notice how the parameter estimates for Weight and Height have changed.
- The coefficient for Weight changed from negative to positive.
- The coefficient for Height changed from positive to negative.
Both Weight and Height are also now highly significant. Another dramatic change is in the accuracy of the estimates. The standard errors for Weight and Height are much larger in the model containing BMI.
Detecting multicollinearity with the variance inflation factor (VIF)
When we fit a model, how do we know if we have a problem with multicollinearity? As we've seen, a scatterplot matrix can point to pairs of variables that are correlated. But multicollinearity can also occur between many variables, and this might not be apparent in bivariate scatterplots.
One method for detecting whether multicollinearity is a problem is to compute the variance inflation factor, or VIF. This is a measure of how much the standard error of the estimate of the coefficient is inflated due to multicollinearity. The VIF for a predictor is calculated using this formula.
For a given predictor variable, a regression model is fit using that variable as the response and all the other variables as predictors. The RSquare for this model is calculated, and the VIF is computed. This is repeated for all predictors. The smallest possible value of VIF is 1.0, indicating a complete absence of multicollinearity. Statisticians use the term orthogonal to refer to variables that are completely uncorrelated with one another.
A VIF for a predictor of 10.0 corresponds to an RSquare value of 0.90. Likewise, a VIF of 100 corresponds to an RSquare of 0.99. This would mean that the other predictors explain 99% of the variation in the given predictor. In most cases, there will be some amount of multicollinearity. As a rule of thumb, a VIF of 5 or 10 indicates that the multicollinearity might be problematic. In our example, the VIFs are all very high, indicating that multicollinearity is indeed an issue.
After we remove BMI from the model, the VIFs are now very low.
In some cases, multicollinearity can be resolved by removing a redundant term from the model. In more severe cases, simply removing a term will not address the issue. In these cases, more advanced techniques such as principal component analysis (PCA) or partial least squares (PLS) might be appropriate. Other modeling approaches, such as tree-based methods and penalized regression, are also recommended.