Model Validation
What is model validation or honest assessment in predictive modeling?
Model validation, called honest assessment by statisticians, is a method of determining if a predictive model is generalizable to new data. Sometimes new data are collected, the model is applied to the new data, and then the predictive ability of the model is assessed to see if it can be used for future data. More often, it is difficult or impossible to collect new data for validation, so the original data are split into partitions, either randomly or stratified across the response variable. Models are then fit on the training partition and compared using the validation partition.
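To make the splitting step concrete, here is a minimal sketch of a stratified training/validation split in Python. The DataFrame `df`, the response column name, and the 70/30 proportions are assumptions of the example, not requirements of the method.

```python
# A minimal sketch of a stratified training/validation split.
# Assumptions of the example: a pandas DataFrame `df`, a categorical
# response column named "y", and a 70/30 split.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_for_validation(df: pd.DataFrame, response: str = "y", seed: int = 42):
    """Return training and validation partitions, stratified on the response."""
    train, valid = train_test_split(
        df,
        test_size=0.30,          # 30% of rows held out for model comparison
        stratify=df[response],   # keep the response distribution similar in both partitions
        random_state=seed,       # make the random split reproducible
    )
    return train, valid
```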
What is the bias/variance tradeoff?
Consider the data in the figure below, which has one response variable and one predictor variable. A model that is too simple, like the straight line, yields biased predictions: there are areas of the predictor space where the predictions are wrong on average. A model that is too complex fits the noise in the training data as well as the underlying pattern; this is referred to as overfitting, and an overfit model also fails to generate good predictions for new data. Simply put, underfit models give predictions with high bias; overfit models give predictions with high variance. Finding the right balance of model complexity is critical to obtaining a good predictive model.
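One way to see the tradeoff numerically is to fit models of increasing flexibility and compare their training and validation errors, as in the sketch below. The simulated data and the polynomial degrees are assumptions of the example, not the data in the figure.

```python
# A rough sketch of the bias/variance tradeoff using polynomial fits on
# simulated data. The data-generating curve, noise level, and degrees are
# assumptions of the example, not the data shown in the figure.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, 60))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=x.size)   # true signal plus noise

# Hold out part of the data to measure how well each fit generalizes.
is_train = rng.random(x.size) < 0.7
x_tr, y_tr = x[is_train], y[is_train]
x_va, y_va = x[~is_train], y[~is_train]

for degree in (1, 3, 12):   # too simple, about right, too flexible
    coefs = np.polyfit(x_tr, y_tr, degree)
    rmse_tr = np.sqrt(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
    rmse_va = np.sqrt(np.mean((np.polyval(coefs, x_va) - y_va) ** 2))
    print(f"degree {degree:2d}: training RMSE {rmse_tr:.2f}, validation RMSE {rmse_va:.2f}")

# Typical result: the degree-1 fit is biased (both errors high), while the
# degree-12 fit has high variance (low training error, higher validation error).
```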
Methods of honest assessment
Holdout validation, also called stopped training or holdback validation, uses 50-80% of the data for training (model fitting), 20-50% of the data for validation (model comparison), and 0-30% of the data for testing (assessing model performance).
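The sketch below shows one way such a three-way split could be made, using a 60/20/20 allocation as an arbitrary choice within the ranges above.

```python
# A minimal sketch of a 60/20/20 holdout split, one arbitrary choice within
# the ranges above. Works on row indices so it applies to any data structure.
import numpy as np

def holdout_indices(n_rows: int, seed: int = 42):
    """Shuffle row indices and split them into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_train = int(0.60 * n_rows)
    n_valid = int(0.20 * n_rows)
    train = idx[:n_train]                     # model fitting
    valid = idx[n_train:n_train + n_valid]    # model comparison
    test = idx[n_train + n_valid:]            # final assessment of the chosen model
    return train, valid, test
```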
K-fold cross validation can be used if there are not enough data for a good-sized holdout set. Here, you divide the data randomly into k groups. Designate one group as the validation set, fit the model on the remaining groups, and calculate fit statistics on the held-out group. Rotate the roles of the training groups and the holdout group until every group has been held out once. Finally, combine the statistics across the folds, choose the best model, or both.
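A bare-bones sketch of this rotation follows. Here `make_model` and `fit_statistic` are hypothetical placeholders for whatever model and fit statistic you are comparing, and X and y are assumed to be NumPy arrays.

```python
# A bare-bones sketch of k-fold cross validation. `make_model` and
# `fit_statistic` are hypothetical placeholders for whatever model and
# fit statistic are being compared; X and y are assumed to be NumPy arrays.
import numpy as np

def k_fold_statistics(X, y, make_model, fit_statistic, k=5, seed=42):
    """Rotate each of k random groups through the validation role and collect fit statistics."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k random groups of row indices
    stats = []
    for i in range(k):
        valid_idx = folds[i]                                              # held-out group
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model()
        model.fit(X[train_idx], y[train_idx])                             # fit on the other groups
        stats.append(fit_statistic(y[valid_idx], model.predict(X[valid_idx])))
    return np.mean(stats), stats          # combined statistic and the per-fold values
```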
Model fit graphs
For a continuous response, the actual by predicted plot shows the bias and variance of the model fit. A perfect model has points on the line y = x. Deviations from the perfect model occur because of model bias and model variance as shown in the figures below. The black line represents y = x; the orange line shows the trend of model bias.
Figure 4: Actual by predicted plot showing no model bias.
Figure 5: Actual by predicted plot showing linear bias.
Figure 6: Actual by predicted plot showing nonlinear bias.
Figure 7: Actual by predicted plot showing smaller unexplained variation.
Figure 8: Actual by predicted plot showing larger unexplained variation.
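For reference, a minimal sketch of how an actual by predicted plot like those above could be drawn, assuming NumPy arrays of actual and predicted responses from some fitted model:

```python
# A minimal sketch of an actual by predicted plot, assuming NumPy arrays of
# observed responses (y_actual) and model predictions (y_predicted).
import numpy as np
import matplotlib.pyplot as plt

def actual_by_predicted_plot(y_actual, y_predicted):
    """Scatter actual versus predicted values with the y = x reference line."""
    fig, ax = plt.subplots()
    ax.scatter(y_predicted, y_actual, alpha=0.6)
    lims = [min(y_predicted.min(), y_actual.min()),
            max(y_predicted.max(), y_actual.max())]
    ax.plot(lims, lims, color="black")      # the perfect-model line y = x
    ax.set_xlabel("Predicted response")
    ax.set_ylabel("Actual response")
    return ax
```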
For a categorical response, the receiver operating characteristic curve, or ROC curve, can help evaluate how well the model predicts the observed responses. It is based on the sensitivity and specificity of the model.
Suppose you are modeling a response with two levels: pass and fail. You are interested in the probability of failure, so a failure is considered the “positive” outcome. There are four possibilities when evaluating the model for each observation:
- A failure is correctly classified as a failure. This is a true positive.
- A failure is misclassified as a pass. This is a false negative.
- A pass is correctly classified as a pass. This is a true negative.
- A pass is misclassified as a failure. This is a false positive.
You calculate sensitivity, or the true positive rate, by dividing the number of true positives (failures correctly classified as failures) by the total number of actual failures. You calculate specificity by dividing the number of true negatives (passes correctly classified as passes) by the total number of actual passes. The false positive rate is then 1 – specificity. The ROC curve plots sensitivity (true positive rate) against 1 – specificity (false positive rate).
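The sketch below computes these three quantities from observed and predicted classes. Coding failures as 1 (the positive outcome) and passes as 0 is an assumption of the example.

```python
# A small sketch of sensitivity, specificity, and the false positive rate,
# assuming failures are coded 1 (the "positive" outcome) and passes are coded 0.
import numpy as np

def classification_rates(actual, predicted):
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    true_pos = np.sum((actual == 1) & (predicted == 1))    # failures classified as failures
    false_neg = np.sum((actual == 1) & (predicted == 0))   # failures classified as passes
    true_neg = np.sum((actual == 0) & (predicted == 0))    # passes classified as passes
    false_pos = np.sum((actual == 0) & (predicted == 1))   # passes classified as failures
    sensitivity = true_pos / (true_pos + false_neg)        # true positive rate
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity, 1 - specificity       # last value is the false positive rate
```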
To create an ROC curve, first sort the observations by predicted probability of failure, from highest to lowest. Then step through the sorted data. For each actual failure, move up one step on the Y axis; for each actual pass, move one step to the right on the X axis.
The ideal model separates failures from passes perfectly, so the ideal ROC curve rises vertically along the Y axis to the upper left corner and then runs horizontally across the top of the plot.
If the model classifies no better than flipping a coin, the ROC curve looks like a 45-degree line. The more the ROC curve arches up from the 45-degree line, the more accurately the model predicts compared to the random model.
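A compact sketch of this construction, assuming failures coded 1, passes coded 0, and an array of predicted failure probabilities:

```python
# A compact sketch of the ROC construction described above: sort by predicted
# probability of failure, step up for each actual failure, step right for each
# actual pass. Assumes failures coded 1 and passes coded 0; ties in the
# predicted probabilities are ignored for simplicity.
import numpy as np

def roc_points(actual, prob_failure):
    order = np.argsort(prob_failure)[::-1]      # highest predicted probability first
    actual = np.asarray(actual)[order]
    n_pos = actual.sum()                        # total actual failures
    n_neg = len(actual) - n_pos                 # total actual passes
    tpr = np.cumsum(actual) / n_pos             # Y axis: sensitivity after each step
    fpr = np.cumsum(1 - actual) / n_neg         # X axis: 1 - specificity after each step
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

# The area under this curve (the AUC statistic below) can be approximated
# with np.trapz(tpr, fpr).
```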
Model fit statistics
| Fit statistic | Type of response | Description | Interpretation |
| --- | --- | --- | --- |
| R-squared or generalized R-squared | Continuous or Categorical | The ratio of the variability in the response explained by the model to the total variability in the response | R² = 0.82 means that the model, using the predictor variables, explains 82% of the variability in the response |
| RMSE | Continuous | Square root of the mean squared error | A measure of the noise after fitting the model, measured in the same units as the response |
| AAE | Continuous | Average absolute error | A measure of the noise after fitting the model, measured in the same units as the response |
| MAPE | Continuous | Mean absolute percentage error | A measure of the noise after fitting the model, measured as a percentage |
| RASE | Continuous | Square root of the average squared error | A measure of the noise after fitting the model, measured in the same units as the response |
| Misclassification rate | Categorical | The ratio of the number of misclassified observations to the total number of observations | How often the response category with the highest fitted probability does not match the observed category |
| Accuracy | Categorical | 1 – misclassification rate | How often the response category with the highest fitted probability matches the observed category |
| Confusion matrix | Categorical | Number or proportion of observations in each observed and predicted category | Classification of actual response levels and predicted response levels |
| AUC | Categorical | Area under the ROC curve | AUC is between 0 and 1; higher values are better |
| AICc | Continuous or Categorical | Akaike’s information criterion, a decision theoretic criterion based on penalized likelihood | For more information, see A deeper dive into likelihood. |
| BIC | Continuous or Categorical | Bayesian information criterion, a decision theoretic criterion based on penalized likelihood | For more information, see A deeper dive into likelihood. |
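Several of these statistics follow directly from the residuals or the predicted classes. A brief sketch, with illustrative function names:

```python
# Brief sketches of a few of the statistics above, assuming NumPy arrays of
# actual and predicted values (continuous case) or classes (categorical case).
# Function names are only for illustration.
import numpy as np

def rase(actual, predicted):
    """Square root of the average squared error, in the units of the response."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

def aae(actual, predicted):
    """Average absolute error, in the units of the response."""
    return np.mean(np.abs(actual - predicted))

def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual values are zero."""
    return 100 * np.mean(np.abs((actual - predicted) / actual))

def misclassification_rate(actual_class, predicted_class):
    """Share of observations whose predicted category does not match the observed one."""
    return np.mean(np.asarray(actual_class) != np.asarray(predicted_class))
```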