Model Validation
What is model validation or honest assessment in predictive modeling?
Model validation, called honest assessment by statisticians, is a method of determining if a predictive model is generalizable to new data. Sometimes new data are collected, the model is applied to the new data, and then the predictive ability of the model is assessed to see if it can be used for future data. More often, it is difficult or impossible to collect new data for validation, so the original data are split into partitions, either randomly or stratified across the response variable. Models are then fit on the training partition and compared using the validation partition.
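To make the splitting step concrete, here is a minimal sketch of a stratified training/validation split in Python. The DataFrame `df`, the response column name, and the 70/30 proportions are assumptions of the example, not requirements of the method.

```python
# A minimal sketch of a stratified training/validation split.
# Assumptions of the example: a pandas DataFrame `df`, a categorical
# response column named "y", and a 70/30 split.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_for_validation(df: pd.DataFrame, response: str = "y", seed: int = 42):
    """Return training and validation partitions, stratified on the response."""
    train, valid = train_test_split(
        df,
        test_size=0.30,          # 30% of rows held out for model comparison
        stratify=df[response],   # keep the response distribution similar in both partitions
        random_state=seed,       # make the random split reproducible
    )
    return train, valid
```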
What is the bias/variance tradeoff?
Consider the data in the figure below, which has one response variable and one predictor variable. A model that is too simple, like the straight line, yields biased predictions: there are areas of the predictor space where the predictions are wrong on average. A model that is too complex fits the noise in the training data as well as the underlying pattern; this is referred to as overfitting, and an overfit model also fails to generate good predictions for new data. Simply put, underfit models give predictions with high bias; overfit models give predictions with high variance. Finding the right balance of model complexity is critical to obtaining a good predictive model.
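One way to see the tradeoff numerically is to fit models of increasing flexibility and compare their training and validation errors, as in the sketch below. The simulated data and the polynomial degrees are assumptions of the example, not the data in the figure.

```python
# A rough sketch of the bias/variance tradeoff using polynomial fits on
# simulated data. The data-generating curve, noise level, and degrees are
# assumptions of the example, not the data shown in the figure.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, 60))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=x.size)   # true signal plus noise

# Hold out part of the data to measure how well each fit generalizes.
is_train = rng.random(x.size) < 0.7
x_tr, y_tr = x[is_train], y[is_train]
x_va, y_va = x[~is_train], y[~is_train]

for degree in (1, 3, 12):   # too simple, about right, too flexible
    coefs = np.polyfit(x_tr, y_tr, degree)
    rmse_tr = np.sqrt(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
    rmse_va = np.sqrt(np.mean((np.polyval(coefs, x_va) - y_va) ** 2))
    print(f"degree {degree:2d}: training RMSE {rmse_tr:.2f}, validation RMSE {rmse_va:.2f}")

# Typical result: the degree-1 fit is biased (both errors high), while the
# degree-12 fit has high variance (low training error, higher validation error).
```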
Methods of honest assessment
Holdout validation, also called stopped training or holdback validation, uses 50-80% of the data for training (model fitting), 20-50% of the data for validation (model comparison), and 0-30% of the data for testing (assessing model performance).
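The sketch below shows one way such a three-way split could be made, using a 60/20/20 allocation as an arbitrary choice within the ranges above.

```python
# A minimal sketch of a 60/20/20 holdout split, one arbitrary choice within
# the ranges above. Works on row indices so it applies to any data structure.
import numpy as np

def holdout_indices(n_rows: int, seed: int = 42):
    """Shuffle row indices and split them into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_train = int(0.60 * n_rows)
    n_valid = int(0.20 * n_rows)
    train = idx[:n_train]                     # model fitting
    valid = idx[n_train:n_train + n_valid]    # model comparison
    test = idx[n_train + n_valid:]            # final assessment of the chosen model
    return train, valid, test
```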
K-fold cross validation can be used if there are not enough data for a good-sized holdout set. Here, you divide the data randomly into k groups. Designate one group as the validation set, fit the model on the remaining groups, and calculate fit statistics on the held-out group. Rotate the roles of the training groups and the holdout group until every group has been held out once. Finally, combine the statistics across the folds, choose the best model, or both.
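A bare-bones sketch of this rotation follows. Here `make_model` and `fit_statistic` are hypothetical placeholders for whatever model and fit statistic you are comparing, and X and y are assumed to be NumPy arrays.

```python
# A bare-bones sketch of k-fold cross validation. `make_model` and
# `fit_statistic` are hypothetical placeholders for whatever model and
# fit statistic are being compared; X and y are assumed to be NumPy arrays.
import numpy as np

def k_fold_statistics(X, y, make_model, fit_statistic, k=5, seed=42):
    """Rotate each of k random groups through the validation role and collect fit statistics."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k random groups of row indices
    stats = []
    for i in range(k):
        valid_idx = folds[i]                                              # held-out group
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model()
        model.fit(X[train_idx], y[train_idx])                             # fit on the other groups
        stats.append(fit_statistic(y[valid_idx], model.predict(X[valid_idx])))
    return np.mean(stats), stats          # combined statistic and the per-fold values
```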
Model fit graphs
For a continuous response, the actual by predicted plot shows the bias and variance of the model fit. A perfect model has points on the line y = x. Deviations from the perfect model occur because of model bias and model variance as shown in the figures below. The black line represents y = x; the orange line shows the trend of model bias.
Figure 4: Actual by predicted plot showing no model bias.
Figure 5: Actual by predicted plot showing linear bias.
Figure 6: Actual by predicted plot showing nonlinear bias.
Figure 7: Actual by predicted plot showing smaller unexplained variation.
Figure 8: Actual by predicted plot showing larger unexplained variation.
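For reference, a minimal sketch of how an actual by predicted plot like those above could be drawn, assuming NumPy arrays of actual and predicted responses from some fitted model:

```python
# A minimal sketch of an actual by predicted plot, assuming NumPy arrays of
# observed responses (y_actual) and model predictions (y_predicted).
import numpy as np
import matplotlib.pyplot as plt

def actual_by_predicted_plot(y_actual, y_predicted):
    """Scatter actual versus predicted values with the y = x reference line."""
    fig, ax = plt.subplots()
    ax.scatter(y_predicted, y_actual, alpha=0.6)
    lims = [min(y_predicted.min(), y_actual.min()),
            max(y_predicted.max(), y_actual.max())]
    ax.plot(lims, lims, color="black")      # the perfect-model line y = x
    ax.set_xlabel("Predicted response")
    ax.set_ylabel("Actual response")
    return ax
```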
For a categorical response, the receiver operating characteristic curve, or ROC curve, can help evaluate how well the model predicts the observed responses. It is based on the sensitivity and specificity of the model.
Suppose you are modeling a response with two levels: pass and fail. You are interested in the probability of failure, so a failure is considered the “positive” outcome. There are four possibilities when evaluating the model for each observation:
- A failure is correctly classified as a failure. This is a true positive.
- A failure is misclassified as a pass. This is a false negative.
- A pass is correctly classified as a pass. This is a true negative.
- A pass is misclassified as a failure. This is a false positive.
You calculate sensitivity, or the true positive rate, by dividing the number of true positives (failures correctly classified as failures) by the total number of actual failures. You calculate specificity by dividing the number of true negatives (passes correctly classified as passes) by the total number of actual passes. The false positive rate is then 1 – specificity. The ROC curve plots sensitivity (true positive rate) against 1 – specificity (false positive rate).
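The sketch below computes these three quantities from observed and predicted classes. Coding failures as 1 (the positive outcome) and passes as 0 is an assumption of the example.

```python
# A small sketch of sensitivity, specificity, and the false positive rate,
# assuming failures are coded 1 (the "positive" outcome) and passes are coded 0.
import numpy as np

def classification_rates(actual, predicted):
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    true_pos = np.sum((actual == 1) & (predicted == 1))    # failures classified as failures
    false_neg = np.sum((actual == 1) & (predicted == 0))   # failures classified as passes
    true_neg = np.sum((actual == 0) & (predicted == 0))    # passes classified as passes
    false_pos = np.sum((actual == 0) & (predicted == 1))   # passes classified as failures
    sensitivity = true_pos / (true_pos + false_neg)        # true positive rate
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity, 1 - specificity       # last value is the false positive rate
```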
To create an ROC curve, first sort the observations by predicted probability of failure, from highest to lowest. Then step through the sorted data. For each actual failure, move up one step on the Y axis; for each actual pass, move one step to the right on the X axis.
The ideal model separates failures from passes perfectly, so the ideal ROC curve rises vertically along the Y axis to the upper left corner and then runs horizontally across the top of the plot.
If the model classifies no better than flipping a coin, the ROC curve looks like a 45-degree line. The more the ROC curve arches up from the 45-degree line, the more accurately the model predicts compared to the random model.
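A compact sketch of this construction, assuming failures coded 1, passes coded 0, and an array of predicted failure probabilities:

```python
# A compact sketch of the ROC construction described above: sort by predicted
# probability of failure, step up for each actual failure, step right for each
# actual pass. Assumes failures coded 1 and passes coded 0; ties in the
# predicted probabilities are ignored for simplicity.
import numpy as np

def roc_points(actual, prob_failure):
    order = np.argsort(prob_failure)[::-1]      # highest predicted probability first
    actual = np.asarray(actual)[order]
    n_pos = actual.sum()                        # total actual failures
    n_neg = len(actual) - n_pos                 # total actual passes
    tpr = np.cumsum(actual) / n_pos             # Y axis: sensitivity after each step
    fpr = np.cumsum(1 - actual) / n_neg         # X axis: 1 - specificity after each step
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

# The area under this curve (the AUC statistic below) can be approximated
# with np.trapz(tpr, fpr).
```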
Model fit statistics
| Fit statistic | Type of response | Description | Interpretation |
| --- | --- | --- | --- |
| R-squared or generalized R-squared | Continuous or Categorical | The ratio of the variability in the response explained by the model to the total variability in the response | R² = 0.82 means that the model, using the predictor variables, explains 82% of the variability in the response |
| RMSE | Continuous | Square root of the mean squared error | A measure of the noise after fitting the model, measured in the same units as the response |
| AAE | Continuous | Average absolute error | A measure of the noise after fitting the model, measured in the same units as the response |
| MAPE | Continuous | Mean absolute percentage error | A measure of the noise after fitting the model, measured as a percentage |
| RASE | Continuous | Square root of the average squared error | A measure of the noise after fitting the model, measured in the same units as the response |
| Misclassification rate | Categorical | The ratio of the number of misclassified observations to the total number of observations | How often the response category with the highest fitted probability does not match the observed category |
| Accuracy | Categorical | 1 – misclassification rate | How often the response category with the highest fitted probability matches the observed category |
| Confusion matrix | Categorical | Number or proportion of observations in each observed and predicted category | Classification of actual response levels and predicted response levels |
| AUC | Categorical | Area under the ROC curve | AUC is between 0 and 1; higher values are better |
| AICc | Continuous or Categorical | Akaike’s information criterion, a decision theoretic criterion based on penalized likelihood | For more information, see A deeper dive into likelihood. |
| BIC | Continuous or Categorical | Bayesian information criterion, a decision theoretic criterion based on penalized likelihood | For more information, see A deeper dive into likelihood. |
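Several of these statistics follow directly from the residuals or the predicted classes. A brief sketch, with illustrative function names:

```python
# Brief sketches of a few of the statistics above, assuming NumPy arrays of
# actual and predicted values (continuous case) or classes (categorical case).
# Function names are only for illustration.
import numpy as np

def rase(actual, predicted):
    """Square root of the average squared error, in the units of the response."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

def aae(actual, predicted):
    """Average absolute error, in the units of the response."""
    return np.mean(np.abs(actual - predicted))

def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual values are zero."""
    return 100 * np.mean(np.abs((actual - predicted) / actual))

def misclassification_rate(actual_class, predicted_class):
    """Share of observations whose predicted category does not match the observed one."""
    return np.mean(np.asarray(actual_class) != np.asarray(predicted_class))
```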