Process Description

Cross Validation Model Comparison

How do you determine which statistical model is most appropriate for making predictions from your data set? One way to begin addressing this problem is simply to fit different models to your data, one at a time, and compare the results. However, with very wide data sets it is easy to overfit, so some type of cross validation is recommended. In addition, given the huge number of possible modeling variations, it can quickly become overwhelming to compare more than just a few models.

The Cross Validation Model Comparison process enables you to compare cross validation statistics for an arbitrary collection of predictive models and determine which models are best suited for prediction from that particular data set. Cross validation consists of dividing the rows of a wide data set into two groups, labeling one the test set and the other the training set, and then, after setting aside the test set, fitting one or more predictive models only to the training set. The fitted models derived from the training set are then evaluated with the predictor variables of the test set to obtain predicted values, which are compared to the observed values. This process is repeated a specified number of times, using a different training/test division each time, and the results are summarized and displayed side by side.

The cross validation is conducted “honestly”. That is, predictor reduction and model fitting are performed anew, on each training set, as if the test set had never been observed. Only at the end of each iteration are the true values of the dependent variable in the test set used to assess how well the model performed.
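As a rough illustration of this honest scheme, the sketch below uses Python with scikit-learn (not the SAS code that JMP Genomics generates), with arbitrary example models and simulated data. The key point it demonstrates is the same: predictor reduction is built into each candidate model, so it is re-estimated on every training set, and the observed test values are used only after fitting, to score the predictions.

# A minimal sketch of honest cross validation model comparison, assuming a
# wide numeric matrix X (rows = samples, columns = predictors) and a
# categorical dependent variable y. The model names and scikit-learn
# estimators are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Wide example data: many more predictors than rows.
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)

# Candidate predictive models; predictor reduction (SelectKBest) is part of
# each pipeline, so it is redone on every training set.
models = {
    "logistic": Pipeline([("reduce", SelectKBest(f_classif, k=20)),
                          ("fit", LogisticRegression(max_iter=1000))]),
    "svm":      Pipeline([("reduce", SelectKBest(f_classif, k=20)),
                          ("fit", SVC())]),
}

# Repeat the training/test division a specified number of times.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = {name: [] for name in models}

for train_idx, test_idx in splitter.split(X, y):
    for name, model in models.items():
        # Fit only to the training rows; the test rows are set aside.
        model.fit(X[train_idx], y[train_idx])
        # Only now are the observed test values used, to score predictions.
        scores[name].append(model.score(X[test_idx], y[test_idx]))

# Side-by-side summary of cross validation accuracy for each model.
for name, s in scores.items():
    print(f"{name}: mean accuracy {np.mean(s):.3f} (sd {np.std(s):.3f})")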

What do I need?

To run the Cross Validation Model Comparison, your Input Data Set must be in the wide format. The appropriate data import engine as well as each of the predictive modeling processes to be used in the comparison must be configured and the settings saved in one settings folder. Finally, an output folder must be created, into which all of the resulting data sets, analyses, graphics, and other output are placed.

It is assumed that you are familiar with the Introduction to Predictive Modeling processes, have settled upon one or more of them to compare, and have saved specific settings (see Saving and Loading Settings) for each of the runs to be cross validated.

Important: The input data set and dependent variable must be identical for all predictive modeling settings to be compared. The Mode parameter (found on the Analysis tab) for each process must be set to Automated to allow processing with SAS code rather than using the interactive JMP mode. The Prior Probabilities / Prevalences parameters must also all be identical since they influence how performance statistics are computed.
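To see why the prior settings must match, consider one common way prevalences enter a performance statistic: an overall misclassification rate computed as a prevalence-weighted average of the per-class error rates. The short sketch below (plain Python, purely illustrative and not JMP Genomics code) shows that identical per-class error rates yield different overall rates under different assumed prevalences, so models scored under different priors would not be directly comparable.

# Illustration: the same per-class error rates give different overall
# misclassification rates when different prevalences (priors) are assumed.
def prior_weighted_error(per_class_error, priors):
    """Overall misclassification rate as a prior-weighted average
    of the per-class error rates."""
    return sum(e * p for e, p in zip(per_class_error, priors))

per_class_error = [0.10, 0.30]  # error rate in class 1, class 2

print(f"{prior_weighted_error(per_class_error, [0.5, 0.5]):.2f}")  # 0.20
print(f"{prior_weighted_error(per_class_error, [0.9, 0.1]):.2f}")  # 0.12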

A saved setting can be edited either in the dialog for that process or in the Cross Validation Model Comparison process itself. If you are not familiar with the individual processes that you want to use, consult the specific chapters for those processes for more information.

Important: Both the model comparison and respective main method setting files for any sample settings that you run must be placed in your user WorkflowResults folder before you run them. If you ever clear this folder, you should replenish it with the setting files from the Settings folder.

For detailed information about the files and data sets used or created by JMP Genomics software, see Files and Data Sets.

Output/Results

The output generated by this process is summarized in a Tabbed report. Refer to the Cross Validation Model Comparison output documentation for detailed descriptions and guides to interpreting your results.