Publication date: 08/13/2020

Validation

If you build a tree with enough splits, partitioning can overfit the data. When this happens, the model predicts the data used to build it very well, but predicts future observations poorly. Validation is the process of using one part of a data set to estimate model parameters and the other part to assess the predictive ability of the model.

The training set is the part that is used to estimate model parameters.

The validation set is the part that assesses or validates the predictive ability of the model.

The test set is a final, independent assessment of the model’s predictive ability. The test set is available only when using a validation column. See Launch the Partition Platform.

When a validation method is used, the Go button appears. The Go button performs repeated splits so that you do not have to click the Split button repeatedly. When you click the Go button, splitting continues until the validation R-Square is better than what the next 10 splits would obtain. This rule can result in complex trees that are not very interpretable, but that have good predictive power.

Using the Go button turns on the Split History command. If using the Go button results in a tree with more than 40 nodes, the Show Tree command is turned off.

The training, validation, and test sets are created by subsetting the original data into parts. Select one of the following methods to subset a data set:

Excluded Rows

Uses row states to subset the data. Unexcluded rows are used as the training set, and excluded rows are used as the validation set.

For more information about using row states and how to exclude rows, see Hide and Exclude Rows in Using JMP.

Holdback

Randomly divides the original data into the training and validation data sets. The Validation Portion on the platform launch window specifies the proportion of the original data to use as the validation (holdback) set. See Launch the Partition Platform for more information about the Validation Portion.
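Conceptually, a holdback split amounts to shuffling the row indices and setting aside the specified proportion for validation. The following is a minimal Python sketch of that idea only, not JMP's implementation; the function name `holdback_split` and its arguments are hypothetical:

```python
import random

def holdback_split(n_rows, validation_portion, seed=0):
    """Randomly assign row indices to training and validation sets.

    Illustrative sketch of a holdback split; `validation_portion`
    plays the role of the Validation Portion on the launch window.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_valid = round(n_rows * validation_portion)
    validation = sorted(indices[:n_valid])
    training = sorted(indices[n_valid:])
    return training, validation

# For example, hold back 25% of 100 rows for validation.
train, valid = holdback_split(100, 0.25)
```

Every row lands in exactly one of the two sets, so the training and validation sets are disjoint and together cover the original data.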

KFold Crossvalidation

Randomly divides the original data into K subsets. In turn, each of the K subsets is used to validate the model that is fit on the remaining data, fitting a total of K models. The final model is selected based on the cross-validation R-Square, where a stopping rule is imposed to avoid overfitting the model. This method is useful for small data sets, because it makes efficient use of limited amounts of data. See K-Fold Crossvalidation.
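The partitioning step behind K-fold crossvalidation can be sketched in a few lines of Python. This is an illustration of the general technique, not JMP's internal code; `kfold_indices` is a hypothetical name:

```python
import random

def kfold_indices(n_rows, k, seed=0):
    """Partition row indices into K roughly equal folds.

    Illustrative sketch of K-fold partitioning, not JMP's code.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin into K folds.
    return [indices[i::k] for i in range(k)]

# Each fold serves once as the validation set; the remaining rows
# form the training set, so K models are fit in total.
for fold in kfold_indices(20, 5):
    validation = set(fold)
    training = [row for row in range(20) if row not in validation]
    # Here a tree would be fit on `training` and assessed on `fold`
    # using the validation R-Square.
```

Because every row is used for validation exactly once and for training K-1 times, the method makes efficient use of a small data set.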

Validation Column

Uses a column’s values to divide the data into subsets. A validation column must contain at most three distinct numeric values. The column is assigned using the Validation role on the Partition launch window. See Launch the Partition Platform.

The column’s values determine how the data is split:

If the validation column has two levels, the smaller value defines the training set and the larger value defines the validation set.

If the validation column has three levels, the values, in order of increasing size, define the training, validation, and test sets.
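The rules above can be sketched as a small Python function that maps each row's validation-column value to its set. This is an illustration of the rule only; `split_by_validation_column` is a hypothetical name, not a JMP function:

```python
def split_by_validation_column(values):
    """Map validation-column values to set labels (illustrative sketch).

    Two distinct values: smaller -> training, larger -> validation.
    Three distinct values, in increasing order:
    training, validation, test.
    """
    levels = sorted(set(values))
    if len(levels) == 2:
        labels = ["training", "validation"]
    elif len(levels) == 3:
        labels = ["training", "validation", "test"]
    else:
        raise ValueError("Column must have two or three distinct values")
    mapping = dict(zip(levels, labels))
    return [mapping[v] for v in values]

split_by_validation_column([0, 1, 0, 2, 1])
# -> ['training', 'validation', 'training', 'test', 'validation']
```

Note that only the ordering of the values matters, not the values themselves: a column coded 3/5 splits the same way as one coded 0/1.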

If you click the Validation button with no columns selected in the Select Columns list, you can add a validation column to your data table. For more information about the Make Validation Column utility, see Make Validation Column.

Want more information? Have questions? Get answers in the JMP User Community (community.jmp.com).