Validation in Partition

Validation is important for partition models because they are easily overfit. An overfit model predicts the data used to build it very well, but predicts future observations poorly. Validation is the process of using part of a data set to estimate model parameters, and using the other part to assess the predictive ability of the model. For more information about validation, see Validation in JMP Modeling.

In Partition, when a validation method is used, the Go button appears. The Go button performs repeated splitting automatically, so you do not have to click the Split button for each split. When you click the Go button, splitting occurs until the validation RSquare is better than what the next 10 splits would obtain. This rule can result in complex trees that are not very interpretable, but have good predictive power.
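One way to read the Go button's stopping rule is as a look-ahead search over the validation RSquare history. The following Python sketch is illustrative only and is not JMP's implementation: the function name `go_stopping_point`, the `lookahead` parameter, and the treatment of ties and of histories shorter than 10 further splits are all assumptions.

```python
def go_stopping_point(validation_r2, lookahead=10):
    """Return the number of splits at which a look-ahead rule would stop.

    validation_r2[k] is the validation RSquare after k splits
    (index 0 means no splits). The rule stops at the first k whose
    value is at least as good as each of the next `lookahead` values.
    Ties and short histories are handled by assumption, not by any
    documented JMP behavior.
    """
    if not validation_r2:
        raise ValueError("validation_r2 must contain at least one value")
    for k, current in enumerate(validation_r2):
        # Values that the next `lookahead` splits would obtain.
        window = validation_r2[k + 1 : k + 1 + lookahead]
        if all(current >= later for later in window):
            return k
    return len(validation_r2) - 1


# Hypothetical validation RSquare values, one per number of splits.
history = [0.0, 0.3, 0.5, 0.48, 0.49, 0.47, 0.46,
           0.45, 0.44, 0.43, 0.42, 0.41, 0.40]
stop = go_stopping_point(history)
```

In this hypothetical history, the value after two splits is not beaten by any of the following 10 splits, so the sketch stops there.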

Using the Go button turns on the Split History command. If using the Go button results in a tree with more than 40 nodes, the Show Tree command is turned off.

Select one of the following validation methods:

Excluded Rows

Uses row states to subset the data. Rows that are unexcluded are used as the training set, and excluded rows are used as the validation set.

For more information about using row states and how to exclude rows, see Hide and Exclude Rows in Data Tables in Using JMP.

Holdback

Randomly divides the original data into the training and validation data sets. The Validation Portion on the platform launch window is used to specify the proportion of the original data to use as the validation data set (holdback). See Launch the Partition Platform for more information about the Validation Portion.
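The idea behind a holdback split can be sketched in a few lines of Python. This is a minimal illustration that assumes an exact-count random sample. JMP's own random assignment might differ, and the names `holdback_split`, `validation_portion`, and `seed` are hypothetical.

```python
import random


def holdback_split(n_rows, validation_portion, seed=None):
    """Randomly split row indices into training and validation sets.

    validation_portion is the proportion of rows held back for
    validation. Rounding to a whole number of rows is an assumption
    of this sketch.
    """
    if not 0 <= validation_portion < 1:
        raise ValueError("validation_portion must be in [0, 1)")
    rng = random.Random(seed)
    n_validation = round(n_rows * validation_portion)
    held_back = set(rng.sample(range(n_rows), n_validation))
    training = [i for i in range(n_rows) if i not in held_back]
    validation = sorted(held_back)
    return training, validation


# Hold back 25% of 100 rows for validation.
training_rows, validation_rows = holdback_split(100, 0.25, seed=1)
```

Fixing the seed makes the split reproducible, which is useful when comparing several trees on the same training and validation rows.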

Validation Column

Uses a numeric column that defines the validation sets. This column should contain at most three distinct values:

– If the validation column has two levels, the smaller value defines the training set and the larger value defines the validation set.

– If the validation column has three levels, the values, in order of increasing size, define the training, validation, and test sets.

– If the validation column has more than three levels, the rows that contain the three smallest values define the training, validation, and test sets. All other rows are excluded from the analysis.
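The level rules above can be expressed as a small mapping from column values to roles. The following Python sketch is illustrative and is not JMP code. The function name `assign_validation_roles` and the role labels are hypothetical, and the handling of a column with a single level is an assumption because the rules do not cover that case.

```python
def assign_validation_roles(values):
    """Map each validation column value to a role based on its rank.

    The smallest distinct value is training, the next is validation,
    and the third is test. Rows with any larger value are excluded.
    A column with one level is labeled training throughout, which is
    an assumption of this sketch.
    """
    levels = sorted(set(values))
    roles = ["training", "validation", "test"]
    # zip stops at the shorter input, so two levels yield two roles.
    role_of = dict(zip(levels[:3], roles))
    return [role_of.get(value, "excluded") for value in values]


two_levels = assign_validation_roles([0, 1, 0, 1])
four_levels = assign_validation_roles([2, 0, 1, 3])
```

With four levels, the row whose value is 3 is labeled excluded, which matches the rule that only the three smallest values are used.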

If you click the Validation button with no columns selected in the Select Columns list, you can add a validation column to your data table. For more information about the Make Validation Column utility, see Make Validation Column.

Tip: To use K Fold or Nested K Fold crossvalidation, fit a partition model through the Model Screening platform. See Model Screening.

Crossvalidation Report

The Crossvalidation report shows the following:

k-fold

The number of folds.

-2LogLike or SSE

Gives twice the negative log-likelihood (-2LogLikelihood) when the response is categorical. Gives the sum of squared errors (SSE) when the response is continuous. The first row gives results averaged over the folds. The second row gives results for the single model fit to all observations. For more information about the log-likelihood, see Likelihood, AICc, and BIC in Fitting Linear Models.

RSquare

The first row gives the RSquare value averaged over the folds. The second row gives the RSquare value for the single model fit to all observations.
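The quantities in this report can be illustrated with a short Python sketch. It uses the standard definitions of SSE, RSquare for a continuous response (1 - SSE/SST), and -2 times the log-likelihood for a categorical response, and it is not JMP's implementation. The function names and the simple per-fold averaging in `fold_average` are assumptions, and the RSquare that JMP reports for a categorical response is not covered here.

```python
import math


def sse(actual, predicted):
    """Sum of squared errors for a continuous response."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted))


def rsquare(actual, predicted):
    """RSquare for a continuous response, computed as 1 - SSE/SST."""
    mean_y = sum(actual) / len(actual)
    sst = sum((y - mean_y) ** 2 for y in actual)
    if sst == 0:
        raise ValueError("RSquare is undefined for a constant response")
    return 1.0 - sse(actual, predicted) / sst


def neg2_loglike(actual_levels, predicted_probs):
    """Twice the negative log-likelihood for a categorical response.

    predicted_probs[i] maps each level to its predicted probability
    for row i.
    """
    return -2.0 * sum(
        math.log(probs[level])
        for level, probs in zip(actual_levels, predicted_probs)
    )


def fold_average(metric, folds):
    """Average a metric over folds given (actual, predicted) pairs."""
    return sum(metric(a, p) for a, p in folds) / len(folds)


# Two hypothetical folds of held-out actual and predicted values.
folds = [([1, 2], [1, 3]), ([0, 0], [1, 2])]
mean_sse = fold_average(sse, folds)
```

The per-fold SSE values here are 1 and 5, so the averaged value corresponds to what the first row of the report would show under these assumptions.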
