Partition Method

The Partition platform provides four methods for producing a final tree:

•	For the Decision Tree method, see Decision Tree.

•	For the Bootstrap Forest method, see Bootstrap Forest.

•	For the Boosted Tree method, see Boosted Tree.

•	For the K Nearest Neighbors method, see K Nearest Neighbors.

Decision Tree

The Decision Tree method makes a single pass through the data and produces a single tree. You can interactively grow the tree one split at a time, or grow the tree automatically if validation is used. Because the reports for continuous and categorical responses differ, details are presented separately.

Decision Tree Report for Continuous Responses

As an example for a continuous response, use the Boston Housing.jmp data table. Assign mvalue to the Y, Response role. Assign all the other variables to the X, Factor role. Set the Validation Portion to 0 so that your results match those shown here. If using JMP Pro, select Decision Tree from the Method menu. Click OK. The initial report displays the partition graph, control buttons, a summary panel, and the first node of the tree (Decision Tree Initial Report for a Continuous Response).

Decision Tree Initial Report for a Continuous Response

The Split button is used to partition the data, creating a tree of partitions. Repeatedly splitting the data results in branches and leaves of the tree. This can be thought of as growing the tree. The Prune button is used to combine the most recent split back into one group.

Summary Panel (Continuous Response)

RSquare

The current value of R2.

RMSE

The root mean square error.

The number of observations (if no Freq variable is used).

Number of Splits

The current number of splits.

AICc

The corrected Akaike’s Information Criterion. For more details, see Fitting Linear Models.

Node (Continuous Response)

Count

The number of rows in the branch.

Mean

The average response for all rows in that branch.

Std Dev

The standard deviation of the response for all rows in that branch.

Candidates (Continuous Response)

Initially, all rows are in one branch. For each column, the Candidates report gives details about the optimal split. In order to determine the split, each X column, and all possible splits for that column, are considered. The columns of the Candidate report are:

Term

Shows the candidate columns.

Candidate SS

Sum of squares for the best split. Shown if the response is continuous.

Candidate G^2

Likelihood ratio chi-square for the best split. Shown if the response is categorical.

LogWorth

The LogWorth statistic, defined as -log10(p-value). The optimal split is the one that maximizes the LogWorth. See Statistical Details for additional details.

Cut Point

For a continuous term, the single value that determines the split is given. For a categorical term, the levels in the left-most split are listed.

As shown in Candidates Report (Continuous Response), the rooms column has the largest LogWorth and therefore defines the optimum split. The Cut Point value of 6.943 indicates that the split is into the nodes: rooms < 6.943 and rooms > 6.943.

Candidates Report (Continuous Response)

The optimum split is noted by an asterisk. However, there are cases where the Candidate SS is higher for one variable, but the Logworth is higher for a different variable. In this case > and < are used to point in the best direction for each variable. The asterisk corresponds to the condition where they agree. See Statistical Details for more information about LogWorth and SS.

Click the Split button and notice the first split is made on the column rooms, at a value of 6.943. Open the two new candidate reports (First Split (Continuous Response)).

First Split (Continuous Response)

The original 506 observations are now split into two parts:

•	A left leaf, corresponding to rooms < 6.943, has 430 observations.

•	A right leaf, corresponding to rooms ≥ 6.943, has 76 observations.

For the left leaf, the next split would happen on the column lstat, which has an SS of 7,311.85. For the right leaf, the next split would happen on the column rooms, which has an SS of 3,060.96. Because the SS for the left leaf is higher, using the Split button again produces a split on the left leaf, on the column lstat.

Click the Split button to make the next split (Second Split (Continuous Response)).

Second Split (Continuous Response)

The 430 observations from the previous left leaf are now split into two parts:

•	A left leaf, corresponding to lstat ≥ 14.43, has 175 observations.

•	A right leaf, corresponding to lstat < 14.43, has 255 observations.

The 506 original observations are now split into three parts:

•	A leaf corresponding to rooms < 6.943 and lstat ≥ 14.43.

•	A leaf corresponding to rooms < 6.943 and lstat < 14.43.

•	A leaf corresponding to rooms ≥ 6.943.

The predicted value for the observations in each leaf is the average response. The plot is divided into three sections, corresponding to the three leafs. These predicted values are shown on the plot with black lines. The points are put into random horizontal positions in each section. The vertical position is based on the response.

Stopping Rules

If validation is not used, the platform is purely interactive. Click the Split button to perform splits. Hold the Shift key as you click the Split button to specify multiple splits. If validation is not enabled, Partition is an exploratory platform intended to help you investigate relationships interactively

When validation is used, the user has the option to perform automatic splitting. This allows for repeated splitting without having to repeatedly click the Split button. See Automatic Splitting for details about the stopping rule.

Decision Tree Report for Categorical Responses

As an example for a categorical response, use the Car Poll.jmp data table. Assign country to the Y, Response role. Assign all the other variables to the X, Factor role. Set the Validation Portion to 0 so that your results agree with those shown here. If using JMP Pro, select Decision Tree from the Method menu. Click OK.

In the report, select Display Options > Show Split Prob. Click Split twice. The report is shown in Decision Tree Report for a Categorical Response.

•	The G2 statistic is given instead of the Mean and Std Dev at the top of each leaf, and instead of SS in the Candidates report. See Statistical Details for more information about G2.

•	The Rate statistic gives the proportion of observations in the leaf that are in each response level. The colored bars represent those proportions. (Select Display Options > Show Split Prob.)

•	The Prob statistic is the predicted value (a probability) for each response level. See Statistical Details for more information about the Prob statistic. (Select Display Options > Show Split Prob.)

•	The Y axis of the plot is divided into sections corresponding to the predicted probabilities of the response levels for each leaf. The predicted probabilities always sum to one across the response levels. If Display Options > Show Points is selected:

‒	Points are distributed evenly and randomly in the horizontal direction.

‒	In the vertical direction, they are distributed randomly within the box for their category.

•	The Color Points button appears. This colors the points on the plot according to the response levels.

Decision Tree Report for a Categorical Response

Node Options

This section describes the options on the red triangle menu of each node.

Split Best

finds and executes the best split at or below this node.

Split Here

splits at the selected node on the best column to split by.

Split Specific

lets you specify where a split takes place. This is useful in showing what the criterion is as a function of the cutpoint, as well as in determining custom cutpoints. After selecting this command, the following window appears.

Window for the Split Specific Command

The Split at menu has the following options:

Optimal Value splits at the optimal value of the selected variable.

Specified Value enables you to specify the level where the split takes place.

Output Split Table produces a data table showing all possible splits and their associated split value.

Prune Below

eliminates the splits below the selected node.

Prune Worst

finds and removes the worst split below the selected node.

Select Rows

selects the data table rows corresponding to this leaf. You can extend the selection by holding down the Shift key and choosing this command from another node.

Show Details

produces a data table that shows the split criterion for a selected variable. The data table, composed of split intervals and their associated criterion values, has an attached script that produces a graph for the criterion.

Lock

prevents a node or its subnodes from being chosen for a split. When checked, a lock icon is shown in the node title.

Platform Options

The section describes the options on the platform red triangle menu.

Display Options

gives a submenu consisting of items that toggle report elements on and off.

Show Points shows or hides the points. For categorical responses, this option shows the points or colored panels.

Show Tree shows or hides the large tree of partitions.

Show Graph shows or hides the partition graph.

Show Split Bar shows or hides the colored bars showing the split proportions in each leaf. This is for categorical responses only.

Show Split Stats shows or hides the split statistics. See Statistical Details for more information about the categorical split statistic G2.

Show Split Prob shows or hides the Rate and Prob statistics. This is for categorical responses only.

JMP automatically shows the Rate and Prob statistics when you select Show Split Count. See Statistical Details for more information about Rate and Prob.

Show Split Count shows or hides each frequency level for all nodes in the tree. This is for categorical responses only.

When you select this option, JMP automatically selects Show Split Prob. And when you deselect Show Split Prob, the counts do not appear.

Show Split Candidates shows or hides the Candidates report.

Sort Split Candidates sorts the candidates report by the statistic or the log(worth), whichever is appropriate. This option can be turned on and off. When off, it does not change any reports, but new candidate reports are sorted in the order the X terms are specified, rather than by a statistic.

Split Best

splits the tree at the optimal split point. This is the same action as the Split button.

Prune Worst

removes the terminal split that has the least discrimination ability. This is equivalent to hitting the Prune Button.

Minimum Size Split

presents a dialog box where you enter a number or a fractional portion of the total sample size to define the minimum size split allowed. To specify a number, enter a value greater than or equal to 1. To specify a fraction of the sample size, enter a value less than 1. The default value is set to the maximum of 5 or the floor of the number of rows divided by 10,000.

Lock Columns

reveals a check box table to enable you to interactively lock columns so that they are not considered for splitting. You can toggle the display without affecting the individual locks.

Plot Actual by Predicted

produces a plot of actual values by predicted values. This is for continuous responses only.

Small Tree View

displays a smaller version of the partition tree to the right of the scatterplot.

Tree 3D

Shows or hides a 3-D plot of the tree structure. To access this option, hold down the Shift key and click the red-triangle menu.

Leaf Report

gives the mean and count or rates for the bottom-level leaves of the report.

Column Contributions

Displays a report showing each input column’s contribution to the fit. The report also shows how many times it defined a split and the total G2 or Sum of Squares attributed to that column.

Split History

shows a plot of RSquare versus the number of splits. If you use excluded row validation, holdback validation, or a validation column, separate curves are drawn for training and validation RSquare values. The RSquare curve is blue for the training set and red for the validation set. If you select K Fold Crossvalidation, the RSquare curve for all of the data is blue and the curve for the crossvalidation RSquare is green.

K Fold Crossvalidation

shows a Cross validation report, giving fit statistics for both the training and folded sets. For more information about validation, see KFold Crossvalidation.

ROC Curve

is described in the section ROC Curve. This is for categorical responses only.

Lift Curve

is described in the section Lift Curves. This is for categorical responses only.

Show Fit Details

Appears only for categorical responses. The Fit Details report shows several measures of fit and provides a Confusion Matrix report. The measures of fit are the following:

‒	Entropy RSquare compares the log-likelihoods from the fitted model and the constant probability model.

‒

Generalized RSquare is a measure that can be applied to general regression models. It is based on the likelihood function L and is scaled to have a maximum value of 1. The Generalized RSquare measure simplifies to the traditional RSquare for continuous normal responses in the standard least squares setting. Generalized RSquare is also known as the Nagelkerke or Craig and Uhler R2, which is a normalized version of Cox and Snell’s pseudo R2. See Nagelkerke (1991).

‒	Mean -Log p is the average of -log(p), where p is the fitted probability associated with the event that occurred.

‒	RMSE is the root mean square error, where the differences are between the response and p (the fitted probability for the event that actually occurred).

‒	Mean Abs Dev is the average of the absolute values of the differences between the response and p (the fitted probability for the event that actually occurred).

‒	Misclassification Rate is the rate for which the response category with the highest fitted probability is not the observed category.

Note: For Entropy RSquare and Generalized RSquare, values closer to 1 indicate a better fit. For Mean -Log p, RMSE, Mean Abs Dev, and Misclassification Rate, smaller values indicate a better fit.

The Confusion Matrix report shows matrices for the training set and for the validation and test sets (if defined). The Confusion Matrix is a two-way classification of actual and predicted responses.

If the response has a Profit Matrix column property, or if you specify costs using the Specify Profit Matrix option, then a Decision Matrix report appears. See Decision Matrix Report.

Save Columns

is a submenu for saving model and tree results, and creating SAS code.

Save Residuals saves the residual values from the model to the data table.

Save Predicteds saves the predicted values from the model to the data table.

Save Leaf Numbers saves the leaf numbers of the tree to a column in the data table.

Save Leaf Labels saves leaf labels of the tree to the data table. The labels document each branch that the row would trace along the tree. Each branch is separated by “&”. An example label could be “size(Small,Medium)&size(Small)”. However, JMP does not include redundant information in the form of category labels that are repeated. A category label for a leaf might refer to an inclusive list of categories in a higher tree node. A caret (‘^”) appears where the tree node with redundant labels occurs. Therefore, “size(Small,Medium)&size(Small)” is presented as ^&size(Small).

Save Prediction Formula saves the prediction formula to a column in the data table. The formula consists of nested conditional clauses that describe the tree structure. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property.

Save Tolerant Prediction Formula saves a formula that predicts even when there are missing values and when Informative Missing has not been checked. The prediction formula tolerates missing values by randomly allocating response values for missing predictors to a split. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property. If you have checked Informative Missing, you can save the Tolerant Prediction Formula by holding the Shift key as you click on the report’s red triangle.

Save Leaf Number Formula saves a column containing a formula in the data table that computes the leaf number.

Save Leaf Label Formula saves a column containing a formula in the data table that computes the leaf label.

Make SAS DATA Step creates SAS code for scoring a new data set.

Specify Profit Matrix

Enables you to specify profit or costs associated with correct or incorrect classification decisions. Only available for categorical responses. You can assign profit and cost values to each combination of actual and predicted response categories. A row labeled Undecided enables you to specify the costs of classifying into an alternative category. Checking Save to Column as Property saves your assignments to the response column as a property. Not checking Save to Column as Property applies the Profit Matrix only to the current Partition report.

When you define costs using the Specify Profit Matrix option and then select Show Fit Details, a Decision Matrix report appears. See Decision Matrix Report.

When you specify a profit matrix and select Save Columns > Save Prediction Formula from the report’s red triangle menu, additional columns with formulas are saved to the data table. These columns are:

‒	Profit for <level>: For each level of the response, a column gives the expected profit for classifying each observation into that level.

‒	Most Profitable Prediction for <column name>: For each observation, gives the level of the response with the highest expected profit.

‒	Expected Profit for <column name>: For each observation, gives the expected profit for the classification defined by the Most Profitable Prediction column.

‒	Actual Profit for <column name>: For each observation, gives the actual profit for classifying that observation into the level specified by the Most Profitable Prediction column.

Color Points

colors the points based on their response level. This is for categorical responses only, and does the same thing as the Color Points button (see Decision Tree Report for Categorical Responses).

Script

contains options that are available to all platforms. See Using JMP.

Automatic Splitting

The Go button (shown in The Go Button) appears when you enable validation. For more information about using validation, see Validation.

The Go button provides for repeated splitting without having to repeatedly click the Split button. When you click the Go button, the platform performs repeated splitting until the validation R-Square is better than what the next 10 splits would obtain. This rule might produce complex trees that are not very interpretable, but have good predictive power.

Using the Go button turns on the Split History command. If using the Go button results in a tree with more than 40 nodes, the Show Tree command is turned off.

The Go Button

Bootstrap Forest

The Bootstrap Forest method makes many trees, and averages the predicted values to get the final predicted value. Each tree is grown on a different random sample (with replacement) of observations, and each split on each tree considers only a random sample of candidate columns for splitting. The process can use validation to assess how many trees to grow, not to exceed the specified number of trees.

Another word for bootstrap-averaging is bagging. Those observations included in the growing of a tree are called the in-bag sample, abbreviated IB. Those not included are called the out-of-bag sample, abbreviated OOB.

Bootstrap Forest Fitting Options

If the Bootstrap Forest method is selected on the platform launch window, the Bootstrap Forest options window appears after clicking OK. Bootstrap Forest Fitting Options shows the window using the Car Poll.jmp data table. The column country is used as the response, and the other columns are used as the predictors.

Bootstrap Forest Fitting Options

The options on the Bootstrap Forest options window are described here:

Number of rows

gives the number of observations in the data table.

Number of terms

gives the number of columns specified as predictors.

Number of trees in the forest

is the number of trees to grow, and then average together.

Number of terms sampled per split

is the number of columns to consider as splitting candidates at each split. For each split, a new random sample of columns is taken as the candidate set.

Bootstrap sample rate

is the proportion of observations to sample (with replacement) for growing each tree. A new random sample is generated for each tree.

Minimum Splits Per Tree

is the minimum number of splits for each tree.

Maximum Splits Per Tree

is the maximum number of splits for each tree.

Minimum Size Split

is the minimum number of observations needed on a candidate split.

Early Stopping

is checked by default. The process stops growing additional trees if adding more trees does not improve the validation statistic. If not checked, the process continues until the specified number of trees is reached. This option appears only if validation is used.

Multiple Fits over number of terms

is checked to create a bootstrap forest for several values of Number of terms sampled per split. The lower value is specified above by the Number of terms samples per split option. The upper value is specified by the following option:

Max Number of terms is the maximum number of terms to consider for a split.

Bootstrap Forest Report

The Bootstrap Forest report is shown in Bootstrap Forest.

Bootstrap Forest

The results on the report are described here:

Model Validation - Set Summaries

provides fit statistics for all the models fit if you selected the Multiple Fits option on the options window.

Specifications

provides information about the partitioning process.

Overall Statistics

provides fit statistics for the training, validation, and test sets.

Confusion Matrix

provides two-way classifications of actual and predicted response levels for the training, validation, and test sets. This is available only with categorical responses.

Decision Matrix

gives Decision Count and Decision Rate matrices for the training set and for validation and test sets (if defined). This report appears only if the response has a Profit Matrix column property or if you specify costs using the Specify Profit Matrix option. See Decision Matrix Report.

Cumulative Validation

provides a plot of the fit statistics versus the number of trees. The Cumulative Details report below the plot is a tabulation of the data on the plot. This is available only when validation is used.

Per-Tree Summaries

gives summary statistics for each tree.

Bootstrap Forest Platform Options

The Bootstrap Forest report red triangle menu has the following options:

Plot Actual by Predicted

Provides a plot of actual versus predicted values. This is only for continuous responses.

Column Contributions

Displays a report that shows each input column’s contribution to the fit. The report also shows how many times it defined a split and the total G2 or Sum of Squares attributed to that column.

Show Trees

Provides a menu of options for displaying the Tree Views report. The report produces a picture of each component tree.

None does not display the Tree Views Report.

Show names displays the trees labeled with the splitting columns.

Show names categories displays the trees labeled with the splitting columns and splitting values.

Show names categories estimates displays the trees labeled with the splitting columns, splitting values, and summary statistics for each node.

ROC Curve

is described in the section ROC Curve. This is for categorical responses only.

Lift Curve

is described in the section Lift Curves. This is for categorical responses only.

Save Columns

Provides a menu of options for saving model and tree results, and creating SAS code.

Save Predicteds saves the predicted values from the model to the data table.

Save Prediction Formula saves the prediction formula to a column in the data table.

Save Residuals saves the residuals to the data table. This is for continuous responses only.

Save Cumulative Details creates a data table containing the fit statistics for each tree. Only available if validation is used.

Make SAS DATA Step creates SAS code for scoring a new data set.

Specify Profit Matrix

Enables you to specify profit or costs associated with correct or incorrect classification decisions. Only available for categorical responses. See Specify Profit Matrix.

Script

contains options that are available to all platforms. See Using JMP.

Boosted Tree

Boosting is the process of building a large, additive decision tree by fitting a sequence of smaller trees. Each of the smaller trees is fit on the scaled residuals of the previous tree. The trees are combined to form the larger final tree. The process can use validation to assess how many stages to fit, not to exceed the specified number of stages.

The tree at each stage is short, typically 1-5 splits. After the initial tree, each stage fits the residuals from the previous stage. The process continues until the specified number of stages is reached, or, if validation is used, until fitting an additional stage no longer improves the validation statistic. The final prediction is the sum of the estimates for each terminal node over all the stages.

If the response is categorical, the residuals fit at each stage are offsets of linear logits. The final prediction is a logistic transformation of the sum of the linear logits over all the stages.

For categorical responses, only those with two levels are supported.

Boosted Tree Fitting Options

If the Boosted Tree method is selected on the platform launch window, the Boosted Tree options window appears after clicking OK. Boosted Tree Options Window shows the options window for the Car Poll.jmp sample data table with sex as Y, Response, all other columns as X, Factor, and a Validation Portion of 0.2.

Boosted Tree Options Window

The options on the Boosted Tree options window are:

Number of Layers

is the maximum number of stages to include in the final tree.

Splits per Tree

is the number of splits for each stage

Learning Rate

is a number such that 0 < r ≤ 1. Learning rates close to 1 result in faster convergence on a final tree, but also have a higher tendency to overfit data. Use learning rates closer to 1 when a small Number of Layers is specified.

Overfit Penalty

is a biasing parameter that helps protect against fitting probabilities equal to zero. Appears only if the response is categorical.

Minimum Size Split

is the minimum number of observations needed on a candidate split.

Early Stopping

is checked by default. The boosting process stops fitting additional stages if adding more stages does not improve the validation statistic. If not checked, the boosting process continues until the specified number of stages is reached. This option appears only if validation is used.

Multiple Fits over splits and learning rate

is checked to create a boosted tree for every combination of Splits per Tree and Learning Rate. The lower ends of the combinations are specified by the Splits per Tree and Learning Rate options. The upper ends of the combinations are specified by the following options:

Max Splits Per Tree is the upper end for Splits per Tree.

Max Learning Rate is the upper end for Learning Rate.

Boosted Tree Report

The Boosted Tree report is shown in Boosted Tree Report.

Boosted Tree Report

The results on the report are described here:

Model Validation - Set Summaries

provides fit statistics for all the models fit if you selected the Multiple Splits option on the options window.

Specifications

provides information about the partitioning process.

Overall Statistics

provides fit statistics for the training, validation, and test sets.

Confusion Matrix

provides confusion statistics for both the training, validation, and test sets. This is available only with categorical responses.

Decision Matrix

Cumulative Validation

provides a plot of the fit statistics versus the number of stages. The Cumulative Details report below the plot is a tabulation of the data on the plot. This is available only when validation is used.

Boosted Tree Platform Options

The Boosted Tree report red-triangle menu has the following options:

Show Trees

is a submenu for displaying the Tree Views report. The report produces a picture of the tree at each stage of the boosting process. For details about the options, see Show Trees.

Plot Actual by Predicted

provides a plot of actual versus predicted values. This is only for continuous responses.

Column Contributions

Displays a report showing each input column’s contribution to the fit. The report also shows how many times it defined a split and the total G2 or Sum of Squares attributed to that column.

ROC Curve

is described in the section ROC Curve. This is for categorical responses only.

Lift Curve

is described in the section Lift Curves. This is for categorical responses only.

Save Columns

is a submenu for saving model and tree results, and creating SAS code.

Save Predicteds saves the predicted values from the model to the data table.

Save Prediction Formula saves the prediction formula to a column in the data table.

Save Residuals saves the residuals to the data table. This is for continuous responses only.

Save Offset Estimates saves the sums of the linear components. These are the logits of the fitted probabilities. This is for categorical responses only.

Save Tree Details creates a data table containing split details and estimates for each stage.

Save Cumulative Details creates a data table containing the fit statistics for each stage. Only available is validation is used.

Make SAS DATA Step creates SAS code for scoring a new data set.

Specify Profit Matrix

Enables you to specify profit or costs associated with correct or incorrect classification decisions. Only available for categorical responses. See Specify Profit Matrix.

Script

contains options that are available to all platforms. See Using JMP.

K Nearest Neighbors

The K Nearest Neighbors method enables you to predict values of a response variable based on the responses of the k nearest rows. This method is different than the other Partition methods since the space is not partitioned by single-variable decision nodes. This method also has some drawbacks: it does not produce fitted probabilities for categorical responses, and it does not make a prediction formula that is practical for large problems.

K Nearest Neighbors Model

The k nearest rows to a given row are determined by calculating the Euclidean distance between the row and each of the other rows. For a continuous response, the predicted value is the average of the responses for the k nearest rows. For a categorical response, the predicted value is the most frequent response level of the k nearest neighbors. If several levels are tied as the most frequent levels, responses are assigned from these levels at random.

If the K Nearest Neighbors method is selected on the platform launch window, after you click OK you must enter a value K for the maximum number of nearest neighbors. A model is computed for each value of k between 1 and K. The value chosen must be an integer between 1 and one less than the number of rows in the data table. After you specify a maximum k and click OK, the K Nearest Neighbors report appears.

Each continuous predictor is scaled by its standard deviation. Missing values in a continuous predictor are replaced by the mean of the predictor. Each categorical predictor is expanded into indicator values. Missing cells in a categorical predictor are replaced by a value of zero.

K Nearest Neighbors Report

For each response, the K Nearest Neighbors report provides summary information for the K models that are fit. It contains tables for the training set and for the validation and test sets, if you defined these using validation. The columns that appear in the summary table depend on the modeling type of the response Y. The number of rows in each table is equal to K.

Summary Table (Continuous Response)

The summary table for a continuous response contains the following columns:

The number of nearest neighbors used in the model.

Count

The number of observations used to fit the model.

RMSE

The root mean square error.

SSE

The sum of squared error.

You can choose among the nearest neighbor models based on the value of the RMSE for each model. Smaller RMSE values indicate better fit. If you use validation, you might select a model based on the smallest RMSE value for the validation or test sets.

Summary Table (Categorical Response)

The summary table for a categorical response contains the following columns:

The number of nearest neighbors used in the model.

Count

The number of observations used to fit the model.

Misclassification Rate

The proportion of observations misclassified by the model. This is calculated as Misclassifications divided by Count.

Misclassifications

The number of observations that are incorrectly predicted by the model.

When the response is categorical, the K Nearest Neighbors report also shows a confusion matrix for the smallest k value with the lowest Misclassification Rate. If you use validation, confusion matrices for the validation and test sets appear. A confusion matrix is a two-way classification of actual and predicted responses. Use the confusion matrices and the misclassification rates to guide your selection of a model.

K Nearest Neighbors Report for Tablet Production.jmp shows a K Nearest Neighbors report for predicting Lot Acceptance in the Tablet Production.jmp data table. The confusion matrix shown is for k = 3, which is the model with the lowest Misclassification Rate for the validation set.

K Nearest Neighbors Report for Tablet Production.jmp

K Nearest Neighbors Platform Options

The K Nearest Neighbors red triangle menu contains the following options:

Save Predicteds

Saves K predicted value columns to the data table. The columns are named Predicted <Y, Response> <k>. The kth column contains predictions for the model based on the k nearest neighbors, where Y, Response is the name of the response column.

Save Near Neighbor Rows

Saves K columns to the data table. The columns are named RowNear <k>. For a given row, the kth column contains the row number of its kth nearest neighbor.

Caution: The row numbers in the columns RowNear <k> do not update when you reorder the rows in your data table. If you reorder the rows, the values in those columns are misleading.

Save Prediction Formula

Saves a column that contains a prediction formula for a specific k nearest neighbor model. Enter a value for k when prompted. The prediction formula contains all the training data, so this option might not be practical for large data tables.

Script

contains options that are available to all platforms. See Using JMP.