As an example for a continuous response, use the Boston Housing.jmp data table. Assign mvalue to the Y, Response role. Assign all the other variables to the X, Factor role. Set the Validation Portion to 0 so that your results match those shown here. If using JMP Pro, select Decision Tree from the Method menu. Click OK. The initial report displays the partition graph, control buttons, a summary panel, and the first node of the tree (Decision Tree Initial Report for a Continuous Response).
The Split button is used to partition the data, creating a tree of partitions. Repeatedly splitting the data results in branches and leaves of the tree. This can be thought of as growing the tree. The Prune button is used to combine the most recent split back into one group.
AICc is the corrected Akaike’s Information Criterion. For more details, see Fitting Linear Models.
Initially, all rows are in one branch. For each column, the Candidates report gives details about the optimal split. To determine the split, each X column, and all possible splits for that column, are considered. The columns of the Candidates report are:
The LogWorth statistic, defined as -log10(p-value). The optimal split is the one that maximizes the LogWorth. See Statistical Details for additional details.
As shown in Candidates Report (Continuous Response), the rooms column has the largest LogWorth and therefore defines the optimum split. The Cut Point value of 6.943 indicates that the split is into the nodes rooms < 6.943 and rooms ≥ 6.943.
The optimum split is marked with an asterisk. In some cases, however, the candidate SS is higher for one variable while the LogWorth is higher for a different variable. The > and < symbols then point in the best direction for each variable, and the asterisk corresponds to the condition where the two measures agree. See Statistical Details for more information about LogWorth and SS.
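To make the candidate search concrete, here is a minimal sketch (not JMP’s implementation) of scanning a numeric predictor for the cut point that maximizes the candidate SS for a continuous response. JMP additionally reports significance as LogWorth = -log10(p-value); this sketch ranks cut points by SS alone, and the data values are made up for illustration.

```python
def best_split(x, y):
    """Scan every cut point of numeric predictor x and return the one that
    maximizes the between-group sum of squares (the candidate SS)."""
    def ss(v):  # corrected sum of squares about the mean
        m = sum(v) / len(v)
        return sum((t - m) ** 2 for t in v)

    pairs = sorted(zip(x, y))
    total = ss(y)
    best_cut, best_ss = None, -1.0
    for i in range(1, len(pairs)):
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        candidate = total - ss(left) - ss(right)  # SS explained by the split
        if candidate > best_ss:
            best_cut, best_ss = pairs[i][0], candidate
    return best_cut, best_ss

# Tiny illustration: y jumps when x crosses 5, so the cut lands there.
x = [1, 2, 3, 4, 6, 7, 8, 9]
y = [10, 11, 10, 11, 20, 21, 20, 21]
cut, ss_gain = best_split(x, y)
```

Every boundary between consecutive sorted x values is a candidate; the winner is the boundary that makes the two child means most different.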
Click the Split button and notice the first split is made on the column rooms, at a value of 6.943. Open the two new candidate reports (First Split (Continuous Response)).
•	A left leaf, corresponding to rooms < 6.943, has 430 observations.

For the left leaf, the next split would happen on the column lstat, which has an SS of 7,311.85. For the right leaf, the next split would happen on the column rooms, which has an SS of 3,060.96. Because the SS for the left leaf is higher, using the Split button again produces a split on the left leaf, on the column lstat.

•	A right leaf, corresponding to lstat < 14.43, has 255 observations.
If validation is not used, the platform is purely interactive: click the Split button to perform splits, or hold the Shift key as you click the Split button to specify multiple splits. Without validation, Partition is an exploratory platform intended to help you investigate relationships interactively.
When validation is used, the user has the option to perform automatic splitting. This allows for repeated splitting without having to repeatedly click the Split button. See Automatic Splitting for details about the stopping rule.
As an example for a categorical response, use the Car Poll.jmp data table. Assign country to the Y, Response role. Assign all the other variables to the X, Factor role. Set the Validation Portion to 0 so that your results agree with those shown here. If using JMP Pro, select Decision Tree from the Method menu. Click OK.
In the report, select Display Options > Show Split Prob. Click Split twice. The report is shown in Decision Tree Report for a Categorical Response.
•	The G2 statistic is given instead of the Mean and Std Dev at the top of each leaf, and instead of SS in the Candidates report. See Statistical Details for more information about G2.

•	The Prob statistic is the predicted value (a probability) for each response level. See Statistical Details for more information about the Prob statistic. (Select Display Options > Show Split Prob.)

•	The Color Points button appears. This colors the points on the plot according to the response levels.
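For categorical responses, the G2 reported for a candidate split is a likelihood-ratio chi-square computed from the counts on each side of the split. A minimal sketch, assuming the standard 2·O·ln(O/E) form over the split-by-response count table (not JMP’s exact code):

```python
import math

def g2(table):
    """Likelihood-ratio chi-square for a two-way count table
    (rows = split sides, columns = response levels)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = row_totals[i] * col_totals[j] / n
                stat += 2.0 * obs * math.log(obs / expected)
    return stat

# An uninformative split (same response mix on both sides) gives G2 = 0;
# a split that separates the levels gives a large G2.
g2_flat = g2([[20, 20], [20, 20]])
g2_sharp = g2([[30, 10], [10, 30]])
```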
The Split at menu has the following options:
Optimal Value splits at the optimal value of the selected variable.
Specified Value enables you to specify the level where the split takes place.
Output Split Table produces a data table showing all possible splits and their associated split value.
Show Points shows or hides the points. For categorical responses, this option shows the points or colored panels.
Show Tree shows or hides the large tree of partitions.
Show Graph shows or hides the partition graph.
Show Split Bar shows or hides the colored bars showing the split proportions in each leaf. This is for categorical responses only.
Show Split Stats shows or hides the split statistics. See Statistical Details for more information about the categorical split statistic G2.
Show Split Prob shows or hides the Rate and Prob statistics. This is for categorical responses only.
JMP automatically shows the Rate and Prob statistics when you select Show Split Count. See Statistical Details for more information about Rate and Prob.
Show Split Count shows or hides each frequency level for all nodes in the tree. This is for categorical responses only.
When you select this option, JMP automatically selects Show Split Prob. And when you deselect Show Split Prob, the counts do not appear.
Show Split Candidates shows or hides the Candidates report.
Sort Split Candidates sorts the Candidates report by the statistic or the LogWorth, whichever is appropriate. This option can be turned on and off. When off, it does not change existing reports, but new candidate reports are sorted in the order in which the X terms were specified, rather than by a statistic.
Column Contributions displays a report showing each input column’s contribution to the fit. The report also shows how many times it defined a split and the total G2 or Sum of Squares attributed to that column.
K Fold Crossvalidation shows a crossvalidation report, giving fit statistics for both the training and folded sets. For more information about validation, see K-Fold Crossvalidation.
ROC Curve is described in the section ROC Curve. This is for categorical responses only.
Lift Curve is described in the section Lift Curves. This is for categorical responses only.
‒	Entropy RSquare compares the log-likelihoods from the fitted model and the constant probability model.

‒	Generalized RSquare is a measure that can be applied to general regression models. It is based on the likelihood function L and is scaled to have a maximum value of 1. The Generalized RSquare measure simplifies to the traditional RSquare for continuous normal responses in the standard least squares setting. Generalized RSquare is also known as the Nagelkerke or Cragg and Uhler R2, which is a normalized version of Cox and Snell’s pseudo R2. See Nagelkerke (1991).

‒	Mean -Log p is the average of -log(p), where p is the fitted probability associated with the event that occurred.

‒	RMSE is the root mean square error, where the differences are between the response and p (the fitted probability for the event that actually occurred).

‒	Mean Abs Dev is the average of the absolute values of the differences between the response and p (the fitted probability for the event that actually occurred).

‒	Misclassification Rate is the rate at which the response category with the highest fitted probability is not the observed category.
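Given per-row fitted probabilities, these measures can be computed directly. A sketch under the definitions above (the function and variable names are mine, not JMP’s; the response indicator for the event that occurred is 1, so each difference is 1 − p):

```python
import math

def fit_details(p_fitted, p_baseline, correct):
    """Compute several of the fit measures from per-row lists:
    p_fitted   - fitted probability of the event that actually occurred
    p_baseline - that event's probability under the constant (base-rate) model
    correct    - True if the most probable predicted level was the observed one
    """
    n = len(p_fitted)
    ll_model = sum(math.log(p) for p in p_fitted)
    ll_const = sum(math.log(p) for p in p_baseline)
    return {
        "Entropy RSquare": 1.0 - ll_model / ll_const,
        "Mean -Log p": -ll_model / n,
        "RMSE": math.sqrt(sum((1.0 - p) ** 2 for p in p_fitted) / n),
        "Mean Abs Dev": sum(1.0 - p for p in p_fitted) / n,
        "Misclassification Rate": sum(not c for c in correct) / n,
    }

# Hypothetical three-row, two-level example with a 50/50 base rate.
p_fit = [0.9, 0.8, 0.6]
p_base = [0.5, 0.5, 0.5]
hit = [True, True, False]
stats = fit_details(p_fit, p_base, hit)
```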
If the response has a Profit Matrix column property, or if you specify costs using the Specify Profit Matrix option, then a Decision Matrix report appears. See Decision Matrix Report.
Save Residuals saves the residual values from the model to the data table.
Save Predicteds saves the predicted values from the model to the data table.
Save Leaf Numbers saves the leaf numbers of the tree to a column in the data table.
Save Leaf Labels saves leaf labels of the tree to the data table. The labels document each branch that the row would trace along the tree. Each branch is separated by “&”. An example label could be “size(Small,Medium)&size(Small)”. However, JMP does not include redundant information in the form of category labels that are repeated. A category label for a leaf might refer to an inclusive list of categories in a higher tree node. A caret (“^”) appears in place of the tree node with the redundant labels. Therefore, “size(Small,Medium)&size(Small)” is presented as “^&size(Small)”.
Save Prediction Formula saves the prediction formula to a column in the data table. The formula consists of nested conditional clauses that describe the tree structure. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property.
Save Tolerant Prediction Formula saves a formula that predicts even when there are missing values and when Informative Missing has not been checked. The prediction formula tolerates missing values by randomly allocating response values for missing predictors to a split. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property. If you have checked Informative Missing, you can save the Tolerant Prediction Formula by holding the Shift key as you click on the report’s red triangle.
Save Leaf Number Formula saves a column containing a formula in the data table that computes the leaf number.
Save Leaf Label Formula saves a column containing a formula in the data table that computes the leaf label.
Make SAS DATA Step creates SAS code for scoring a new data set.
Specify Profit Matrix enables you to specify profits or costs associated with correct or incorrect classification decisions. Only available for categorical responses. You can assign profit and cost values to each combination of actual and predicted response categories. A row labeled Undecided enables you to specify the costs of classifying into an alternative category. Checking Save to Column as Property saves your assignments to the response column as a property. If you do not check Save to Column as Property, the Profit Matrix applies only to the current Partition report.
When you define costs using the Specify Profit Matrix option and then select Show Fit Details, a Decision Matrix report appears. See Decision Matrix Report.
When you specify a profit matrix and select Save Columns > Save Prediction Formula from the report’s red triangle menu, additional columns with formulas are saved to the data table. These columns are:
‒	Profit for <level>: For each level of the response, a column gives the expected profit for classifying each observation into that level.

‒	Most Profitable Prediction for <column name>: For each observation, gives the level of the response with the highest expected profit.

‒	Expected Profit for <column name>: For each observation, gives the expected profit for the classification defined by the Most Profitable Prediction column.

‒	Actual Profit for <column name>: For each observation, gives the actual profit for classifying that observation into the level specified by the Most Profitable Prediction column.
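The calculation behind these columns weights each row of the profit matrix by the fitted probabilities and picks the decision with the highest expected profit. A small sketch with a made-up two-level matrix (the levels and profit values are hypothetical):

```python
def most_profitable(probs, profit):
    """probs: {level: fitted probability}.
    profit[d][a]: profit of deciding level d when the actual level is a.
    Returns the best decision, its expected profit, and all expected profits."""
    expected = {
        d: sum(probs[a] * profit[d][a] for a in probs)
        for d in profit
    }
    best = max(expected, key=expected.get)
    return best, expected[best], expected

# Hypothetical example: misclassifying an actual "Yes" as "No" is costly,
# so "Yes" can be the most profitable call even when it is less probable.
probs = {"Yes": 0.4, "No": 0.6}
profit = {
    "Yes": {"Yes": 1.0, "No": -0.2},
    "No": {"Yes": -2.0, "No": 0.5},
}
decision, exp_profit, _ = most_profitable(probs, profit)
```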
Color Points colors the points based on their response level. This is for categorical responses only, and does the same thing as the Color Points button (see Decision Tree Report for Categorical Responses).
The Go button (shown in The Go Button) appears when you enable validation. For more information about using validation, see Validation.
The Go button provides for repeated splitting without having to repeatedly click the Split button. When you click the Go button, the platform performs repeated splitting until the validation RSquare is better than what the next 10 splits would obtain. This rule might produce complex trees that are not very interpretable, but have good predictive power.
Using the Go button turns on the Split History command. If using the Go button results in a tree with more than 40 nodes, the Show Tree command is turned off.
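The stopping rule can be pictured as choosing, from a sequence of validation RSquare values, the first split count that no later split within the look-ahead window improves on. This is a simplified sketch of the rule described above, not JMP’s code; the RSquare history is made up.

```python
def choose_n_splits(val_rsq, lookahead=10):
    """Return the first split count whose validation RSquare is at least as
    good as anything reached within the next `lookahead` splits."""
    for k, r in enumerate(val_rsq):
        if r >= max(val_rsq[k + 1:k + 1 + lookahead], default=r):
            return k + 1  # split counts are 1-based
    return len(val_rsq)

# Validation RSquare after each split: improves, then plateaus and dips.
history = [0.30, 0.45, 0.52, 0.51, 0.50, 0.50,
           0.49, 0.49, 0.48, 0.48, 0.48, 0.47]
stop_at = choose_n_splits(history)
```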
Another word for bootstrap averaging is bagging. Observations included in the growing of a tree are called the in-bag sample, abbreviated IB. Those not included are called the out-of-bag sample, abbreviated OOB.
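The in-bag/out-of-bag partition comes from sampling rows with replacement: rows drawn (possibly more than once) are in-bag, and rows never drawn are out-of-bag. A minimal sketch (my own helper, not JMP’s sampling code):

```python
import random

def bootstrap_split(n_rows, seed=1):
    """Draw a bootstrap sample of row indices (with replacement); rows that
    are never drawn form the out-of-bag (OOB) set."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

ib, oob = bootstrap_split(100)
```

On average about a third of the rows end up out-of-bag, which is what makes OOB error estimates possible.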
If the Bootstrap Forest method is selected on the platform launch window, the Bootstrap Forest options window appears after clicking OK. Bootstrap Forest Fitting Options shows the window using the Car Poll.jmp data table. The column country is used as the response, and the other columns are used as the predictors.
Max Number of Terms is the maximum number of terms to consider for a split.
gives Decision Count and Decision Rate matrices for the training set and for validation and test sets (if defined). This report appears only if the response has a Profit Matrix column property or if you specify costs using the Specify Profit Matrix option. See Decision Matrix Report.
Column Contributions displays a report that shows each input column’s contribution to the fit. The report also shows how many times it defined a split and the total G2 or Sum of Squares attributed to that column.
None does not display the Tree Views Report.
Show names displays the trees labeled with the splitting columns.
Show names categories displays the trees labeled with the splitting columns and splitting values.
Show names categories estimates displays the trees labeled with the splitting columns, splitting values, and summary statistics for each node.
ROC Curve is described in the section ROC Curve. This is for categorical responses only.
Lift Curve is described in the section Lift Curves. This is for categorical responses only.
Save Predicteds saves the predicted values from the model to the data table.
Save Prediction Formula saves the prediction formula to a column in the data table.
Save Residuals saves the residuals to the data table. This is for continuous responses only.
Save Cumulative Details creates a data table containing the fit statistics for each tree. Only available if validation is used.
Make SAS DATA Step creates SAS code for scoring a new data set.
Enables you to specify profit or costs associated with correct or incorrect classification decisions. Only available for categorical responses. See Specify Profit Matrix.
If the Boosted Tree method is selected on the platform launch window, the Boosted Tree options window appears after clicking OK. Boosted Tree Options Window shows the options window for the Car Poll.jmp sample data table with sex as Y, Response, all other columns as X, Factor, and a Validation Portion of 0.2.
Learning Rate is a number r such that 0 < r ≤ 1. Learning rates close to 1 result in faster convergence on a final tree, but also have a higher tendency to overfit the data. Use learning rates closer to 1 when a small Number of Layers is specified.
Max Splits Per Tree is the upper end for Splits per Tree.
Max Learning Rate is the upper end for Learning Rate.
gives Decision Count and Decision Rate matrices for the training set and for validation and test sets (if defined). This report appears only if the response has a Profit Matrix column property or if you specify costs using the Specify Profit Matrix option. See Decision Matrix Report.
Show Trees is a submenu for displaying the Tree Views report. The report produces a picture of the tree at each stage of the boosting process. For details about the options, see Show Trees.
Column Contributions displays a report showing each input column’s contribution to the fit. The report also shows how many times it defined a split and the total G2 or Sum of Squares attributed to that column.
ROC Curve is described in the section ROC Curve. This is for categorical responses only.
Lift Curve is described in the section Lift Curves. This is for categorical responses only.
Save Predicteds saves the predicted values from the model to the data table.
Save Prediction Formula saves the prediction formula to a column in the data table.
Save Residuals saves the residuals to the data table. This is for continuous responses only.
Save Offset Estimates saves the sums of the linear components. These are the logits of the fitted probabilities. This is for categorical responses only.
Save Tree Details creates a data table containing split details and estimates for each stage.
Save Cumulative Details creates a data table containing the fit statistics for each stage. Only available if validation is used.
Make SAS DATA Step creates SAS code for scoring a new data set.
Enables you to specify profit or costs associated with correct or incorrect classification decisions. Only available for categorical responses. See Specify Profit Matrix.
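The role of the learning rate described above can be illustrated with a toy boosting loop in which each layer is just the mean of the current residuals, a stand-in for the small tree fit at each stage. This illustrates shrinkage only; it is not JMP’s boosting algorithm.

```python
def boost_means(y, n_layers, learning_rate):
    """Toy boosting loop: each 'layer' fits the mean of the current
    residuals; the learning rate shrinks each layer's contribution."""
    pred = [0.0] * len(y)
    for _ in range(n_layers):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        layer = sum(resid) / len(resid)      # stand-in for a small tree
        pred = [pi + learning_rate * layer for pi in pred]
    return pred

# A rate near 1 converges in a few layers; a small rate needs many more.
y = [10.0, 10.0, 10.0]
fast = boost_means(y, 5, 0.9)
slow = boost_means(y, 5, 0.1)
```

After m layers the prediction is 10·(1 − (1 − r)^m), so the high rate is essentially converged at 5 layers while the low rate is still far short, matching the guidance to use rates closer to 1 when Number of Layers is small.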
K Nearest Neighbors
The K Nearest Neighbors method enables you to predict values of a response variable based on the responses of the k nearest rows. This method differs from the other Partition methods in that the space is not partitioned by single-variable decision nodes. It also has some drawbacks: it does not produce fitted probabilities for categorical responses, and it does not produce a prediction formula that is practical for large problems.
The k nearest rows to a given row are determined by calculating the Euclidean distance between the row and each of the other rows. For a continuous response, the predicted value is the average of the responses for the k nearest rows. For a categorical response, the predicted value is the most frequent response level of the k nearest neighbors. If several levels are tied as the most frequent levels, responses are assigned from these levels at random.
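A minimal sketch of this prediction rule for a categorical response (my own helper, not JMP’s implementation; tied top levels are returned as a list rather than randomized, so the result is reproducible):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict a categorical response as the most frequent level among the
    k rows nearest (Euclidean distance) to the query row."""
    dist = sorted(
        (math.dist(row, query), y) for row, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in dist[:k])
    top = votes.most_common()
    best_count = top[0][1]
    # JMP would break ties among the most frequent levels at random;
    # here we simply return all the winners.
    return [lvl for lvl, c in top if c == best_count]

# Hypothetical two-cluster data: "A" near the origin, "B" near (5, 5).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
```

For a continuous response, the same neighbor search would end with the average of the k neighbors’ responses instead of a vote.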
If the K Nearest Neighbors method is selected on the platform launch window, after you click OK you must enter a value K for the maximum number of nearest neighbors. A model is computed for each value of k between 1 and K. The value chosen must be an integer between 1 and one less than the number of rows in the data table. After you specify a maximum k and click OK, the K Nearest Neighbors report appears.
For each response, the K Nearest Neighbors report provides summary information for the K models that are fit. It contains tables for the training set and for the validation and test sets, if you defined these using validation. The columns that appear in the summary table depend on the modeling type of the response Y. The number of rows in each table is equal to K.
When the response is categorical, the K Nearest Neighbors report also shows a confusion matrix for the smallest k value with the lowest Misclassification Rate. If you use validation, confusion matrices for the validation and test sets appear. A confusion matrix is a twoway classification of actual and predicted responses. Use the confusion matrices and the misclassification rates to guide your selection of a model.
K Nearest Neighbors Report for Tablet Production.jmp shows a K Nearest Neighbors report for predicting Lot Acceptance in the Tablet Production.jmp data table. The confusion matrix shown is for k = 3, which is the model with the lowest Misclassification Rate for the validation set.
K Nearest Neighbors Report for Tablet Production.jmp
Saves K predicted value columns to the data table. The columns are named Predicted <Y, Response> <k>. The kth column contains predictions for the model based on the k nearest neighbors, where Y, Response is the name of the response column.
Saves K columns to the data table. The columns are named RowNear <k>. For a given row, the kth column contains the row number of its kth nearest neighbor.
Caution: The row numbers in the columns RowNear <k> do not update when you reorder the rows in your data table. If you reorder the rows, the values in those columns are misleading.
Saves a column that contains a prediction formula for a specific k nearest neighbor model. Enter a value for k when prompted. The prediction formula contains all the training data, so this option might not be practical for large data tables.