As an example for a continuous response, use the Boston Housing.jmp data table. Assign mvalue to the Y, Response role. Assign all the other variables to the X, Factor role. Set the Validation Portion to 0 so that your results match those shown here. If using JMP Pro, select Decision Tree from the Method menu. Click OK. The initial report displays the partition graph, control buttons, a summary panel, and the first node of the tree (Decision Tree Initial Report for a Continuous Response).
Decision Tree Initial Report for a Continuous Response
The Split button is used to partition the data, creating a tree of partitions. Repeatedly splitting the data results in branches and leaves of the tree. This can be thought of as growing the tree. The Prune button is used to combine the most recent split back into one group.
Initially, all rows are in one branch. For each column, the Candidates report gives details about the optimal split. To determine the split, the platform considers each X column and all possible splits for that column. The columns of the Candidates report include:
LogWorth is the LogWorth statistic, defined as -log10(p-value). The optimal split is the one that maximizes the LogWorth. See Statistical Details for additional details.
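As a quick illustration of the LogWorth definition, the following Python sketch converts a p-value to a LogWorth. The function name logworth is hypothetical and is not part of JMP; the p-value is assumed to be supplied by the caller.

```python
import math

def logworth(p_value):
    """Return -log10(p_value); smaller p-values give larger LogWorth values."""
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -math.log10(p_value)

# A p-value of 0.01 corresponds to a LogWorth of about 2.
print(logworth(0.01))
```

Because the transformation is monotone decreasing in the p-value, maximizing the LogWorth is equivalent to choosing the split with the smallest p-value.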
As shown in Candidates Report (Continuous Response), the rooms column has the largest LogWorth and therefore defines the optimal split. The Cut Point value of 6.943 indicates that the split is into the nodes rooms < 6.943 and rooms ≥ 6.943.
Candidates Report (Continuous Response)
Click the Split button and notice the first split is made on the column rooms, at a value of 6.943. Open the two new candidate reports (First Split (Continuous Response)).
First Split (Continuous Response)
The original 506 observations are now split into two parts:
A left leaf, corresponding to rooms < 6.943, has 430 observations.
A right leaf, corresponding to rooms ≥ 6.943, has 76 observations.
For the left leaf, the next split would happen on the column lstat, which has an SS of 7,311.85. For the right leaf, the next split would happen on the column rooms, which has an SS of 3,060.96. Because the SS for the left leaf is higher, using the Split button again produces a split on the left leaf, on the column lstat.
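The selection logic described above can be sketched in Python. This is a simplified illustration, not JMP's implementation: it scores each candidate cut point by the reduction in the sum of squares only, and the function names best_split and next_leaf_to_split are hypothetical. The two SS values in the usage example are the ones quoted in the text.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (cut_point, ss_reduction) for the best split x < cut versus x >= cut."""
    total = sum_of_squares(y)
    best_cut, best_gain = None, 0.0
    for cut in sorted(set(x))[1:]:  # skip the minimum so that both sides are non-empty
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        gain = total - sum_of_squares(left) - sum_of_squares(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

def next_leaf_to_split(candidate_ss):
    """Given {leaf: (column, SS)}, return the leaf whose best candidate has the largest SS."""
    return max(candidate_ss, key=lambda leaf: candidate_ss[leaf][1])

# SS values quoted in the text for the two leaves after the first split.
candidates = {
    "rooms < 6.943": ("lstat", 7311.85),
    "rooms >= 6.943": ("rooms", 3060.96),
}
print(next_leaf_to_split(candidates))
```

In this sketch the left leaf is chosen because its best candidate has the larger SS, which matches the behavior of the Split button described above.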
Click the Split button to make the next split (Second Split (Continuous Response)).
Second Split (Continuous Response)
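The 430 observations from the previous left leaf are now split into two parts:
A left leaf, corresponding to lstat ≥ 14.43, has 175 observations.
A right leaf, corresponding to lstat < 14.43, has 255 observations.
The 506 original observations are now split into three parts:
A leaf corresponding to rooms < 6.943 and lstat ≥ 14.43.
A leaf corresponding to rooms < 6.943 and lstat < 14.43.
A leaf corresponding to rooms ≥ 6.943.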
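The three leaves can be written as a pair of nested conditions. The following Python sketch uses the cut points from the example; the function name leaf_for and the label strings are hypothetical, and missing values are not handled.

```python
def leaf_for(rooms, lstat):
    """Return a label for the leaf that a row falls into after two splits."""
    if rooms < 6.943:
        if lstat >= 14.43:
            return "rooms < 6.943 & lstat >= 14.43"
        return "rooms < 6.943 & lstat < 14.43"
    return "rooms >= 6.943"

# A row with few rooms and a high lstat value falls into the first leaf.
print(leaf_for(rooms=6.0, lstat=20.0))
```

Note that lstat is consulted only for rows in the left branch, because the right leaf (rooms ≥ 6.943) has not been split further.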
If validation is not used, Partition is a purely interactive, exploratory platform intended to help you investigate relationships. Click the Split button to perform splits one at a time. To specify multiple splits at once, hold the Shift key as you click the Split button.
As an example for a categorical response, use the Car Poll.jmp data table. Assign country to the Y, Response role. Assign all the other variables to the X, Factor role. Set the Validation Portion to 0 so that your results agree with those shown here. If using JMP Pro, select Decision Tree from the Method menu. Click OK.
In the report, select Display Options > Show Split Prob. Click Split twice. The report is shown in Decision Tree Report for a Categorical Response.
The G² statistic is given instead of the Mean and Std Dev at the top of each leaf, and instead of SS in the Candidates report. See Statistical Details for more information about G².
The Prob statistic is the predicted value (a probability) for each response level. See Statistical Details for more information about the Prob statistic. (Select Display Options > Show Split Prob.)
The Color Points button appears. This colors the points on the plot according to the response levels.
Decision Tree Report for a Categorical Response
Window for the Split Specific Command
The Split at menu has the following options:
Optimal Value splits at the optimal value of the selected variable.
Specified Value enables you to specify the level where the split takes place.
Output Split Table produces a data table showing all possible splits and their associated split value.
Show Points shows or hides the points. For categorical responses, this option shows the points or colored panels.
Show Tree shows or hides the large tree of partitions.
Show Graph shows or hides the partition graph.
Show Split Bar shows or hides the colored bars showing the split proportions in each leaf. This is for categorical responses only.
Show Split Stats shows or hides the split statistics. See Statistical Details for more information about the categorical split statistic G².
Show Split Prob shows or hides the Rate and Prob statistics. This is for categorical responses only.
JMP automatically shows the Rate and Prob statistics when you select Show Split Count. See Statistical Details for more information about Rate and Prob.
Show Split Count shows or hides each frequency level for all nodes in the tree. This is for categorical responses only.
When you select this option, JMP automatically selects Show Split Prob. If you then deselect Show Split Prob, the counts are hidden as well.
Show Split Candidates shows or hides the Candidates report.
Sort Split Candidates sorts the Candidates report by the statistic or the LogWorth, whichever is appropriate. This option can be turned on and off. When off, it does not change any existing reports, but new Candidates reports are listed in the order in which the X terms were specified, rather than sorted by a statistic.
Split Best splits the tree at the optimal split point. This is the same action as the Split button.
Split History shows a plot of R² versus the number of splits. If you use validation, separate curves are drawn for training and validation R².
ROC Curve is described in the section ROC Curve. This is for categorical responses only.
Lift Curve is described in the section Lift Curves. This is for categorical responses only.
Entropy RSquare compares the log-likelihoods from the fitted model and the constant probability model.
Generalized RSquare is a generalization of the Rsquare measure that simplifies to the regular Rsquare for continuous normal responses. It is similar to the Entropy RSquare, but instead of using the log-likelihood, it uses the 2/n root of the likelihood. It is scaled to have a maximum of 1. The value is 1 for a perfect model, and 0 for a model no better than a constant model.
Mean -Log p is the average of -log(p), where p is the fitted probability associated with the event that occurred.
RMSE is the root mean square error, where the differences are between the response and p (the fitted probability for the event that actually occurred).
Mean Abs Dev is the average of the absolute values of the differences between the response and p (the fitted probability for the event that actually occurred).
Misclassification Rate is the rate for which the response category with the highest fitted probability is not the observed category.
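The measures in this list can be sketched in Python as follows. This is an illustration under stated assumptions, not JMP's code: the function name fit_statistics and the dictionary keys are hypothetical, the constant probability model is taken to be the observed proportion of each level, and the "response" in RMSE and Mean Abs Dev is taken to be 1 for the level that actually occurred, so each difference is 1 - p. The sketch assumes at least two distinct observed levels and no fitted probability of exactly 0 for an observed level.

```python
import math

def fit_statistics(probs, observed):
    """Fit measures for a categorical response.

    probs: one dict per row, mapping each response level to its fitted probability.
    observed: the observed response level for each row.
    """
    n = len(observed)
    # Fitted probability of the level that actually occurred, per row.
    p_actual = [p[level] for p, level in zip(probs, observed)]

    # Log-likelihoods of the fitted model and of the constant probability model.
    loglik_model = sum(math.log(p) for p in p_actual)
    counts = {}
    for level in observed:
        counts[level] = counts.get(level, 0) + 1
    loglik_const = sum(c * math.log(c / n) for c in counts.values())

    # Predicted level is the one with the highest fitted probability.
    predicted = [max(p, key=p.get) for p in probs]

    return {
        "entropy_rsquare": 1 - loglik_model / loglik_const,
        # Uses the 2/n root of the likelihood ratio, scaled to a maximum of 1.
        "generalized_rsquare": (1 - math.exp(2 * (loglik_const - loglik_model) / n))
        / (1 - math.exp(2 * loglik_const / n)),
        "mean_neg_log_p": -loglik_model / n,
        "rmse": math.sqrt(sum((1 - p) ** 2 for p in p_actual) / n),
        "mean_abs_dev": sum(abs(1 - p) for p in p_actual) / n,
        "misclassification_rate": sum(
            pred != obs for pred, obs in zip(predicted, observed)
        ) / n,
    }

# Hypothetical fitted probabilities for four rows with levels "a" and "b".
example_probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.8, "b": 0.2},
    {"a": 0.3, "b": 0.7},
    {"a": 0.6, "b": 0.4},
]
example_stats = fit_statistics(example_probs, ["a", "a", "b", "b"])
print(example_stats)
```

In the example, the fourth row is the only one whose most probable level differs from the observed level, so one of the four rows is misclassified.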
Save Residuals saves the residual values from the model to the data table.
Save Predicteds saves the predicted values from the model to the data table.
Save Leaf Numbers saves the leaf numbers of the tree to a column in the data table.
Save Leaf Labels saves leaf labels of the tree to the data table. The labels document each branch that the row would trace along the tree, with each branch separated by “&”. An example label could be “size(Small,Medium)&size(Small)”. However, JMP does not include redundant information in the form of category labels that are repeated. A category label for a leaf might refer to an inclusive list of categories in a higher tree node. A caret (“^”) appears where the tree node with redundant labels occurs. Therefore, “size(Small,Medium)&size(Small)” is presented as “^&size(Small)”.
Save Prediction Formula saves the prediction formula to a column in the data table. The formula consists of nested conditional clauses that describe the tree structure. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property.
Save Tolerant Prediction Formula saves a formula that predicts even when there are missing values and when Informative Missing has not been checked. The prediction formula tolerates missing values by randomly allocating response values for missing predictors to a split. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property. If you have checked Informative Missing, you can save the Tolerant Prediction Formula by holding the Shift key as you click on the report’s red triangle.
Save Leaf Number Formula saves a column containing a formula in the data table that computes the leaf number.
Save Leaf Label Formula saves a column containing a formula in the data table that computes the leaf label.
Make SAS DATA Step creates SAS code for scoring a new data set.
When you select Save Columns > Save Prediction Formula from the report’s red triangle menu, additional columns with formulas are saved to the data table. These columns are:
Profit for <level>: For each level of the response, a column gives the expected profit for classifying each observation into that level.
Most Profitable Prediction for <column name>: For each observation, gives the level of the response with the highest expected profit.
Expected Profit for <column name>: For each observation, gives the expected profit for the classification defined by the Most Profitable Prediction column.
Actual Profit for <column name>: For each observation, gives the actual profit for classifying that observation into the level specified by the Most Profitable Prediction column.
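The relationship among these columns can be sketched in Python. This is an illustration only: the function name profit_columns, the dictionary keys, and the profit values in the usage example are hypothetical, and the profit matrix is assumed to be indexed as profit[actual level][decision level].

```python
def profit_columns(probs, profit, actual=None):
    """Compute the profit quantities for one observation.

    probs: dict mapping each response level to its fitted probability.
    profit: nested dict, profit[actual_level][decision_level].
    actual: the observed level, if known.
    """
    # Expected profit of classifying the observation into each level.
    profit_for = {
        decision: sum(probs[level] * profit[level][decision] for level in probs)
        for decision in probs
    }
    most_profitable = max(profit_for, key=profit_for.get)
    result = {
        "profit_for": profit_for,
        "most_profitable_prediction": most_profitable,
        "expected_profit": profit_for[most_profitable],
    }
    if actual is not None:
        # Profit actually realized by the chosen classification.
        result["actual_profit"] = profit[actual][most_profitable]
    return result

# Hypothetical two-level response with a made-up profit matrix.
example = profit_columns(
    probs={"yes": 0.3, "no": 0.7},
    profit={"yes": {"yes": 10, "no": 0}, "no": {"yes": -2, "no": 0}},
    actual="no",
)
print(example)
```

In this example, classifying the observation as "yes" has the higher expected profit even though "no" is the more probable level, which is why the Most Profitable Prediction can differ from the most likely level.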
The Go button (shown in The Go Button) appears when you enable validation. For more information about using validation, see Validation.
The Go button lets you split repeatedly without clicking the Split button each time. When you click the Go button, the platform continues splitting until the validation R-Square is better than what the next 10 splits would obtain. This rule might produce complex trees that are not very interpretable, but have good predictive power.
Using the Go button turns on the Split History command. If using the Go button results in a tree with more than 40 nodes, the Show Tree command is turned off.
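One way to read the stopping rule for the Go button is sketched below in Python. This is an interpretation, not JMP's implementation: the function name go_stopping_point is hypothetical, and the validation R-Square after each number of splits is assumed to be available as a list.

```python
def go_stopping_point(validation_r2, lookahead=10):
    """Return the number of splits at which the rule stops.

    validation_r2[k] is the validation R-Square after k splits. The rule stops
    at the first k whose value is better than every value obtained by the
    next `lookahead` splits, or at the last available split.
    """
    if not validation_r2:
        raise ValueError("validation_r2 must not be empty")
    for k, r2 in enumerate(validation_r2):
        ahead = validation_r2[k + 1 : k + 1 + lookahead]
        if not ahead or r2 > max(ahead):
            return k
    return len(validation_r2) - 1

# Hypothetical validation R-Square values: a dip at 2 splits is followed by
# an improvement at 3 splits, so the rule does not stop at the dip.
history = [0.0, 0.30, 0.29, 0.40] + [0.35] * 10
print(go_stopping_point(history))
```

Looking ahead several splits keeps the rule from stopping at a temporary dip in the validation R-Square, at the cost of growing and then discarding some extra splits.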
The Go Button
Another word for bootstrap-averaging is bagging. Those observations included in the growing of a tree are called the in-bag sample, abbreviated IB. Those not included are called the out-of-bag sample, abbreviated OOB.
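The in-bag and out-of-bag samples can be illustrated with a short Python sketch. It assumes the textbook bootstrap, in which as many rows as the table contains are drawn with replacement; the function name bootstrap_sample is hypothetical, and any sampling options that JMP applies are not modeled.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows never drawn form the out-of-bag (OOB) sample.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

in_bag, out_of_bag = bootstrap_sample(10, random.Random(1))
print("IB rows: ", sorted(in_bag))
print("OOB rows:", out_of_bag)
```

Because rows are drawn with replacement, some rows appear more than once in the in-bag sample and others do not appear at all; the latter can be used to assess the tree grown from that sample.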
If the Bootstrap Forest method is selected on the platform launch window, the Bootstrap Forest options window appears after clicking OK. Bootstrap Forest Fitting Options shows the window using the Car Poll.jmp data table. The column country is used as the response, and the other columns are used as the predictors.
Bootstrap Forest Fitting Options
Max Number of Terms is the maximum number of terms to consider for a split.
Bootstrap Forest
None does not display the Tree Views Report.
Show names displays the trees labeled with the splitting columns.
Show names categories displays the trees labeled with the splitting columns and splitting values.
Show names categories estimates displays the trees labeled with the splitting columns, splitting values, and summary statistics for each node.
ROC Curve is described in the section ROC Curve. This is for categorical responses only.
Lift Curve is described in the section Lift Curves. This is for categorical responses only.
Save Predicteds saves the predicted values from the model to the data table.
Save Prediction Formula saves the prediction formula to a column in the data table.
Save Residuals saves the residuals to the data table. This is for continuous responses only.
Save Cumulative Details creates a data table containing the fit statistics for each tree. This option is available only if validation is used.
Make SAS DATA Step creates SAS code for scoring a new data set.
If the Boosted Tree method is selected on the platform launch window, the Boosted Tree options window appears after clicking OK. Boosted Tree Options Window shows the options window for the Car Poll.jmp sample data table with country as Y, Response, all other columns as X, Factor, and a Validation Portion of 0.2.
Boosted Tree Options Window
Learning Rate is a number r such that 0 < r ≤ 1. Learning rates close to 1 result in faster convergence on a final tree, but also have a higher tendency to overfit data. Use learning rates closer to 1 when a small Number of Layers is specified.
Max Splits Per Tree is the upper end for Splits per Tree.
Max Learning Rate is the upper end for Learning Rate.
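The effect of the learning rate can be illustrated with a small Python sketch of boosting for a continuous response. This is a simplified stand-in, not JMP's Boosted Tree implementation: each layer is a single-split tree fit to the current residuals on one predictor, and the names fit_stump, boost, and sse, as well as the data, are hypothetical. The predictor is assumed to have at least two distinct values.

```python
def fit_stump(x, resid):
    """Best single split of resid on x: returns (cut, left_mean, right_mean)."""
    best = None
    for cut in sorted(set(x))[1:]:  # both sides are non-empty for these cuts
        left = [r for xi, r in zip(x, resid) if xi < cut]
        right = [r for xi, r in zip(x, resid) if xi >= cut]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, cut, left_mean, right_mean)
    return best[1:]

def boost(x, y, layers, rate):
    """Start from the mean of y, then add `layers` stumps scaled by `rate`."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(layers):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        cut, left_mean, right_mean = fit_stump(x, resid)
        pred = [
            pi + rate * (left_mean if xi < cut else right_mean)
            for xi, pi in zip(x, pred)
        ]
    return pred

def sse(y, pred):
    """Sum of squared errors on the training data."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

# Made-up data: compare training error after a few layers at two learning rates.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 2, 5, 6, 9]
for rate in (0.1, 1.0):
    print(rate, sse(y, boost(x, y, layers=3, rate=rate)))
```

With only a few layers, the larger learning rate reduces the training error faster. The sketch measures training error only, so it does not show the overfitting risk mentioned above, which is why validation is useful when choosing the rate.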
Boosted Tree Report
ROC Curve is described in the section ROC Curve. This is for categorical responses only.
Lift Curve is described in the section Lift Curves. This is for categorical responses only.
Save Predicteds saves the predicted values from the model to the data table.
Save Prediction Formula saves the prediction formula to a column in the data table.
Save Residuals saves the residuals to the data table. This is for continuous responses only.
Save Offset Estimates saves the sums of the linear components. These are the logits of the fitted probabilities. This is for categorical responses only.
Save Tree Details creates a data table containing split details and estimates for each stage.
Save Cumulative Details creates a data table containing the fit statistics for each stage. This option is available only if validation is used.
Make SAS DATA Step creates SAS code for scoring a new data set.
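The relationship between an offset estimate and a fitted probability, as described for Save Offset Estimates, can be sketched in Python for the simplest case. The sketch assumes a two-level response, where the logit is the log odds of one level; the function names are hypothetical, and responses with more than two levels are not covered.

```python
import math

def offset_to_prob(offset):
    """Convert a logit (offset estimate) to a fitted probability."""
    return 1.0 / (1.0 + math.exp(-offset))

def prob_to_offset(prob):
    """Convert a fitted probability strictly between 0 and 1 to its logit."""
    return math.log(prob / (1.0 - prob))

# An offset of 0 corresponds to a fitted probability of 0.5.
print(offset_to_prob(0.0))
```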