Statistical Details

This section describes how the platform evaluates candidate splits and how it calculates the statistics shown in its reports.

The response can be either continuous or categorical (nominal or ordinal). If the response is categorical, the platform fits the probabilities estimated for the response levels, minimizing the residual log-likelihood chi-square (2 × entropy). If the response is continuous, the platform fits means, minimizing the sum of squared errors.

The factors can be either continuous or categorical (nominal or ordinal). If an X is continuous, the partition is made according to a splitting “cut” value for X. If an X is categorical, the platform divides the levels of X into two groups and considers all possible groupings into two groups.
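The two kinds of candidate splits can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, and placing cut values at midpoints between consecutive distinct X values is an assumption, since the text says only that a cut value is used.

```python
from itertools import combinations

def continuous_cuts(values):
    """Candidate cut values for a continuous X.

    Assumption: cuts are midpoints between consecutive distinct values.
    """
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def categorical_groupings(levels):
    """All ways to divide the levels of a categorical X into two non-empty groups."""
    levels = sorted(set(levels))
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level in the left group so each grouping is listed once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(lv for lv in rest if lv not in extra)
            splits.append((left, right))
    return splits

cuts = continuous_cuts([1, 2, 2, 4])
groupings = categorical_groupings(["a", "b", "c", "d"])
```

For an X with k levels this enumeration yields 2^(k-1) - 1 groupings, so the number of candidates grows quickly with the number of levels.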

Splitting Criterion

Node splitting is based on the LogWorth statistic, which is reported in node Candidate reports. LogWorth is calculated as follows:

$$\mathrm{LogWorth} = -\log_{10}(p\text{-value})$$

where the p-value is an adjusted p-value, calculated in a way that takes into account the number of different ways a split can occur. The adjustment is intended to be fair across Xs: the unadjusted p-value favors Xs with many levels, and the Bonferroni p-value favors Xs with few levels. Details on the method are discussed in the white paper “Monte Carlo Calibration of Distributions of Partition Statistics”, available on the JMP website at www.jmp.com.
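The transformation from p-value to LogWorth can be sketched as below. The function name is hypothetical, and the adjusted p-value is taken as an input because the adjustment itself is described only in the white paper.

```python
import math

def logworth(adjusted_p_value):
    """Return -log10 of an (already adjusted) p-value."""
    if not 0.0 < adjusted_p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    return -math.log10(adjusted_p_value)

lw_small = logworth(0.001)  # smaller p-values give larger LogWorth
lw_large = logworth(0.5)
```

On this scale a p-value of 0.01 corresponds to a LogWorth of about 2, so larger values indicate stronger candidate splits.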

For continuous responses, the Sum of Squares (SS) is reported in node reports. This is the change in the error sum-of-squares due to the split.

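The SS for a candidate split is

$$SS_{\mathrm{test}} = SS_{\mathrm{parent}} - (SS_{\mathrm{right}} + SS_{\mathrm{left}})$$

where the SS in a node is $s^2(n - 1)$, with $s^2$ the sample variance of the response in the node and $n$ the number of observations in the node.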
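As a worked illustration of the candidate SS, the sketch below computes the within-node sum of squares about the node mean, which equals the sample variance times (n - 1), and subtracts the two child values from the parent value. The function names and the data are hypothetical.

```python
def node_ss(y):
    """Sum of squared deviations from the node mean, equal to s^2 * (n - 1)."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def ss_test(y, goes_left):
    """Reduction in error sum of squares produced by a candidate split."""
    left = [v for v, flag in zip(y, goes_left) if flag]
    right = [v for v, flag in zip(y, goes_left) if not flag]
    return node_ss(y) - (node_ss(left) + node_ss(right))

y = [1, 2, 3, 10, 11, 12]
good_split = ss_test(y, [True, True, True, False, False, False])
poor_split = ss_test(y, [True, False, True, False, True, False])
```

A split that separates the low responses from the high ones removes most of the parent SS, while a split that mixes them removes very little.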

Also reported for continuous responses is the Difference statistic. This is the difference between the predicted values for the two child nodes of a parent node.

For categorical responses, $G^2$ (the likelihood-ratio chi-square) is shown in the report. This is twice the entropy (computed with natural logs) or twice the change in the entropy. Entropy is the sum of -log(p) over the observations, where p is the probability attributed to the response level that occurred.

The $G^2$ for a candidate split is

$$G^2_{\mathrm{test}} = G^2_{\mathrm{parent}} - (G^2_{\mathrm{left}} + G^2_{\mathrm{right}})$$
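The candidate G2 can be illustrated in the same way. This sketch assumes that the probability attributed to each observation is the training rate in its node (count divided by total) and uses natural logs. The function names and the data are hypothetical.

```python
import math
from collections import Counter

def node_g2(labels):
    """Twice the entropy of a node: 2 * sum of -ln(p) over its observations."""
    n = len(labels)
    counts = Counter(labels)
    # Every observation of a level contributes -ln(count / n).
    return 2.0 * sum(-c * math.log(c / n) for c in counts.values())

def g2_test(labels, goes_left):
    """Reduction in G2 produced by a candidate split."""
    left = [v for v, flag in zip(labels, goes_left) if flag]
    right = [v for v, flag in zip(labels, goes_left) if not flag]
    return node_g2(labels) - (node_g2(left) + node_g2(right))

labels = ["a", "a", "b", "b"]
pure_split = g2_test(labels, [True, True, False, False])
mixed_split = g2_test(labels, [True, False, True, False])
```

When both children are pure, their G2 is zero and the candidate G2 equals the parent G2. When each child has the same mix of levels as the parent, the candidate G2 is zero.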

Partition uses two rates: one, used for training, is the usual ratio of count to total; the other is slightly biased away from zero. Because the attributed probabilities are never zero, logs of probabilities can be calculated on validation or excluded sets of data, as needed for the Entropy RSquare statistics.

Predicted Probabilities in Decision Tree and Bootstrap Forest

The predicted probabilities for the Decision Tree and Bootstrap Forest methods are calculated as described below by the Prob statistic.

For categorical responses in Decision Tree, the Show Split Prob command shows the following statistics:

Rate

is the proportion of observations at the node for each response level.

Prob

is the predicted probability for that node of the tree. The method for calculating Prob for the ith response level at a given node is as follows:

$$\mathrm{Prob}_i = \frac{n_i + \mathrm{prior}_i}{\sum_j (n_j + \mathrm{prior}_j)}$$

where the summation is across all response levels; $n_i$ is the number of observations at the node for the ith response level; and $\mathrm{prior}_i$ is the prior probability for the ith response level, calculated as

$$\mathrm{prior}_i = \lambda p_i + (1 - \lambda) P_i$$

where $p_i$ is the $\mathrm{prior}_i$ from the parent node, $P_i$ is the $\mathrm{Prob}_i$ from the parent node, and λ is a weighting factor currently set at 0.9.

The estimate, Prob, is the same as the one that would be obtained from a Bayesian estimate of a multinomial probability parameter with a conjugate Dirichlet prior.

The method for calculating Prob ensures that the predicted probabilities are always nonzero.
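The calculation of Prob at a single node can be sketched as follows. The sketch assumes that Prob for a level is its count plus its prior, divided by the sum of counts plus priors over all levels, which is the form implied by the Dirichlet description above. The function name is hypothetical, and the parent prior and parent Prob are supplied by the caller because the values used at the root node are not stated here.

```python
def node_prob(counts, parent_prior, parent_prob, lam=0.9):
    """Return (prior, prob) for one node.

    counts       -- observations at the node for each response level
    parent_prior -- prior for each level at the parent node
    parent_prob  -- Prob for each level at the parent node
    lam          -- weighting factor (0.9 per the text)
    """
    prior = [lam * p + (1 - lam) * big_p
             for p, big_p in zip(parent_prior, parent_prob)]
    total = sum(n + q for n, q in zip(counts, prior))
    prob = [(n + q) / total for n, q in zip(counts, prior)]
    return prior, prob

# Hypothetical node in which the second level was never observed.
counts = [5, 0]
prior, prob = node_prob(counts, parent_prior=[0.5, 0.5], parent_prob=[0.6, 0.4])
rate = [n / sum(counts) for n in counts]
```

In this example the Rate for the second level is zero, but its Prob stays positive because the prior contributes to the numerator, so the log of the probability can still be taken.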