When a categorical effect has k levels, where k > 2, then it must be represented by k-1 columns.
In the Stepwise platform, categorical variables (nominal and ordinal) are coded in a hierarchical fashion. This differs from coding in other least squares fitting platforms. In hierarchical coding, the levels of the categorical variable are successively split into groups of levels that most separate the means of the response. The splitting process achieves the goal of representing a k-level categorical variable by k - 1 terms.
For a nominal variable with k levels, the k levels are split into the two groups of levels that have the maximum sum of squares between groups (SSB). Call these two groups A1 and A2, where A1 has the smaller mean response and A2 has the larger mean response. The two groups define an indicator variable with the value 1 for the levels in A1 and -1 for the levels in A2. This variable is the first hierarchical term for the nominal variable.
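The first split can be illustrated with a minimal Python sketch. It is not JMP's implementation: it simply tries every two-group partition of the levels and keeps the one with the largest SSB. The function names `ssb` and `best_split` and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def ssb(groups):
    """Between-group sum of squares for a list of groups of response values."""
    values = [y for g in groups for y in g]
    grand_mean = sum(values) / len(values)
    return sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

def best_split(responses):
    """Split the levels into (A1, A2) with maximum SSB; A1 has the smaller mean.

    `responses` maps each level name to its list of response values
    and must contain at least two levels.
    """
    levels = sorted(responses)
    first, rest = levels[0], levels[1:]
    best = None
    # Keep `first` on one side so that each partition is tried exactly once.
    # r < len(rest) guarantees that the other side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            side = (first,) + extra
            other = tuple(lv for lv in levels if lv not in side)
            pooled = [[y for lv in grp for y in responses[lv]]
                      for grp in (side, other)]
            score = ssb(pooled)
            if best is None or score > best[0]:
                best = (score, side, other, pooled)
    score, side, other, pooled = best
    mean_side = sum(pooled[0]) / len(pooled[0])
    mean_other = sum(pooled[1]) / len(pooled[1])
    a1, a2 = (side, other) if mean_side <= mean_other else (other, side)
    return a1, a2, score

# Hypothetical toy data: two low-response levels and two high-response levels.
toy = {"a": [1, 2], "b": [1.5, 2.5], "c": [10, 11], "d": [9, 12]}
a1, a2, score = best_split(toy)
print(a1, a2, score)
```

Repeating the same search within A1 and within A2 would produce the later terms, although the order in which JMP carries out those later splits is not specified here.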
When you use the Combine rule or the Restrict rule, a term cannot enter the model unless all the terms above it in the hierarchy have been entered. When you use the Whole Effects rule and enter a term for a categorical variable, all of its associated terms are entered. For an example, see Construction of Hierarchical Terms in Example.
This example uses data on movies that were released in 2011. You are particularly interested in the World Gross values, which represent the gross receipts. Your potential predictors are Rotten Tomatoes Score, Audience Score, and Genre. The two score variables are continuous, but Genre is nominal. Before you attempt to reduce your model using Stepwise, you want to explore the variables of interest.
1. Select Help > Sample Data Library and open Hollywood Movies.jmp.
2. Select Analyze > Distribution.
3. Select Genre and click Y, Columns.
4. Click OK.
Distribution of Genre
Note that Genre has nine levels, and so would be represented by eight model terms. Further data exploration will reveal that, because of missing data, only eight levels are considered by Stepwise.
5. In the data table’s Columns panel, select the columns of interest: Rotten Tomatoes Score, Audience Score, Genre, and World Gross.
6. Select Cols > Modeling Utilities > Explore Missing Values.
Missing Columns Report
Note that Rotten Tomatoes Score is missing in 2 rows, Audience Score is missing in 1 row, and World Gross is missing in 2 rows.
8. Click Select Rows.
9. Select Analyze > Fit Model.
10. Select Rotten Tomatoes Score, Audience Score, and Genre and click Add.
If you fit a standard least squares model to World Gross using Rotten Tomatoes Score, Audience Score, and Genre as predictors, the residuals are highly heteroscedastic. (This is typical of financial data.) Use a log transformation to better satisfy the regression assumption of equal variance.
11. Right-click on World Gross in the Select Columns list and select Transform > Log.
The transformed variable Log[World Gross] appears at the bottom of the Select Columns list.
12. Select Log[World Gross] and click Y.
13. Select Stepwise from the Personality list.
14. Click Run.
Current Estimates Table Showing List of Model Terms
In the Current Estimates table, note that Genre is represented by 7 terms. You will construct a model using two of these to see how these terms are defined.
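15. Check the boxes under Entered next to the first two terms for Genre: Genre{Drama&Thriller&Horror&Fantasy&Romance&Comedy-Action&Animation} and Genre{Drama-Thriller&Horror&Fantasy&Romance&Comedy}.
16. Click Make Model.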
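The reason for modeling Log[World Gross] rather than World Gross can be illustrated with a small Python sketch. The numbers are hypothetical toy values, not taken from Hollywood Movies.jmp, and the sketch compares the spread of two groups rather than regression residuals. It assumes that the spread of the response grows in proportion to its level, which is the situation a log transformation is meant to address.

```python
import math
import statistics

# Hypothetical toy values: the second group is the first one scaled by 100,
# so its spread grows along with its level.
small = [1.0, 2.0, 4.0]
large = [100.0, 200.0, 400.0]

# Standard deviation of each group on the raw scale and on the log scale.
raw_sd = [statistics.pstdev(g) for g in (small, large)]
log_sd = [statistics.pstdev([math.log(y) for y in g]) for g in (small, large)]

print("raw scale:", raw_sd)
print("log scale:", log_sd)
```

On the raw scale the two spreads differ by a factor of 100. On the log scale they agree, because scaling by a constant only shifts the logged values.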
Recall that, because of missing values, Stepwise treats Genre as a nominal variable with eight levels. Accordingly, the Current Estimates table represents Genre by seven terms (k - 1 = 7). The first two terms that represent Genre are described below. Subsequent terms are defined in a similar fashion.
The first term that appears is Genre{Drama&Thriller&Horror&Fantasy&Romance&Comedy-Action&Animation}. This variable has the form Genre{A1 - A2}, where A1 and A2 are separated by a minus sign and the levels within each group are joined by ampersands. The notation indicates that the maximum separation in terms of sum of squares between groups occurs between the following two sets of levels: Drama, Thriller, Horror, Fantasy, Romance, and Comedy (A1) versus Action and Animation (A2).
If you include the term Genre{Drama&Thriller&Horror&Fantasy&Romance&Comedy-Action&Animation} in a model, a column representing that term is added to the data table. In the example, you saved this column to the data table. Following the coding described earlier, the column has the value 1 for the levels in A1 and -1 for the levels in A2.
The second term that appears is Genre{Drama-Thriller&Horror&Fantasy&Romance&Comedy}. Both of its groups are entirely contained in the first group (A1) of the first term. The notation contrasts Drama with the levels Thriller, Horror, Fantasy, Romance, and Comedy.
Tree Showing Splits Used in Hierarchical Coding
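The term notation can be decoded mechanically. The following Python sketch parses a term name of the form Name{A1-A2} and returns the coded value for a given level: 1 for levels in A1 and -1 for levels in A2. The function `code_term` is hypothetical. Two assumptions go beyond the text: level names contain no hyphen or ampersand, and a level that belongs to neither group is coded as 0.

```python
def code_term(term, level, outside=0):
    """Return the coded value of `level` for a term written as Name{A1-A2}.

    Assumes level names contain no '-' or '&'. The `outside` value used for
    levels that appear in neither group is an assumption, not documented here.
    """
    body = term[term.index("{") + 1 : term.rindex("}")]
    a1_text, a2_text = body.split("-", 1)  # the minus sign separates A1 from A2
    a1 = a1_text.split("&")                # ampersands join levels in a group
    a2 = a2_text.split("&")
    if level in a1:
        return 1
    if level in a2:
        return -1
    return outside

first = "Genre{Drama&Thriller&Horror&Fantasy&Romance&Comedy-Action&Animation}"
second = "Genre{Drama-Thriller&Horror&Fantasy&Romance&Comedy}"

for genre in ("Drama", "Comedy", "Action"):
    print(genre, code_term(first, genre), code_term(second, genre))
```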
When you use the Combine rule or the Restrict rule, a term cannot enter the model unless all the terms above it in the hierarchy have been entered. For example, if you enter Genre{Action-Animation}, then JMP will enter Genre{Drama&Thriller&Horror&Fantasy&Romance&Comedy-Action&Animation} as well.
When you use the Whole Effects rule and enter any one of the Genre terms, all of the Genre terms are entered.
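The three rules can be summarized in a short Python sketch. It is only an illustration of the behavior described above, not JMP's implementation. The function `terms_entered` and the `parent` mapping are hypothetical, and the mapping lists only the three Genre terms named in this example (the full effect has seven).

```python
def terms_entered(term, rule, parent, entered):
    """Return the set of terms that join the model when `term` is requested.

    `parent` maps each term of one effect to the term above it in the
    hierarchy (None for the top term). `entered` is the set already in the model.
    """
    if rule == "Whole Effects":
        # Entering any term enters every term of the effect.
        return set(parent) - set(entered)
    # Collect the terms above `term` that are not yet in the model.
    missing = []
    node = parent[term]
    while node is not None:
        if node not in entered:
            missing.append(node)
        node = parent[node]
    if rule == "Restrict":
        # The term may enter only after everything above it has entered.
        return {term} if not missing else set()
    if rule == "Combine":
        # The term enters together with any terms above it that are missing.
        return {term, *missing}
    raise ValueError(f"unknown rule: {rule}")

root = "Genre{Drama&Thriller&Horror&Fantasy&Romance&Comedy-Action&Animation}"
drama = "Genre{Drama-Thriller&Horror&Fantasy&Romance&Comedy}"
action = "Genre{Action-Animation}"
parent = {root: None, drama: root, action: root}  # partial hierarchy only

print(terms_entered(action, "Combine", parent, set()))
print(terms_entered(action, "Restrict", parent, set()))
print(terms_entered(action, "Whole Effects", parent, set()))
```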
A simple model examines the cost per ounce ($/oz) of hot dogs as a function of the Type of hot dog (Meat, Beef, Poultry) and the Size of the hot dog (Jumbo, Regular, Hors d’oeuvre).
1. Select Help > Sample Data Library and open Hot Dogs2.jmp.
2. Select Analyze > Fit Model.
3. Select $/oz and click Y.
4. Select Type and Size and click Add.
5. For Personality, select Stepwise.
6. Click Run.
7. For Stopping Rule, select P-value Threshold.
8. For Rules, select Restrict.
Stepwise Control Panel with P-value Threshold and Restrict Rule
Notice that when you change from the default Rule of Combine to Restrict, the F Ratio and Prob > F values for two terms are shown as missing. These are the terms Type{Poultry-Meat} and Size{Regular-Jumbo}. This is because these two terms cannot enter the model until their precedent terms enter.
9. Click Step.
The term Type{Poultry&Meat-Beef} enters the model. This term has the smallest Prob>F value, and that value falls below the Prob to Enter threshold of 0.25.
Stepwise Control Panel with One Term Entered
The F Ratio and Prob > F values for the term Type{Poultry-Meat} appear. Since its precedent term has entered the model, Type{Poultry-Meat} is now allowed to enter.
10. Click Step.
Since Type{Poultry-Meat} has the smallest Prob>F value among the remaining terms, and that value is below the Prob to Enter threshold, it is the next term to enter the model.
11. Click Step.
The term Size{Hors d'oeuvre-Regular&Jumbo} enters the model, since its Prob>F value of 0.1577 is below the Prob to Enter threshold of 0.25. Because its precedent term is now in the model, the term Size{Regular-Jumbo} is allowed to enter the model and its Prob>F value appears.
However, the Prob>F value for the term Size{Regular-Jumbo} is 0.7566, which exceeds the Prob to Enter value of 0.25. For this reason, if you click Step again, it is not entered into the model.
Current Estimates Report for the Final Model
12. Click Make Model.
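The stepping behavior in this example can be sketched in Python as follows. The function `next_term_restrict` is hypothetical and only mirrors the logic described above for the Restrict rule with a P-value Threshold stopping rule. It treats the Prob>F values as fixed inputs, whereas Stepwise recomputes them after every step. The two values used are the ones quoted above for the Size terms, with the Type terms assumed to be in the model already.

```python
def next_term_restrict(p_values, parent, entered, prob_to_enter=0.25):
    """Return the next term to enter under the Restrict rule, or None.

    A term is eligible only if it is not yet in the model and its precedent
    term (if any) has entered. Among eligible terms, the one with the smallest
    Prob>F enters, provided that value is below `prob_to_enter`.
    """
    eligible = [
        t for t in p_values
        if t not in entered and (parent[t] is None or parent[t] in entered)
    ]
    if not eligible:
        return None
    best = min(eligible, key=lambda t: p_values[t])
    return best if p_values[best] < prob_to_enter else None

hors = "Size{Hors d'oeuvre-Regular&Jumbo}"
regular = "Size{Regular-Jumbo}"
parent = {hors: None, regular: hors}
p_values = {hors: 0.1577, regular: 0.7566}  # Prob>F values quoted in the text

first = next_term_restrict(p_values, parent, set())
second = next_term_restrict(p_values, parent, {hors})
print(first)
print(second)
```

The first call selects the Hors d'oeuvre term because Size{Regular-Jumbo} is not yet eligible. The second call returns None because 0.7566 exceeds the Prob to Enter value.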