Publication date: 08/13/2020

This example uses data on movies that were released in 2011. You are particularly interested in the World Gross values, which represent the gross receipts. Your potential predictors are Rotten Tomatoes Score, Audience Score, and Genre. The two score variables are continuous, but Genre is nominal. Before you attempt to reduce your model using Stepwise, you want to explore the variables of interest.

1. Select Help > Sample Data Library and open Hollywood Movies.jmp.

2. Select Analyze > Distribution.

3. Select Genre and click Y, Columns.

4. Click OK.

Figure 5.14 Distribution of Genre

Note that Genre has nine levels, so it would ordinarily be represented by eight model terms (one fewer than the number of levels). Further data exploration reveals that, because of missing data, Stepwise considers only eight levels.

5. In the data table’s Columns panel, select the columns of interest: Rotten Tomatoes Score, Audience Score, and World Gross.

6. Select Analyze > Screening > Explore Missing Values.

7. Click Y, Columns and click OK.

Figure 5.15 Missing Columns Report

Note that Rotten Tomatoes Score is missing in 2 rows, Audience Score is missing in 1 row, and World Gross is missing in 2 rows.

8. In the Missing Columns report, select the three columns listed under Column.

9. Click Select Rows.

In the data table’s Rows panel, you can see that three rows are selected. Because these three rows contain missing data on the predictors or the response, they are automatically excluded from the Stepwise analysis. Note that row 128 is the only entry in the Adventure category, so that category is removed from the analysis entirely. For the purposes of the Stepwise analysis, Genre therefore has only eight levels. Now that you have seen the effect of the missing data, you will conduct the Stepwise analysis.

10. Select Analyze > Fit Model.

11. Select Rotten Tomatoes Score, Audience Score, and Genre and click Add.

If you fit a standard least squares model to World Gross using Rotten Tomatoes Score, Audience Score, and Genre as predictors, the residuals are highly heteroscedastic. (This is typical of financial data.) Use a log transformation to better satisfy the regression assumption of equal variance.

12. Right-click World Gross in the Select Columns list and select Transform > Log.

The transformed variable Log[World Gross] appears at the bottom of the Select Columns list.

13. Select Log[World Gross] and click Y.

14. Select Stepwise from the Personality list.

15. Click Run.

Figure 5.16 Current Estimates Table Showing List of Model Terms

In the Current Estimates table, note that Genre is represented by seven terms. You will construct a model using two of these terms to see how they are defined.

16. Check the boxes under Entered next to the first two terms for Genre:

– Genre{Drama&Horror&Thriller&Fantasy&Romance&Comedy-Action&Animation}

– Genre{Drama&Horror&Thriller-Fantasy&Romance&Comedy}

17. Click Make Model.

Notice that the two terms are added as temporary transform columns to the Model Effects list in the Model Specification window. These columns are discussed in the next section.

Recall that, because of missing values, Genre enters the analysis as a nominal variable with eight levels. In the Current Estimates table, Genre is therefore represented by seven terms, one fewer than the number of levels. The first two terms that represent Genre are described below. Subsequent terms are defined in a similar fashion.
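The effect of missing rows on the level count can be sketched in a few lines of Python. This is only an illustration of the idea, not of how JMP handles rows internally: the rows, the field order, and the helper names complete_cases and level_count are hypothetical and are not taken from Hollywood Movies.jmp.

```python
# Hypothetical toy rows: (genre, rotten_tomatoes, audience, world_gross).
# None marks a missing value. These are NOT values from Hollywood Movies.jmp.
rows = [
    ("Action", 70, 80, 500.0),
    ("Action", 45, 60, 210.0),
    ("Drama", 85, 78, 95.0),
    ("Drama", None, 66, 40.0),      # missing predictor, so the row is dropped
    ("Comedy", 55, 62, 120.0),
    ("Adventure", 60, None, 75.0),  # the only Adventure row, and it is incomplete
]

def complete_cases(data):
    """Keep only rows that have no missing value in any field."""
    return [r for r in data if all(v is not None for v in r)]

def level_count(data):
    """Count the distinct genre levels present in the rows."""
    return len({r[0] for r in data})

kept = complete_cases(rows)
levels_before = level_count(rows)   # levels in the full table
levels_after = level_count(kept)    # levels that survive the exclusion
terms_after = levels_after - 1      # a k-level nominal effect needs k - 1 terms
```

In this toy table, dropping the incomplete rows removes the Adventure level altogether, which mirrors how Genre goes from nine levels to eight (and seven terms) in the example.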

The first term that appears is Genre{Drama&Horror&Thriller&Fantasy&Romance&Comedy-Action&Animation}. This term has the form Genre{A1-A2}, where A1 and A2 are sets of levels (joined by ampersands) separated by a minus sign. The notation indicates that the maximum separation in terms of sum of squares between groups occurs between the following two sets of levels:

• Drama, Horror, Thriller, Fantasy, Romance, and Comedy (represented by A1)

• Action and Animation (represented by A2)

If you include the term Genre{Drama&Horror&Thriller&Fantasy&Romance&Comedy-Action&Animation} in a model, a temporary transform column representing that term is used in the model. The column contains the following values:

• 1 for Drama, Horror, Thriller, Fantasy, Romance, and Comedy

• -1 for Action and Animation

The second term that appears is Genre{Drama&Horror&Thriller-Fantasy&Romance&Comedy}. All of the levels in this term belong to A1, the first set of levels in the first term. The notation contrasts the following two sets of levels:

• Drama, Horror, and Thriller

• Fantasy, Romance, and Comedy

Among all possible splits within A1 (Drama, Horror, Thriller, Fantasy, Romance, and Comedy) and within A2 (Action and Animation), the algorithm determines that this split has the largest sum of squares between groups.

If you include this term in a model, a temporary transform column representing that term is used in the model. The column contains the following values:

• 1 for Drama, Horror, and Thriller

• -1 for Fantasy, Romance, and Comedy

• 0 for Action and Animation
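The coding of the two terms described above can be written out as a small Python sketch. The sets A1 and A2 follow the text; the names B1, B2, and contrast_code are hypothetical and are introduced only for illustration. The sketch shows the values the columns take, not how JMP stores its temporary transform columns.

```python
# Level sets for the first term, as described in the text.
A1 = {"Drama", "Horror", "Thriller", "Fantasy", "Romance", "Comedy"}
A2 = {"Action", "Animation"}

# Level sets for the second term (B1 and B2 are hypothetical names).
B1 = {"Drama", "Horror", "Thriller"}
B2 = {"Fantasy", "Romance", "Comedy"}

def contrast_code(level, plus_set, minus_set):
    """Return 1 for levels in plus_set, -1 for levels in minus_set, else 0."""
    if level in plus_set:
        return 1
    if level in minus_set:
        return -1
    return 0

levels = sorted(A1 | A2)  # the eight Genre levels used in the analysis

# One coded value per level for each of the two terms.
term1_column = {g: contrast_code(g, A1, A2) for g in levels}
term2_column = {g: contrast_code(g, B1, B2) for g in levels}

for g in levels:
    print(f"{g:10s} term1={term1_column[g]:2d} term2={term2_column[g]:2d}")
```

Because every level belongs to either A1 or A2, the first column never contains 0. The second column is 0 exactly for the levels that the second term does not involve.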

The splitting of terms continues, based on the sum of squares between groups criterion. The hierarchy that leads to the definition of the terms is illustrated in Figure 5.17.

Figure 5.17 Tree Showing Splits Used in Hierarchical Coding
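A minimal Python sketch of this kind of hierarchical splitting follows. It rests on several assumptions: the text states only that splits are chosen by the sum of squares between groups, so the exhaustive search over two-way splits, the way the sum of squares is computed within each group, and the toy data with levels "a" through "d" are all hypothetical. JMP's actual search procedure may differ.

```python
from itertools import combinations

def between_ss(left, right, data):
    """Between-group sum of squares for a two-way split of levels."""
    y_left = [y for g in left for y in data[g]]
    y_right = [y for g in right for y in data[g]]
    pooled = y_left + y_right
    grand = sum(pooled) / len(pooled)
    mean_left = sum(y_left) / len(y_left)
    mean_right = sum(y_right) / len(y_right)
    return (len(y_left) * (mean_left - grand) ** 2
            + len(y_right) * (mean_right - grand) ** 2)

def best_split(levels, data):
    """Try every two-way split of the levels; return (ss, left, right)."""
    levels = sorted(levels)
    first, rest = levels[0], levels[1:]
    best = None
    # Keep `first` on the left so that each split is visited exactly once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(g for g in rest if g not in extra)
            ss = between_ss(left, right, data)
            if best is None or ss > best[0]:
                best = (ss, left, right)
    return best

def hierarchical_terms(data):
    """Repeatedly split the group whose best split has the largest SS."""
    groups = [tuple(sorted(data))]
    terms = []
    while any(len(g) > 1 for g in groups):
        candidates = [(best_split(g, data), g) for g in groups if len(g) > 1]
        (ss, left, right), parent = max(candidates, key=lambda c: c[0][0])
        terms.append((left, right))
        groups.remove(parent)
        groups += [left, right]
    return terms

# Hypothetical responses per level (not from Hollywood Movies.jmp).
toy = {"a": [1.0, 2.0], "b": [1.5, 2.5], "c": [9.0, 10.0], "d": [10.0, 11.0]}
terms = hierarchical_terms(toy)
for left, right in terms:
    print("&".join(left) + "-" + "&".join(right))
```

With four levels, the sketch produces three terms, in keeping with the rule that a nominal effect with k levels is represented by k - 1 terms.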

When you use the Combine rule or the Restrict rule, a term cannot enter the model unless all the terms above it in the hierarchy have been entered. For example, if you enter Genre{Action-Animation}, then JMP enters Genre{Drama&Horror&Thriller&Fantasy&Romance&Comedy-Action&Animation} as well.

When you use the Whole Effects rule and enter any one of the Genre terms, all of the Genre terms are entered.
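The effect of these rules can be sketched in Python as follows. The sketch covers only the three Genre terms named in this example, because the remaining four terms are not listed here. The PARENT mapping and the function terms_to_enter are hypothetical, and the sketch follows the description above without distinguishing between Combine and Restrict.

```python
ROOT = "Genre{Drama&Horror&Thriller&Fantasy&Romance&Comedy-Action&Animation}"

# Each term maps to the term directly above it in the hierarchy.
# Only the three terms named in the text are included (an assumption of
# this sketch); the full hierarchy has seven terms.
PARENT = {
    ROOT: None,
    "Genre{Drama&Horror&Thriller-Fantasy&Romance&Comedy}": ROOT,
    "Genre{Action-Animation}": ROOT,
}

def terms_to_enter(term, rule):
    """Return the set of terms that end up entered when `term` is entered."""
    if rule == "Whole Effects":
        # Entering any one Genre term enters all of them.
        return set(PARENT)
    if rule in ("Combine", "Restrict"):
        # The term needs every term above it in the hierarchy.
        needed = {term}
        parent = PARENT[term]
        while parent is not None:
            needed.add(parent)
            parent = PARENT[parent]
        return needed
    raise ValueError(f"rule not covered by this sketch: {rule}")

entered = terms_to_enter("Genre{Action-Animation}", "Combine")
print(sorted(entered))
```

Entering Genre{Action-Animation} under the Combine rule returns that term together with the first Genre term, which matches the example given above.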
