Discriminant Analysis

Discriminant analysis predicts the membership of each document in a group or category, using the columns of the document term matrix (DTM) as predictors. The groups are the levels of a response column, so when you select the Discriminant Analysis option, you must select a response column that contains categories or groups. For more information about discriminant analysis, see Discriminant Analysis in Multivariate Methods.

The discriminant analysis method in the Text Explorer platform is based on a singular value decomposition of the centered DTM. Each group of the response column has its own group mean that is used to center the DTM. Because this method takes advantage of the sparsity of the DTM, it is faster than the Discriminant Analysis platform.
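The group-wise centering described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function name center_by_group and the toy values are hypothetical, and the sketch covers only the centering step, not the singular value decomposition or any sparse-matrix optimization.

```python
from collections import defaultdict

def center_by_group(dtm, groups):
    """Subtract each group's column means from the rows in that group.

    dtm    : list of rows, one row of term values per document
    groups : list of response levels, one per document
    (Hypothetical helper for illustration only.)
    """
    n_terms = len(dtm[0])
    sums = defaultdict(lambda: [0.0] * n_terms)
    counts = defaultdict(int)
    for row, g in zip(dtm, groups):
        counts[g] += 1
        for j, value in enumerate(row):
            sums[g][j] += value
    # Mean of every term within each group
    means = {g: [s / counts[g] for s in sums[g]] for g in sums}
    # Each document is centered by the mean of its own group
    centered = [[value - means[g][j] for j, value in enumerate(row)]
                for row, g in zip(dtm, groups)]
    return centered, means

# Toy DTM: 4 documents x 3 terms (made-up counts)
dtm = [[2, 0, 1],
       [0, 0, 1],
       [0, 3, 0],
       [0, 1, 2]]
groups = ["A", "A", "B", "B"]
centered, means = center_by_group(dtm, groups)
```

Centering turns the zero cells into nonzero values, so a dense centered matrix loses the sparsity of the original DTM. An implementation that exploits sparsity would presumably avoid forming the centered matrix explicitly, although the source does not describe how.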

Discriminant Analysis Specifications Window

The Discriminant Analysis option in the Text Explorer platform is based on the Document Term Matrix (DTM). The DTM is formed by creating a column for each term in the Term List (up to a specified Maximum Number of Terms). Each text document (equivalent to a row in the data table) corresponds to a row of the DTM. The values in the cells of the DTM depend on the type of weighting specified by the user in the Specifications window.
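As a rough illustration of how a DTM is laid out, the following Python sketch builds one row per document and one column per term. The function build_dtm, the whitespace tokenizer, and the two weighting names used here ("frequency" and "binary") are assumptions made for the example; the weightings that the platform actually offers are described in Document Term Matrix Specifications Window.

```python
from collections import Counter

def build_dtm(documents, term_list, weighting="frequency"):
    """Return a document term matrix as a list of rows.

    Hypothetical helper: tokenization is a plain lowercase
    whitespace split, which is far simpler than real text parsing.
    """
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        if weighting == "frequency":
            # Cell = number of times the term occurs in the document
            rows.append([counts[t] for t in term_list])
        elif weighting == "binary":
            # Cell = 1 if the term occurs at all, otherwise 0
            rows.append([1 if counts[t] > 0 else 0 for t in term_list])
        else:
            raise ValueError("unsupported weighting: " + weighting)
    return rows

docs = ["good good service", "bad service", "good food"]
terms = ["good", "service", "bad"]
dtm_freq = build_dtm(docs, terms)
dtm_bin = build_dtm(docs, terms, weighting="binary")
```

The word "food" does not appear in any column because it is not in the term list passed to the function.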

When you select the Discriminant Analysis option from the Text Explorer red triangle menu, the Specifications window appears with the following options:

Maximum Number of Terms

The maximum number of terms included in the discriminant analysis.

Minimum Term Frequency

The minimum number of occurrences a term must have to be included in the discriminant analysis.

Weighting

The weighting scheme that determines the values that go into the cells of the document term matrix. The weighting scheme options are described in Document Term Matrix Specifications Window.

Number of Singular Vectors

The number of singular vectors in the discriminant analysis. The default value is the minimum of the number of documents, the number of terms, and 100.
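The options above can be sketched in Python as follows. The sketch rests on assumptions that the source does not confirm: Minimum Term Frequency is treated as a total occurrence count across all documents, and when more terms qualify than Maximum Number of Terms allows, the most frequent terms are kept. Both function names are hypothetical.

```python
from collections import Counter

def select_terms(tokenized_docs, max_terms, min_frequency):
    """Pick the terms that become DTM columns (illustrative only)."""
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    # Assumption: frequency = total occurrences over all documents
    eligible = [(t, c) for t, c in totals.items() if c >= min_frequency]
    # Assumption: keep the most frequent terms, ties in alphabetical order
    eligible.sort(key=lambda tc: (-tc[1], tc[0]))
    return [t for t, _ in eligible[:max_terms]]

def default_num_singular_vectors(n_documents, n_terms, cap=100):
    """Default stated in the text: min(documents, terms, 100)."""
    return min(n_documents, n_terms, cap)

tokenized = [["good", "good", "service"],
             ["bad", "service"],
             ["good", "food"]]
terms = select_terms(tokenized, max_terms=2, min_frequency=2)
k_small = default_num_singular_vectors(len(tokenized), len(terms))
k_large = default_num_singular_vectors(500, 2000)
```

For a small corpus the default is limited by the number of documents or terms; for a large corpus the cap of 100 applies.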

Discriminant Analysis Report

By default, the Discriminant Analysis report in the Text Explorer platform contains two open reports: the Classification Summary and the Discriminant Scores. The other reports are initially closed.

The Discriminant Analysis report contains the following reports:

Term Means

Provides a table of the terms used in the discriminant analysis. The terms correspond to the columns of the DTM. The table contains the means in each group for each term, as well as the overall mean and weighted standard deviation for each term.

Squared Distances to Each Group

Provides a table that contains the squared Mahalanobis distances to each group for each document. For more information about Mahalanobis distances, see Outlier Analysis in Multivariate Methods.

Probabilities to Each Group

Provides a table that contains the probability that a document belongs to each group.

Classification Summary

Provides a report that summarizes the discriminant scores. This report corresponds to the Score Summaries report in the Discriminant Analysis platform report.

Discriminant Scores

Provides a table of the predicted classification of each document and other supporting information. This table corresponds to the Discriminant Scores table in the Discriminant Analysis platform report.
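The relationship among the Squared Distances to Each Group, Probabilities to Each Group, and Discriminant Scores reports can be sketched with a small Python example. It assumes a common within-group covariance matrix and equal prior probabilities, under which the posterior probability of a group is proportional to exp(-d²/2), where d² is the squared Mahalanobis distance to that group. The function names and toy values are hypothetical, and the platform's exact computation (for example, the space in which distances are measured) may differ.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def squared_mahalanobis(x, mean, cov):
    """(x - mean)' * inverse(cov) * (x - mean)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    z = solve(cov, diff)  # z = inverse(cov) @ diff
    return sum(d * zi for d, zi in zip(diff, z))

def posterior_probabilities(sq_distances, priors=None):
    """Posterior probabilities, assuming a shared covariance matrix."""
    groups = list(sq_distances)
    if priors is None:  # assumption: equal priors
        priors = {g: 1.0 / len(groups) for g in groups}
    d_min = min(sq_distances.values())  # subtract for numerical stability
    weights = {g: priors[g] * math.exp(-0.5 * (sq_distances[g] - d_min))
               for g in groups}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Toy values: one document scored against two group means
cov = [[2.0, 1.0], [1.0, 2.0]]
group_means = {"A": [0.0, 0.0], "B": [3.0, 3.0]}
doc = [1.0, 1.0]
sq_dist = {g: squared_mahalanobis(doc, m, cov)
           for g, m in group_means.items()}
probs = posterior_probabilities(sq_dist)
most_likely = max(probs, key=probs.get)  # predicted classification
```

The document is closer to group A than to group B, so group A receives the larger probability and becomes the predicted classification.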

Discriminant Analysis Report Options

The Discriminant Analysis red triangle menu contains the following options:

Canonical Plot

Shows or hides a plot of the documents and group means in canonical space. Canonical space is the space that most separates the groups. If there are more than two levels of the response variable, you must specify the number of canonical coordinates. If you specify more than two canonical coordinates, this option produces a matrix of canonical plots.

Save Probabilities

Saves a probability column to the data table for each response level as well as a column that contains the most likely response. The Most Likely response column contains the level with the highest probability based on the model.

Each probability column gives the posterior probability of an observation’s membership in that level of the response. The Response Probability column property is saved to each probability column. For more information about the Response Probability column property, see Column Properties in Using JMP.

Save Probability Formulas

Saves formula columns to the data table for the prediction of the most likely response. The first saved column contains a formula that uses the Text Score() function to calculate the probability for each response level. There are also columns that contain probabilities for each response level as well as a column that contains the predicted response.

Save Canonical Scores

Saves columns to the data table that contain the scores from canonical space for each observation. Canonical space is the space that most separates the groups. The column for the kth canonical score is named Canonical<k>.

Remove

Removes the Discriminant Analysis report from the Text Explorer report window.
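To make the idea of canonical space in the Canonical Plot and Save Canonical Scores options more concrete, the following Python sketch covers only the two-group case, where a single direction separates the groups. That direction is proportional to the inverse of the pooled within-group covariance matrix times the difference of the group means (Fisher's linear discriminant). The sketch is limited to two columns, uses hypothetical function names and toy data, and does not reproduce the centering or scaling that the platform applies to saved canonical scores.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_within_cov_2d(groups):
    """Pooled within-group covariance for two-column data."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    dof = 0
    for rows in groups.values():
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        dof += len(rows) - 1
    return [[v / dof for v in row] for row in s]

def fisher_direction_2d(groups, first, second):
    """Direction w = inverse(S_w) @ (mean_second - mean_first)."""
    sw = pooled_within_cov_2d(groups)
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    ma, mb = mean_vector(groups[first]), mean_vector(groups[second])
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Toy data: two groups of three observations each
data = {"A": [[1, 1], [2, 3], [3, 2]],
        "B": [[5, 4], [6, 6], [7, 5]]}
w = fisher_direction_2d(data, "A", "B")
# Unscaled score of each observation along the separating direction
scores = {g: [w[0] * r[0] + w[1] * r[1] for r in rows]
          for g, rows in data.items()}
```

In this toy example, every score in group B exceeds every score in group A, which is the sense in which the canonical direction separates the groups. With more than two groups there are several such directions, which is why a number of canonical coordinates must be specified.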
