Hierarchical Clustering
The Hierarchical option groups the points (rows) of a JMP table into clusters whose values are close to each other relative to those of other clusters. Hierarchical clustering is a process that starts with each point in its own cluster. At each step, the two clusters that are closest together are combined into a single cluster. This process continues until there is only one cluster containing all the points. This type of clustering is well suited to smaller data sets (a few hundred observations).
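As a rough illustration of this agglomerative process (a sketch using SciPy, not JMP's implementation), the following code starts with each row as its own cluster, merges the closest clusters step by step, and then cuts the resulting tree into a chosen number of clusters:

```python
# Conceptual sketch of agglomerative clustering using SciPy (not JMP's code):
# every row begins as its own cluster and the two closest clusters are
# repeatedly merged until a single cluster remains.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 rows, 3 numeric columns

Z = linkage(X, method="ward")         # Z records each pairwise merge and its joining distance
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```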
Hierarchical clustering enables you to sort clusters by their mean value by specifying an Ordering column. One way to use this feature is to complete a Principal Components analysis (using Multivariate) and save the first principal component to use as an Ordering column. The clusters are then sorted by these values.
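A sketch of the ordering idea, assuming scikit-learn for the principal components step (the workflow below only mimics saving the first principal component as an Ordering column; it is not JMP's Multivariate platform):

```python
# Compute the first principal component score per row; in JMP this column
# would be saved and supplied as the Ordering column. Here we only show the
# row order it implies.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))

pc1 = PCA(n_components=1).fit_transform(X).ravel()  # first principal component score
order = np.argsort(pc1)                             # rows sorted by their PC1 value
print(order)
```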
For Hierarchical clustering, select Hierarchical from the Options list on the platform launch window and then select one of the clustering distance options: Average, Centroid, Ward, Single, Complete, or Fast Ward. The clustering methods differ in how the distance between two clusters is computed. These methods are discussed under Statistical Details for Hierarchical Clustering.
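To see how the choice of method changes the joins, here is a hedged SciPy sketch of several of the methods listed above (SciPy has no separate Fast Ward; "ward" stands in for both Ward options):

```python
# The linkage method determines how the distance between two clusters is
# computed; different methods can produce quite different trees.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 2))

for method in ("average", "centroid", "ward", "single", "complete"):
    Z = linkage(X, method=method)
    print(method, "final join distance:", round(Z[-1, 2], 3))
```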
Select this option if your data are summarized by Object ID. The Data as summarized option calculates group means and treats them as the input data.
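A minimal sketch of the idea, assuming a pandas table with a hypothetical ObjectID column: the group means are computed and those means, rather than the raw rows, are clustered.

```python
# Summarize by Object ID: compute one row of column means per object and
# cluster the means. "ObjectID", "x1", and "x2" are illustrative names.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ObjectID": np.repeat(["A", "B", "C", "D"], 5),
    "x1": rng.normal(size=20),
    "x2": rng.normal(size=20),
})

means = df.groupby("ObjectID").mean()            # one row of means per object
Z = linkage(means.to_numpy(), method="ward")     # cluster the group means
print(means)
```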
Select this option if you have a data table of distances instead of raw data. If your raw data consists of n observations, the distance table should have n rows and n columns, with the values being the distances between the observations. The distance table needs to have an additional column giving a unique identifier (such as row number) that matches the column names of the other n columns. The diagonal elements of the table should be zero, since the distance between a point and itself is zero. The table can be square (both upper and lower elements), or it can be upper or lower triangular. If using a square table, the platform gives a warning if the table is not symmetric. For an example of what the distance table should look like, use the option Save Distance Matrix.
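The layout described above can be illustrated with a small sketch (column names here are hypothetical): a symmetric n x n table of distances with a zero diagonal, plus an identifier column whose values match the names of the n distance columns.

```python
# Build a distance table in the format described above from raw data.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 3))

D = squareform(pdist(X))                          # symmetric matrix, zero diagonal
names = [f"Obs{i+1}" for i in range(len(D))]
dist_table = pd.DataFrame(D, columns=names)
dist_table.insert(0, "ID", names)                 # identifier column matching the column names
print(dist_table)
```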
By default, data in each column are first standardized by subtracting the column mean and dividing by the column standard deviation. Uncheck the Standardize Data check box if you do not want the cluster distances computed on standardized values.
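A minimal sketch of this default standardization, applied column by column before distances are computed:

```python
# Standardize each column: subtract the column mean, divide by the column
# standard deviation.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=10.0, scale=3.0, size=(20, 3))

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # sample standard deviation
print(X_std.mean(axis=0).round(6), X_std.std(axis=0, ddof=1).round(6))
```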
The Standardize Robustly option reduces the influence of outliers on the estimates of the mean and standard deviation. Outliers in a column inflate the standard deviation, thereby deflating the standardized values in that column and giving it less influence in determining multivariate distances.
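As an illustration only, one common robust choice is to center with the median and scale with the MAD; the exact estimators JMP uses may differ.

```python
# Robust standardization sketch (an assumption, not necessarily JMP's
# estimators): median center and MAD scale, so a few extreme values do not
# inflate the scale estimate.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=100)
x[:3] = 50.0                                          # inject a few outliers

mad = np.median(np.abs(x - np.median(x))) * 1.4826    # MAD scaled to match a normal sd
x_robust = (x - np.median(x)) / mad
print("classical sd:", round(x.std(ddof=1), 2), " robust scale:", round(mad, 2))
```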
Use the Missing value imputation option to impute missing values. Missing value imputation is done assuming that there are no clusters, that the data come from a single multivariate normal distribution, and that the values are missing completely at random. These assumptions are usually not reasonable in practice. Thus, this feature must be used with caution, but it can produce more informative results than discarding most of your data.
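A conceptual analogue (not JMP's routine), assuming scikit-learn's IterativeImputer: each column with gaps is modeled from the other columns under a single multivariate structure, much like the single-distribution assumption described above.

```python
# Model-based imputation sketch: fill missing values by iteratively
# predicting each column from the others.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0, 0], np.eye(3) + 0.5, size=50)
X[rng.random(X.shape) < 0.1] = np.nan             # knock out ~10% of values at random

X_imputed = IterativeImputer(random_state=0).fit_transform(X)
print(np.isnan(X_imputed).sum())                  # 0: all gaps filled
```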
Use the Add Spatial Measures option when your data is stacked and contains two attributes that correspond to spatial coordinates (X and Y, for example). This option adds circle, pie, and streak spatial measures to aid in clustering defect patterns.
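Purely as a hypothetical illustration of what such measures might capture, the sketch below derives radius and angle columns from stacked X and Y coordinates; the measures JMP actually adds may be computed differently.

```python
# Illustrative spatial features from stacked (X, Y) coordinates; the column
# names "radius" and "angle" are hypothetical, not JMP's.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
df = pd.DataFrame({"X": rng.uniform(-1, 1, 200), "Y": rng.uniform(-1, 1, 200)})

df["radius"] = np.hypot(df["X"], df["Y"])         # circle-like measure: distance from center
df["angle"] = np.arctan2(df["Y"], df["X"])        # pie-like measure: angular position
print(df.head())
```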
This group contains options for scaling the dendrogram. Distance Scale shows the actual joining distance at each join point, and is the same scale used on the plot produced by the Distance Graph command. Even Spacing shows the distance between join points as equal. Geometric Spacing is useful when there are many clusters and you want the clusters near the top of the tree to be more visible than those at the bottom. (Geometric Spacing is the default when there are more than 256 rows.)
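A sketch that mimics the three scalings with SciPy's dendrogram (actual distances, rank-based even spacing, and a log axis as a stand-in for geometric spacing); this imitates the idea of the options above, not JMP's drawing code.

```python
# Draw the same tree three ways: actual join distances, evenly spaced joins,
# and a log-scaled (geometric-like) distance axis.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(9)
Z = linkage(rng.normal(size=(30, 2)), method="ward")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

dendrogram(Z, ax=axes[0], no_labels=True)         # distance scale: actual join distances
axes[0].set_title("Distance Scale")

Z_even = Z.copy()
Z_even[:, 2] = np.arange(1, len(Z) + 1)           # even spacing: replace distances with ranks
dendrogram(Z_even, ax=axes[1], no_labels=True)
axes[1].set_title("Even Spacing")

dendrogram(Z, ax=axes[2], no_labels=True)         # geometric-like: log-scaled distance axis
axes[2].set_yscale("log")
axes[2].set_title("Geometric Spacing (approx.)")

plt.tight_layout()
plt.show()
```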