Hierarchical Clustering
The Hierarchical option groups the points (rows) of a JMP table into clusters whose values are close to each other relative to those of other clusters. Hierarchical clustering is a process that starts with each point in its own cluster. At each step, the two clusters that are closest together are combined into a single cluster. This process continues until there is only one cluster containing all the points. This type of clustering is well suited to smaller data sets (a few hundred observations).
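As a rough illustration of this agglomerative process (a sketch using SciPy, not JMP's implementation), the following code starts with each row as its own cluster, merges the closest clusters step by step, and then cuts the resulting tree into a chosen number of clusters:

```python
# Conceptual sketch of agglomerative clustering using SciPy (not JMP's code):
# every row begins as its own cluster and the two closest clusters are
# repeatedly merged until a single cluster remains.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 rows, 3 numeric columns

Z = linkage(X, method="ward")         # Z records each pairwise merge and its joining distance
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```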
Hierarchical clustering enables you to sort clusters by their mean value by specifying an Ordering column. One way to use this feature is to complete a Principal Components analysis (using Multivariate) and save the first principal component to use as an Ordering column. The clusters are then sorted by these values.
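A sketch of the ordering idea, assuming scikit-learn for the principal components step (the workflow below only mimics saving the first principal component as an Ordering column; it is not JMP's Multivariate platform):

```python
# Compute the first principal component score per row; in JMP this column
# would be saved and supplied as the Ordering column. Here we only show the
# row order it implies.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))

pc1 = PCA(n_components=1).fit_transform(X).ravel()  # first principal component score
order = np.argsort(pc1)                             # rows sorted by their PC1 value
print(order)
```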
For Hierarchical clustering, select Hierarchical from the Options list on the platform launch window and then select one of the clustering distance options: Average, Centroid, Ward, Single, Complete, or Fast Ward. The clustering methods differ in how the distance between two clusters is computed. These methods are discussed under Statistical Details for Hierarchical Clustering.
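To see how the choice of method changes the joins, here is a hedged SciPy sketch of several of the methods listed above (SciPy has no separate Fast Ward; "ward" stands in for both Ward options):

```python
# The linkage method determines how the distance between two clusters is
# computed; different methods can produce quite different trees.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 2))

for method in ("average", "centroid", "ward", "single", "complete"):
    Z = linkage(X, method=method)
    print(method, "final join distance:", round(Z[-1, 2], 3))
```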
Select this option if your data are summarized by Object ID. The Data as summarized option calculates group means and treats them as the input data.
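A minimal sketch of the idea, assuming a pandas table with a hypothetical ObjectID column: the group means are computed and those means, rather than the raw rows, are clustered.

```python
# Summarize by Object ID: compute one row of column means per object and
# cluster the means. "ObjectID", "x1", and "x2" are illustrative names.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ObjectID": np.repeat(["A", "B", "C", "D"], 5),
    "x1": rng.normal(size=20),
    "x2": rng.normal(size=20),
})

means = df.groupby("ObjectID").mean()            # one row of means per object
Z = linkage(means.to_numpy(), method="ward")     # cluster the group means
print(means)
```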
Select this option if you have a data table of distances instead of raw data. If your raw data consists of n observations, the distance table should have n rows and n columns, with the values being the distances between the observations. The distance table needs to have an additional column giving a unique identifier (such as row number) that matches the column names of the other n columns. The diagonal elements of the table should be zero, since the distance between a point and itself is zero. The table can be square (both upper and lower elements), or it can be upper or lower triangular. If using a square table, the platform gives a warning if the table is not symmetric. For an example of what the distance table should look like, use the option Save Distance Matrix.
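The layout described above can be illustrated with a small sketch (column names here are hypothetical): a symmetric n x n table of distances with a zero diagonal, plus an identifier column whose values match the names of the n distance columns.

```python
# Build a distance table in the format described above from raw data.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
X = rng.normal(size=(4, 3))

D = squareform(pdist(X))                          # symmetric matrix, zero diagonal
names = [f"Obs{i+1}" for i in range(len(D))]
dist_table = pd.DataFrame(D, columns=names)
dist_table.insert(0, "ID", names)                 # identifier column matching the column names
print(dist_table)
```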
By default, data in each column are first standardized by subtracting the column mean and dividing by the column standard deviation. Uncheck the Standardize Data check box if you do not want the cluster distances computed on standardized values.
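A minimal sketch of this default standardization, applied column by column before distances are computed:

```python
# Standardize each column: subtract the column mean, divide by the column
# standard deviation.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=10.0, scale=3.0, size=(20, 3))

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # sample standard deviation
print(X_std.mean(axis=0).round(6), X_std.std(axis=0, ddof=1).round(6))
```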
The Standardize Robustly option reduces the influence of outliers on the estimates of the mean and standard deviation. Outliers in a column inflate the standard deviation, thereby deflating the standardized values in that column and giving it less influence in determining multivariate distances.
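As an illustration only, one common robust choice is to center with the median and scale with the MAD; the exact estimators JMP uses may differ.

```python
# Robust standardization sketch (an assumption, not necessarily JMP's
# estimators): median center and MAD scale, so a few extreme values do not
# inflate the scale estimate.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=100)
x[:3] = 50.0                                          # inject a few outliers

mad = np.median(np.abs(x - np.median(x))) * 1.4826    # MAD scaled to match a normal sd
x_robust = (x - np.median(x)) / mad
print("classical sd:", round(x.std(ddof=1), 2), " robust scale:", round(mad, 2))
```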
Use the Missing value imputation option to impute missing values. Missing value imputation is done assuming that there are no clusters, that the data come from a single multivariate normal distribution, and that the values are missing completely at random. These assumptions are usually not reasonable in practice. Thus, this feature must be used with caution, but it can produce more informative results than discarding most of your data.
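A conceptual analogue (not JMP's routine), assuming scikit-learn's IterativeImputer: each column with gaps is modeled from the other columns under a single multivariate structure, much like the single-distribution assumption described above.

```python
# Model-based imputation sketch: fill missing values by iteratively
# predicting each column from the others.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0, 0], np.eye(3) + 0.5, size=50)
X[rng.random(X.shape) < 0.1] = np.nan             # knock out ~10% of values at random

X_imputed = IterativeImputer(random_state=0).fit_transform(X)
print(np.isnan(X_imputed).sum())                  # 0: all gaps filled
```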
Use the Add Spatial Measures option when your data is stacked and contains two attributes that correspond to spatial coordinates (X and Y, for example). This option adds circle, pie, and streak spatial measures to aid in clustering defect patterns.
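Purely as a hypothetical illustration of what such measures might capture, the sketch below derives radius and angle columns from stacked X and Y coordinates; the measures JMP actually adds may be computed differently.

```python
# Illustrative spatial features from stacked (X, Y) coordinates; the column
# names "radius" and "angle" are hypothetical, not JMP's.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
df = pd.DataFrame({"X": rng.uniform(-1, 1, 200), "Y": rng.uniform(-1, 1, 200)})

df["radius"] = np.hypot(df["X"], df["Y"])         # circle-like measure: distance from center
df["angle"] = np.arctan2(df["Y"], df["X"])        # pie-like measure: angular position
print(df.head())
```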
This group contains options for scaling the dendrogram. Distance Scale shows the actual joining distance at each join point, and is the same scale used on the plot produced by the Distance Graph command. Even Spacing shows the distance between join points as equal. Geometric Spacing is useful when there are many clusters and you want the clusters near the top of the tree to be more visible than those at the bottom. (Geometric Spacing is the default when there are more than 256 rows.)
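A sketch that mimics the three scalings with SciPy's dendrogram (actual distances, rank-based even spacing, and a log axis as a stand-in for geometric spacing); this imitates the idea of the options above, not JMP's drawing code.

```python
# Draw the same tree three ways: actual join distances, evenly spaced joins,
# and a log-scaled (geometric-like) distance axis.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(9)
Z = linkage(rng.normal(size=(30, 2)), method="ward")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

dendrogram(Z, ax=axes[0], no_labels=True)         # distance scale: actual join distances
axes[0].set_title("Distance Scale")

Z_even = Z.copy()
Z_even[:, 2] = np.arange(1, len(Z) + 1)           # even spacing: replace distances with ranks
dendrogram(Z_even, ax=axes[1], no_labels=True)
axes[1].set_title("Even Spacing")

dendrogram(Z, ax=axes[2], no_labels=True)         # geometric-like: log-scaled distance axis
axes[2].set_yscale("log")
axes[2].set_title("Geometric Spacing (approx.)")

plt.tight_layout()
plt.show()
```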