Clustering is a multivariate technique for grouping together rows that share similar values. It can use any number of variables. The variables must be numeric, and numerical differences between their values must be meaningful. The common situation is that the data are not scattered evenly through n-dimensional space; instead they form clumps, locally dense areas, modes, or clusters. Identifying these clusters goes a long way toward characterizing the distribution of values.
Hierarchical clustering is appropriate for small tables, up to several thousand rows. It combines rows in a hierarchical sequence portrayed as a tree. In JMP, the tree, also called a dendrogram, is a dynamic, responsive graph. You can choose the number of clusters that you like after the tree is built.
K-means clustering is appropriate for larger tables, up to hundreds of thousands of rows. It makes a fairly good guess at cluster seed points. It then iterates, alternately assigning points to clusters and recalculating cluster centers. You have to specify the number of clusters before you start the process.
Hierarchical clustering is also called agglomerative clustering because it is a combining process. The method starts with each point (row) as its own cluster. At each step the clustering process calculates the distance between each pair of clusters and combines the two clusters that are closest together. This combining continues until all the points are in one final cluster. The user then chooses the number of clusters that seems right and cuts the clustering tree at that point. The combining record is portrayed as a tree, called a dendrogram. The single points are the leaves, the final cluster of all points is the trunk, and the intermediate cluster combinations are the branches. Since the process starts with n(n - 1)/2 pairwise distances for n points, this method becomes too expensive in memory and time when n is large.
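Since JMP's own implementation is not shown here, the following is a minimal Python sketch of the same agglomerative idea using SciPy's hierarchy routines; the sample data, the "ward" linkage choice, and the cut into three clusters are all illustrative assumptions.

    # Minimal sketch of agglomerative (hierarchical) clustering with SciPy.
    # This illustrates the general combining process, not JMP's implementation.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    # Three loose clumps of points in two dimensions (made-up data).
    points = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(20, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(20, 2)),
    ])

    # Start with each row as its own cluster and repeatedly merge the two
    # closest clusters; Z records the full sequence of merges (the tree).
    Z = linkage(points, method="ward")

    # "Cut" the tree afterward to obtain a chosen number of clusters.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels[:10])

    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree when a
    # plotting backend is available.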
K-means clustering is an iterative follow-the-leader strategy. First, the user must specify the number of clusters, k. Then a search algorithm finds k points in the data, called seeds, that are not close to each other. Each seed is then treated as a cluster center. The routine goes through the points (rows) and assigns each point to the cluster whose center is closest. For each cluster, a new cluster center is formed as the mean (centroid) of the points currently in the cluster. This process continues, alternating between assigning points to clusters and recalculating cluster centers, until the clusters become stable.
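The alternation described above can be sketched in a few lines of Python with numpy. This is only an illustration of the assignment/update loop, not JMP's routine; in particular, the seeds here are chosen at random rather than by the seed-finding search the text describes.

    # Minimal numpy sketch of the k-means alternation: assign each point to
    # its nearest center, then recompute each center as the mean (centroid)
    # of its points. Random seeding is an assumption for brevity.
    import numpy as np

    def k_means(points, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: index of the closest center for every row.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center becomes the centroid of its cluster.
            new_centers = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):   # clusters are stable
                break
            centers = new_centers
        return centers, labels

    # Example: centers, labels = k_means(points, k=3) for any (n, p) numeric array.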
Normal mixtures clustering, like k-means clustering, begins with a user-defined number of clusters and then selects distant seeds. JMP uses the cluster centers chosen by k-means as seeds. However, each point, rather than being classified into one group, is assigned a probability of being in each group.
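As a rough illustration of this soft (probabilistic) assignment, the sketch below uses scikit-learn's GaussianMixture as a stand-in for a normal mixtures fit; the data are made up, and init_params="kmeans" only mirrors the idea of seeding the mixture from k-means centers.

    # Sketch of normal mixtures clustering with scikit-learn (an assumption,
    # not JMP's procedure). The key point is that each row receives a
    # probability of membership in every cluster rather than one hard label.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    points = np.vstack([
        rng.normal((0, 0), 0.7, size=(50, 2)),
        rng.normal((4, 4), 0.7, size=(50, 2)),
    ])

    gm = GaussianMixture(n_components=2, init_params="kmeans", random_state=1)
    gm.fit(points)

    probs = gm.predict_proba(points)   # one row of membership probabilities per point
    print(probs[:3].round(3))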
SOMs are a variation on k-means where the cluster centers are laid out on a grid. Clusters and points close together on the grid are meant to be close together in the multivariate space. See Self Organizing Maps.
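For concreteness, here is a very small self-organizing map sketch in Python; it is an illustrative assumption rather than JMP's SOM implementation, showing only how grid neighborhoods tie nearby cluster centers together.

    # Tiny self-organizing map sketch in numpy. Cluster centers live on a
    # 2-D grid; each training point pulls its best-matching unit and that
    # unit's grid neighbors toward it, so nearby grid cells end up
    # representing nearby regions of the data space.
    import numpy as np

    def train_som(points, grid_shape=(4, 4), n_iter=200, lr=0.3, radius=1.5, seed=0):
        rng = np.random.default_rng(seed)
        rows, cols = grid_shape
        # Grid coordinates of each unit and randomly initialized weight vectors.
        coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
        weights = points[rng.choice(len(points), size=rows * cols, replace=False)].astype(float)
        for _ in range(n_iter):
            x = points[rng.integers(len(points))]
            bmu = np.linalg.norm(weights - x, axis=1).argmin()   # best-matching unit
            # Neighborhood strength decays with distance on the grid, not in data space.
            grid_dist = np.linalg.norm(coords - coords[bmu], axis=1)
            h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += lr * h[:, None] * (x - weights)
        return weights.reshape(rows, cols, -1)

    # Example: units = train_som(points, grid_shape=(4, 4)) for any (n, p) numeric array.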
K-means, normal mixtures, and SOM clustering are doubly iterative processes. The clustering process iterates between two steps in a particular implementation of the EM algorithm:
The expectation step of mixture clustering assigns each observation a probability of belonging to each cluster.