K-Means Clustering

The k-means approach to clustering uses an iterative, alternating fitting process to form the specified number of clusters. The method first selects a set of k points, called cluster seeds, as a first guess of the cluster means. Each observation is assigned to the nearest seed to form a set of temporary clusters. The seeds are then replaced by the cluster means, the points are reassigned, and the process continues until no further changes occur in the clusters. When the clustering process is finished, you see tables showing brief summaries of the clusters. The k-means approach is a special case of a general approach called the EM algorithm: E stands for Expectation (assigning points to the closest clusters in this case), and M stands for Maximization (recomputing the cluster means in this case).
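The alternating process described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function name kmeans, the choice to pass the seeds in explicitly, and the sample points are all assumptions made for the example.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Alternate assignment and mean-update steps until assignments stop changing.

    `seeds` is the first guess of the cluster means (hypothetical interface).
    """
    means = list(seeds)
    k = len(means)
    labels = None
    for _ in range(max_iter):
        # Assign each observation to the nearest current mean.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, means[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no further changes occur in the clusters
        labels = new_labels
        # Replace each seed by the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous mean if a cluster is empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, means

# Made-up two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, means = kmeans(points, seeds=[points[0], points[3]])
```

Because the result depends on the starting seeds, a different choice of seeds can lead to a different final clustering.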

The k-means method is intended for use with larger data tables, from approximately 200 to 100,000 observations. With smaller data tables, the results can be highly sensitive to the order of the observations in the data table.

K-Means clustering supports only numeric columns. It ignores modeling types (nominal and ordinal) and treats all numeric columns as continuous columns.

To see the KMeans cluster launch dialog (see KMeans Launch Dialog), select KMeans from the Options menu on the platform launch dialog. The figure uses the Cytometry.jmp data table.

KMeans Launch Dialog

The dialog has the following options:

Columns Scaled Individually

is used when variables do not share a common measurement scale and you do not want one variable to dominate the clustering process. For example, one variable might have values between 0 and 1000, and another might have values between 0 and 10. In this situation, use this option so that the first variable does not dominate the clustering process.
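To illustrate why scaling matters, the sketch below divides each column by its own sample standard deviation so that columns on very different scales contribute comparably to distances. The exact scaling the platform applies is not stated here; dividing by the column standard deviation, the function name scale_columns, and the data values are assumptions for the example.

```python
import statistics

def scale_columns(rows):
    """Divide each column by its own sample standard deviation (assumed scaling)."""
    columns = list(zip(*rows))
    sds = [statistics.stdev(col) for col in columns]
    return [tuple(v / s for v, s in zip(row, sds)) for row in rows]

# Made-up data: the first column ranges over hundreds, the second over units.
rows = [(100.0, 1.0), (500.0, 5.0), (900.0, 9.0)]
scaled = scale_columns(rows)
```

After scaling, both columns have a standard deviation of 1, so neither dominates a Euclidean distance.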

Johnson Transform

balances highly skewed variables or brings outliers closer to the center of the rest of the values.

K-Means Control Panel

As an example of KMeans clustering, use the Cytometry.jmp sample data table. Add the variables CD3 and CD8 as Y, Columns variables. Select the KMeans option. Click OK. The Control Panel appears, and is shown in Iterative Clustering Control Panel.

Iterative Clustering Control Panel

The Iterative Clustering red-triangle menu has the Save Transformed option, which saves the Johnson-transformed variables to the data table. This option is available only if the Johnson Transform option is selected on the launch dialog (KMeans Launch Dialog).

The Control Panel has these options:

Description of K-Means Clustering Control Panel Options

Declutter

Locates outliers in the multivariate sense. Plots are produced giving the distances between each point and that point's nearest neighbor, second nearest neighbor, and so on up to the kth nearest neighbor. You are prompted to enter k. Beneath the plots are options to create a scatterplot matrix, to save the distances to the data table, or to leave rows that you have excluded out of the clustering procedure. If an outlier is identified, you might want to exclude the row from the clustering process.
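The nearest-neighbor distances that Declutter plots can be approximated with a brute-force computation. The sketch below is only an illustration of the idea, assuming plain Euclidean distance; the function name knn_distances and the sample points are made up.

```python
import math

def knn_distances(points, k):
    """For each point, return the sorted distances to its 1st through kth nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result

# Made-up data: four points on a unit square plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
dists = knn_distances(points, k=2)

# The point with the largest nearest-neighbor distance is an outlier candidate.
outlier = max(range(len(points)), key=lambda i: dists[i][0])
```

A point whose nearest-neighbor distances are all much larger than those of the other points is a candidate for exclusion.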

Method

Chooses the Clustering Method. The available methods are:

•	KMeans Clustering is described in this section.

•	Normal Mixtures is described in Normal Mixtures.

•	Robust Normal Mixtures is described in Normal Mixtures.

•	Self Organizing Map is described in Self Organizing Maps.

Number of Clusters

Designates the number of clusters to form.

Optional range of clusters

Provides an upper bound for the number of clusters to form. If a number is entered here, the platform creates a separate analysis for every integer between Number of Clusters and this value.

Single Step

Enables you to step through the clustering process one iteration at a time using a Step button, or automate the process using a Go button.

Use within-cluster std deviations

If you do not use this option, all distances are scaled by an overall estimate of the standard deviation of each variable. If you use this option, distances are scaled by the standard deviation estimated for each cluster.
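The effect of the two scaling choices can be seen in a small sketch. It assumes a Euclidean distance in which each coordinate difference is divided by a standard deviation; the function name scaled_distance and every number below are hypothetical.

```python
import math

def scaled_distance(point, mean, sds):
    """Euclidean distance after dividing each coordinate difference by a standard deviation."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(point, mean, sds)))

point = (4.0, 4.0)
mean_a, mean_b = (0.0, 0.0), (10.0, 10.0)

overall_sds = (2.0, 2.0)   # hypothetical overall estimate per variable
sds_a = (1.0, 1.0)         # hypothetical estimate for a tight cluster A
sds_b = (4.0, 4.0)         # hypothetical estimate for a diffuse cluster B

# Scaling by the overall estimate: the point is closer to cluster A.
d_a_overall = scaled_distance(point, mean_a, overall_sds)
d_b_overall = scaled_distance(point, mean_b, overall_sds)

# Scaling by within-cluster estimates: the point is closer to cluster B.
d_a_within = scaled_distance(point, mean_a, sds_a)
d_b_within = scaled_distance(point, mean_b, sds_b)
```

With within-cluster scaling, a diffuse cluster can claim a point that would otherwise be assigned to a tighter cluster with a nearer mean.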

Shift distances using sampling rates

Assumes that you have a mix of unequally sized clusters, and that points should give preference to being assigned to larger clusters because there is a greater prior probability that a point is from a larger cluster. This option is an advanced feature. The calculations for this option are implied, but not shown, for normal mixtures.
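Since the calculation is not shown, the following is only a guess at the general idea, not the platform's formula. It assumes the squared distance to a cluster is shifted by minus twice the log of that cluster's proportion of the data, which is the form such a prior takes in a normal-mixture log-likelihood. The function name shifted_sq_distance and the numbers are hypothetical.

```python
import math

def shifted_sq_distance(sq_distance, proportion):
    """Shift a squared distance by -2*log(cluster proportion). Assumed form, for illustration only."""
    return sq_distance - 2.0 * math.log(proportion)

# A point equally far (squared distance 4.0) from a large and a small cluster.
to_large = shifted_sq_distance(4.0, proportion=0.9)
to_small = shifted_sq_distance(4.0, proportion=0.1)
```

Under this assumption, the shifted distance to the larger cluster is smaller, so an equidistant point is assigned to the larger cluster.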

K-Means Report

Clicking Go in the Control Panel (see Iterative Clustering Control Panel) produces the K-Means report, shown in K-Means Report.

K-Means Report

The report gives summary statistics for each cluster:

•	count of number of observations

•	means for each variable

•	standard deviations for each variable.
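The per-cluster summaries listed above can be reproduced from a set of cluster assignments. The sketch below is an illustration only; the function name cluster_summary, the use of the sample standard deviation, and the data are assumptions.

```python
import statistics
from collections import defaultdict

def cluster_summary(rows, labels):
    """Return count, column means, and column standard deviations for each cluster."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    summary = {}
    for lab, members in sorted(groups.items()):
        cols = list(zip(*members))
        summary[lab] = {
            "count": len(members),
            "means": [statistics.mean(c) for c in cols],
            # The sample standard deviation is undefined for a single observation.
            "std_devs": [
                statistics.stdev(c) if len(c) > 1 else float("nan") for c in cols
            ],
        }
    return summary

# Made-up observations and cluster assignments.
rows = [(1.0, 2.0), (3.0, 4.0), (10.0, 20.0), (12.0, 24.0)]
labels = [0, 0, 1, 1]
summary = cluster_summary(rows, labels)
```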

The Cluster Comparison report gives fit statistics to compare different numbers of clusters. For KMeans Clustering and Self Organizing Maps, the fit statistic is CCC (Cubic Clustering Criterion). For Normal Mixtures, the fit statistic is BIC or AICc. Robust Normal Mixtures does not provide a fit statistic.

K-Means Platform Options

These options are accessed from the red-triangle menus, and apply to the KMeans, Normal Mixtures, Robust Normal Mixtures, and Self Organizing Map methods.

Descriptions of K-Means Platform Options

Biplot

Shows a plot of the points and clusters in the first two principal components of the data. Circles are drawn around the cluster centers. The size of the circles is proportional to the count inside the cluster. The shaded area is the 90% density contour around the mean. Therefore, the shaded area indicates where 90% of the observations in that cluster would fall. Below the plot is an option to save the cluster colors to the data table.

Biplot Options

Contains options for controlling the Biplot.

•	Show Biplot Rays enables you to show or hide the biplot rays.

•	Biplot Ray Position enables you to position the biplot ray display. This is possible because biplot rays signify only the directions of the original variables in canonical space, and there is no special significance to where they are placed in the graph.