Explore Outliers Utility

Exploring and understanding outliers in your data is an important part of analysis. Outliers in data can be due to mistakes in data collection or reporting, measurement systems failure, or the inclusion of error or missing value codes in the data set. The presence of outliers can distort estimates. Therefore, any analyses that are conducted are biased toward those outliers. Outliers also inflate the sample variance. Sometimes retaining outliers in data is necessary, however, and removing them could underestimate the sample variance and bias the data in the opposite direction.

Whether you remove or retain outliers, you must locate them. There are many ways to visually inspect for outliers. For example, box plots, histograms, and scatter plots can sometimes easily display these extreme values. See the Visualizing Your Data chapter of the Discovering JMP book for more information.

The Explore Outliers tool provides four different options to identify, explore, and manage outliers in your univariate or multivariate data.

Quantile Range Outliers

Uses the quantile distribution of each column to identify outliers as extreme values. This tool is useful for discovering missing value or error codes within the data. This is the recommended method to begin exploring outliers in your data.

Robust Fit Outliers

Finds robust estimates of the center and spread of each column and identifies outliers as those far from those values.

Multivariate Robust Outliers

Uses the Multivariate platform with Robust option to find outliers based on the Mahalanobis distance from the estimated robust center.

Multivariate k-Nearest Neighbor Outliers

Finds outliers as values far from their k-nearest neighbors.

Example of the Explore Outliers Utility

The Probe.jmp sample data table contains 387 characteristics (the Responses column group) measured on 5800 semiconductor wafers. The Lot ID and Wafer Number columns uniquely identify the wafer. You are interested in identifying outliers within a select group of columns of the data set. Use the Explore Outliers utility to identify outliers that can then be examined using the Distribution platform.

1.	Select Help > Sample Data Library and open the Probe.jmp sample data table.

2.	Expand the Responses column group and select columns VDP_M1 through VDP_SICR. You should have 14 columns selected (see Selected Columns).

Selected Columns

3.	Select Cols > Modeling Utilities > Explore Outliers.

4.	Click Quantile Range Outliers.

The Extremes report shows each column and lists the number and identity of the outliers found.

5.	Select Show only columns with outliers to limit the list of columns to only those that contain outliers.

Note that several columns contain outlier values of 9999. Many industries use nines as a missing value code.

6.	In the Nines report, select each column.

7.	Click Add Highest Nines to Missing Value Codes.

A JMP Alert indicates that you should use the Save As command to preserve your original data.

8.	Click OK.

9.	In the Quantile Range Outliers report, click Rescan.

10.	Select Restrict search to integers.

In continuous data, integer values are often error codes or other coded data values. Notice that no additional error codes are included in this set of columns.

11.	Deselect Restrict search to integers.

Examine the Data

1.	Select all of the remaining columns in the Quantile Range Outliers report.

2.	Click Select Rows.

3.	Select Analyze > Distribution.

4.	Assign the selected columns to the Y, Columns role. Because you selected these column names in the Quantile Range Outliers report, they are already selected in the Distribution launch window.

5.	Click OK.

Distribution of Columns with Outliers Selected shows a simplified version of the report.

Distribution of Columns with Outliers Selected

In columns VDP_M1 and VDP_PEMIT, notice that the selected outliers are somewhat close to the majority of data. For the rest of the columns, the selected outliers appear distant enough to exclude them from your analyses.

Refine Excluded Outliers

1.	In the Quantile Range Outliers report, hold Ctrl and deselect columns VDP_M1 and VDP_PEMIT.

2.	With the remaining columns selected in the report, click Exclude Rows.

3.	Change Q to 20.

4.	Click Rescan.

5.	Select columns VDP_M1 and VDP_PEMIT in the report. Click Select Rows.

Reexamine the Data

1.	Examine the Distributions report again. Notice the selected outliers are now separate enough from the majority of the data to select and exclude them from your analyses.

2.	In the Quantile Range Outliers report, click Exclude Rows.

3.	In the Distributions report, click the red triangle menu next to Distributions.

4.	Select Script > Redo Analysis.

Distributions of Columns with Outliers Excluded shows a simplified version of the report.

Distributions of Columns with Outliers Excluded

The displays of the distributions of the data are now more informative without the outliers.

Launch the Explore Outliers Utility

Note: Only continuous columns are analyzed using the Explore Outliers utility. Columns that are not continuous are ignored.

Launch the Explore Outliers utility by first selecting the columns of interest, and then selecting Cols > Modeling Utilities > Explore Outliers.

Quantile Range Outliers

The Quantile Range Outliers method of outlier detection uses the quantile distribution of the values in a column to locate the extreme values. Quantiles are useful for detecting outliers because there is no distributional assumption associated with them. Data are simply sorted from smallest to largest. For example, the 0.2 quantile (the 20th percentile) is the value below which 20% of the values fall. Extreme values are found using a multiplier of the interquantile range, the distance between two specified quantiles. For more details about how quantiles are computed, see Statistical Details for Quantiles in Distributions.

The Quantile Range Outliers utility is also useful for identifying missing value codes stored within the data. As noted earlier, in some industries, missing values are entered as nines (such as 999 and 9999). This utility finds any nines greater than the upper quartile as suspected missing value codes. The utility then enables you to add those missing value codes as a column property in the data table.

Quantile Range Outliers Options

The Quantile Range Outliers panel enables you to specify how outliers are to be calculated and how you want to manage them. Quantile Range Outliers Window shows the default Quantile Range Outliers window.

Quantile Range Outliers Window

An outlier is considered any value more than Q times the interquantile range from the lower and upper quantiles. You can adjust the value of Q and the size of the interquantile range.

Tail Quantile

The probability for the lower quantile that is used to calculate the interquantile range. The probability of the upper quantile is considered 1 - Tail Quantile. For example, a Tail Quantile value of 0.1 means that the interquantile range is between the 0.1 and 0.9 quantiles of the data. The default value is 0.1.

Q

The multiplier that helps determine values as outliers. Outliers are values more than Q times the interquantile range past the Tail Quantile and 1 - Tail Quantile values. Large values of Q provide a more conservative set of outliers than small values. The default is 3.

Restrict search to integers

Restricts outlier values to only integer values. This setting limits the search for outliers in order to find industry-specific missing value codes and error codes.

Show only columns with outliers

Limits the list of columns in the report to those that contain outliers.
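The Tail Quantile and Q settings described above can be illustrated with a short Python sketch. It is not the utility's implementation: the function names are hypothetical, and the linear-interpolation quantile is an assumption that might differ from the quantile definition JMP uses (see Statistical Details for Quantiles in Distributions). The sketch only shows how the low and high thresholds follow from the two settings.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile of pre-sorted data (an assumed definition)."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def quantile_range_outliers(values, tail_quantile=0.1, q=3.0, integers_only=False):
    """Return (low threshold, high threshold, outliers) for one column of values."""
    data = sorted(values)
    lower = quantile(data, tail_quantile)        # lower quantile
    upper = quantile(data, 1.0 - tail_quantile)  # upper quantile
    spread = upper - lower                       # interquantile range
    low_threshold = lower - q * spread
    high_threshold = upper + q * spread
    outliers = [v for v in values if v < low_threshold or v > high_threshold]
    if integers_only:
        # Mimics the idea of "Restrict search to integers": keep whole numbers only.
        outliers = [v for v in outliers if float(v).is_integer()]
    return low_threshold, high_threshold, outliers


# Made-up readings with one value that looks like a missing value code.
readings = [4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5.0, 5.2, 5.3, 5.5, 9999]
low, high, found = quantile_range_outliers(readings)
print(low, high, found)
```

With these made-up readings, only the value 9999 falls outside the thresholds. Raising q widens the thresholds and flags fewer values, which is the sense in which large values of Q are more conservative.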

After the report is displayed using your specifications, there are many ways to act on these extreme values. You can select the outliers in a column by selecting the specified column in the Quantile Range Outliers report.

Select Rows

Selects the rows of outliers in the selected columns in the data table.

Exclude Rows

Turns on the exclude row state for the selected rows. Click Rescan to update the Quantile Range Outliers report.

Color Cells

Colors the cells of the selected outliers in the data table.

Color Rows

Colors the rows containing outliers for the selected columns in the data table.

Add to Missing Value Codes

Adds the selected outliers to the missing value codes column property. Use this option to identify known missing value or error codes within the data. Missing value and error codes are often integers and are sometimes either a positive or negative series of nines. Click Rescan to update the Quantile Range Outliers report.

Change to Missing

Changes the outlier value to a missing value in the data table. Use caution when changing values to missing. Change values to missing only if the data are known to be invalid or inaccurate. Click Rescan to update the Quantile Range Outliers report.

Rescan

Rescans the data after outlier actions have been taken.

Close

Closes the Quantile Range Outliers panel.

Quantile Range Outliers Report

The Quantile Range Outliers report lists all columns with the outliers found using the specified options. The report shows values for the upper and lower quantiles along with their low and high thresholds. Values outside of these threshold limits are considered outliers. The number of outliers in each column is indicated. The values of each outlier are listed in the last column of the report. Outliers that occur more than once in a column are listed with their count in parentheses. To remove columns without outliers from the report, select Show only columns with outliers.

There are several things to look for when reading this report.

•	Error codes. For some continuous data, suspiciously high integer values are likely to be error codes. For example, if your upper and lower quantile values are all less than 0.5, outliers such as 1049 or -777 are likely to be error codes.

•	Zeros. Sometimes zeros can indicate missing values. If the majority of your data is reasonably large and you notice zeros as outliers, they are likely to be due to missing data.

Nines Report

The Nines report within the Quantile Range Outliers window shows a list of columns that contain probable missing value codes. These missing value codes are a series of nines (usually 9999) and are the highest number that is all nines and also higher than the upper quantile. If the count is high, it is likely that these outliers are actually missing value codes. If the count is very low, you should explore further to determine whether the value is an outlier or a missing value code. The Nines Report includes the upper quantile value.

This report is displayed only when probable missing value codes are identified.

Add Highest Nines to Missing Value Codes

Adds the selected outlier values to the missing value codes column property. You must click Rescan to update the Quantile Range Outliers report.

Change Highest Nines to Missing

Replaces the selected outlier values with missing values in the data table.
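The rule behind the Nines report (the highest value that is all nines and also higher than the upper quantile) can be sketched in Python as follows. The function name and the sample column are hypothetical, and the utility's actual search might differ in detail. The sketch returns the suspected code together with its count, because a high count suggests a missing value code rather than a true outlier.

```python
def highest_nines_code(values, upper_quantile):
    """Return (code, count) for the largest all-nines value above upper_quantile.

    Returns None when the column contains no such value.
    """
    counts = {}
    for v in values:
        # Only whole numbers above the upper quantile are candidates.
        if v > upper_quantile and float(v).is_integer():
            digits = str(int(v))
            if set(digits) == {"9"}:
                counts[int(v)] = counts.get(int(v), 0) + 1
    if not counts:
        return None
    code = max(counts)
    return code, counts[code]


# Made-up column: measurements below 1 mixed with 999 and 9999 codes.
column = [0.42, 0.38, 9999, 0.51, 999, 9999, 0.47, 9999]
result = highest_nines_code(column, upper_quantile=0.51)
print(result)
```

In this made-up column, 9999 is the highest all-nines value and occurs three times, so it is the value the sketch reports. The lower code 999 is not reported, which mirrors the report's focus on the highest nines.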

Note: The first time you choose an action (such as Change to Missing or Exclude Rows) to change your data, the alert window warns you to use the Save As command to save your data table as a new file to preserve a copy of your original data. When this window appears, click OK. If you decide to save your new data file, select File > Save As and save the file with a new name.

Robust Fit Outliers

Robust estimates of parameters are less sensitive to outliers than non-robust estimates. Robust Fit Outliers provides several types of robust estimates of the center and spread of your data to determine those values that can be considered extreme. Robust Fit Outliers Window shows the default Robust Fit Outliers window.

Robust Fit Outliers Window

Robust Fit Outliers Options

Given a robust estimate of the center and spread, outliers are defined as those values that are more than K times the robust spread from the robust center. The Robust Fit Outliers window provides several options for calculating the robust estimates and setting the multiplier K, and it provides tools to manage the outliers found.

Huber

Uses Huber M-Estimation to estimate center and spread. This option is the default. See Huber and Ronchetti (2009).

Cauchy

Assumes a Cauchy distribution to calculate estimates for the center and spread. Cauchy estimates have a high breakdown point and are typically more robust than Huber estimates. However, if your data are separated into clusters, the Cauchy distribution tends to consider only the half of the data that makes closer clusters, ignoring the rest.

Quartile

Uses the interquartile range (IQR) to estimate the spread. The estimate for the center is the median. The estimate for spread is the IQR divided by 1.34898. Dividing the IQR by this factor makes the spread correspond to one standard deviation if the data were normally distributed.

K

The multiplier that determines outliers as K times the spread away from the center. Large values of K provide a more conservative set of outliers than small values. The default is 4.

Show only columns with outliers

Limits the list of columns in the report to those that contain outliers.
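Of the three estimation options above, the Quartile option is the simplest to write out. The following Python sketch uses the median as the center and the IQR divided by 1.34898 as the spread, and flags values more than K spreads from the center. The function name is hypothetical, and the "inclusive" quartile method from the Python standard library is an assumption that might not match how JMP computes quartiles. The Huber and Cauchy options are not shown.

```python
import statistics


def quartile_robust_outliers(values, k=4.0):
    """Return (center, spread, outliers) using the median and a scaled IQR."""
    center = statistics.median(values)
    # Quartiles by linear interpolation between data points (assumed method).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = (q3 - q1) / 1.34898  # comparable to one standard deviation for normal data
    outliers = [v for v in values if abs(v - center) > k * spread]
    return center, spread, outliers


# Made-up sample with one value far from the rest.
sample = [9.8, 9.9, 10.0, 10.0, 10.1, 10.2, 10.3, 14.0]
center, spread, flagged = quartile_robust_outliers(sample)
print(center, spread, flagged)
```

Because the median and IQR are barely affected by the value 14.0, that value stands out clearly against the robust spread. A non-robust mean and standard deviation would both be pulled toward it.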

Once the report is displayed using your specifications, there are many ways to explore these extreme values. You can select the outliers in a row by selecting the specified row in the Robust Estimates and Outliers report.

Select Rows

Selects the rows containing outliers for the selected columns in the data table.

Exclude Rows

Sets the Exclude Row state for outliers in the selected columns in the data table. Click Rescan to update the Robust Estimates and Outliers report.

Color Cells

Colors the cells of the selected outliers in the data table.

Color Rows

Colors the rows containing outliers for the selected columns in the data table.

Add to Missing Value Codes

Adds the selected outliers to the missing value codes column property for the selected columns. Use this option to identify known missing value or error codes within the data. Click Rescan to update the Robust Estimates and Outliers report.

Change to Missing

Changes the outlier value to a missing value in the data table. Click Rescan to update the Robust Estimates and Outliers report.

Rescan

Rescans the data after outlier actions have been taken.

Close

Closes the Robust Fit Outliers panel.

Multivariate Robust Outliers

The Multivariate Robust Fit Outliers tool uses the Robust option in the Multivariate platform to examine the relationships between multiple variables. For more information about how the Multivariate platform works, see Correlations and Multivariate Techniques in the Multivariate Methods book.

Outlier Analysis

The Outlier Analysis calculates the Mahalanobis distances from each point to the center of the multivariate normal distribution. This measure relates to contours of the multivariate normal density with respect to the correlation structure. The greater the distance from the center, the higher the probability that it is an outlier. For more information about the Mahalanobis distance and other distance measures, see the Multivariate Platform Options section in the Correlations and Multivariate Techniques chapter of the Multivariate Methods book.

After the rows are excluded, you are given the option to either rerun the analysis or close the utility. Rerunning the analysis recalculates the center of the multivariate distribution without those excluded rows. Note that unless you hide the excluded rows in the data table, they still appear in the graph.

You can save the distances to the data table by selecting the Save option from the Mahalanobis Distances red triangle menu.
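To make the Mahalanobis distance concrete, the following Python sketch computes it for two variables. It is a simplification under stated assumptions: it uses the ordinary sample mean and covariance, not the robust estimates that the Multivariate platform computes, and it handles only two columns so that the inverse covariance matrix can be written out by hand. The function name and data are hypothetical.

```python
import math


def mahalanobis_distances_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean.

    Uses the classical (non-robust) mean and covariance as a stand-in.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse expanded
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Made-up data: six points near the line y = x and one point that breaks the pattern.
pairs = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 5.9), (3, 6.0)]
dist = mahalanobis_distances_2d(pairs)
print(dist.index(max(dist)))
```

The last point has unremarkable x and y values taken one at a time, but it has the largest distance because it departs from the correlation structure of the other points. This is the kind of observation that univariate methods can miss.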

Multivariate Robust Outliers Mahalanobis Distance Plot

Multivariate Robust Outliers Mahalanobis Distance Plot shows the Mahalanobis distances of 16 different columns. The plot contains an upper control limit (UCL) of 4.82. This UCL is meant to be a helpful guide to show where potential outliers might be. However, you should use your own discretion to determine which values are outliers. For more details about this upper control limit (UCL), see Mason and Young (2002).

Multivariate with Robust Estimates Options

The red triangle menu for Multivariate with Robust Estimates contains numerous options to analyze your multivariate data. For a list and description of these options, see the Multivariate Platform Options section in the Correlations and Multivariate Techniques chapter of the Multivariate Methods book.

Multivariate k-Nearest Neighbor Outliers

The basic approach of outlier detection is to consider points distant from other points as outliers. One way of determining the distance of a point to most points is identifying the distance to its k-nearest neighbors. The Multivariate k-Nearest Neighbor Outliers utility displays the Euclidean distances between each point and that point’s nearest K neighbors, where K = 1, 2, 3, 5, ..., k, skipping values by the Fibonacci sequence to avoid too many plots.

This approach is sensitive to the value of k. If k is too small, then a small number of nearby outliers can decrease the distance displayed and hide the outliers. If k is too large, then it is possible for points that appear to be outliers to actually be within natural clusters with smaller than k data points. In other words, small k can under-identify outliers; large k can over-identify outliers. Higher values of k, however, can reduce the impact of insignificant variables included in the analysis.

To launch the utility, select Multivariate k-Nearest Neighbor Outliers from the Commands section of the Explore Outliers window. Specify the value of k (the default is 8) and click OK.

Identifying Outliers with k-Nearest Neighbor Outliers

The K Nearest Neighbors report shows the distance between each point and that point’s nearest neighbor, the second nearest neighbor, up to the kth nearest neighbor. The data points that consistently have a large distance from their neighbors are likely to be outliers.
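The distances that this report plots can be sketched in Python as follows. The function names are hypothetical, and the brute-force search is for illustration only. Appending k itself when it is not a Fibonacci number is an assumption made here, not documented behavior of the utility.

```python
import math


def fibonacci_ks(k):
    """K values 1, 2, 3, 5, 8, ... up to k (k is appended if absent; an assumption)."""
    ks, a, b = [], 1, 2
    while a <= k:
        ks.append(a)
        a, b = b, a + b
    if ks and ks[-1] != k:
        ks.append(k)
    return ks


def knn_distances(points, k):
    """For each point, map K to the Euclidean distance to its K-th nearest neighbor."""
    ks = fibonacci_ks(min(k, len(points) - 1))
    result = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append({K: dists[K - 1] for K in ks})
    return result


# Made-up data: a tight cluster of five points and one distant point.
cloud = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
table = knn_distances(cloud, k=3)
print(fibonacci_ks(8))
print(table[5])
```

The distant point has a large distance at every K, while the clustered points have small distances. If two distant points sat next to each other, their distance at K = 1 would be small, which illustrates why a value of k that is too small can hide outliers.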

You can select the suspected outliers using your mouse in the plot. Explore these selected data points further by clicking Scatterplot Matrix.

If you determine the selected data points are outliers, you can exclude them from further analysis by clicking Exclude Selected Rows.

After the rows are excluded, you are given the option to either rerun the analysis or close the utility. Rerunning the analysis recalculates the k-nearest neighbors for all points except those excluded rows. Note that unless you hide the excluded rows in the data table, they still appear in the graph.

Additional Examples of the Explore Outliers Utility

Multivariate k-Nearest Neighbor Outliers Example

The Water Treatment.jmp data set contains daily measurement values of 38 sensors in an urban waste water treatment plant. You are interested in exploring these data for potential outliers. Potential outliers could include sensor failures, storms, and other situations.

1.	Select Help > Sample Data Library and open Water Treatment.jmp.

2.	Select the columns in the Sensor Measurements column group.

3.	Select Cols > Modeling Utilities > Explore Outliers > Multivariate k-Nearest Neighbor Outliers.

4.	Enter 13 for k-nearest neighbors.

5.	Click OK.

Outliers in Multivariate k-Nearest Neighbor Outliers Example

Notice the three extreme outliers selected in the k-Nearest Neighbors plots in Outliers in Multivariate k-Nearest Neighbor Outliers Example. Each of these three data points corresponds to a date where the secondary settler in the water treatment plant was reported as malfunctioning. Because these three data points are due to faulty equipment, exclude them from future analyses.

6.	Select the three extreme outliers and click Exclude Selected Rows.

You are prompted to Rerun the utility or Close the window.

7.	Click Rerun.

8.	Type 13 for k-nearest neighbors.

9.	Click OK.

Outliers in Multivariate k-Nearest Neighbors Example

Now locate the two light-green outliers close to row 400. Notice how they tend to stay close to each other as k increases. Even though these data points have a relatively high Distance to 13 closest, these two points have been identified as solids overloads to the water treatment plant. Since these two points are due to real situations, do not exclude them. However, you might want to keep them in mind during future analyses.

Outliers in Multivariate k-Nearest Neighbors Example

Now locate the bright pink outlier near row 375. In order to understand this data point, you need to understand its position in relation to other variables.

10.	Select the outlier near row 375.

11.	Click Scatterplot Matrix.

Looking at a scatterplot matrix of this size is not always helpful. Here, you are trying to determine whether this data point is extreme in one or more attributes.

12.	Scroll down to the bottom of the matrix. Notice how extremely low the RD-SED-G value is for this point. Not only does this point have an extremely low RD-SED-G value, it is also one of the few data points that have such a low value.

13.	Scroll to the right to look at the relationship between RD-SED-G and SED-S. You can see that this point has a low RD-SED-G and high SED-S value. There does appear to be a relationship between these two variables. It is uncommon to have a low RD-SED-G or a high SED-S value, but it is not impossible. There is not enough evidence to exclude this point from your analyses.