Multivariate Inliers and Outliers

Reports | Multivariate Inliers and Outliers

Multivariate Inliers and Outliers

Report Results Description

Report Options

This report calculates Mahalanobis distance based on available data, using the equation , to identify subject inliers and outliers in multivariate space from the multivariate mean. Refer to the JMP documentation on Mahalanobis Distance Measures for statistical details. It also generates results by site to see which sites are extreme in this multivariate space.

Mahalanobis distance is plotted on the log scale to allow for easier examination of small scores. The reference line is derived from a transformation of the mean of the approximate chi-square distribution.

This report attempts to use as much data as possible. Along with sex and age, it takes all findings test codes by visit number and time number (if available), as well as frequencies of all event and intervention codes per subject. Of course, doing so can lead to missing data particularly for studies that do not appear to have a fixed number of visits or with lots of dropouts. Because Mahalanobis distance cannot be calculated with lots of missing data present, there is an option to delete variables with at least X% of missing data¹ based on the selected population and filters (default of 5%). Of remaining variables, scores are computed for those subjects with complete data. The general strategy of this report is to use as many variables as possible, while letting a few early dropouts fall out of the analysis.

Report Results Description

For the Nicardipine example shown here, 17 out of 512 variables have missing data rates below 5% and are kept. 15 of these variables have missing data, which cause Mahalanobis distance to not be calculated on 50 subjects.

Running this report for Nicardipine using default settings generates the Report shown below.

The report initially shows two sections.

Mahalanobis Distance

Presents plots of Mahalanobis distance of all subjects (distance is from the multivariate mean), colored by study site, and Box Plots presented by sites.

This section contains the following elements:

•

One JMP Mahalanobis Distances plot to identify significant outliers. In the Mahalanobis Distances plot shown above, the distance of each specific observation from the mean center of the other observations from the site is plotted. Those points residing above the upper 95% confidence interval (outliers) or below the lower 95% confidence interval (inliers) correspond to those rows that warrant the most attention due to their significant distance from the mean center of all other observations. The region between 95% and 99.7% is referred to as a moderate inlier or outlier depending of it is in the lower or upper bounds, respectively. Beyond 99.7 is considered a severe outlier or inlier.

•

One Box Plot.

The second figure replaces points seen in the Mahalonobis plot with box plots by study site. This allows the analyst to better determine how different sites are from the multivariate mean.

•

One Data Filter.

Enables you to subset subjects based on country of origin and study site. Refer to Data Filter for more information.

Missing Data

This section has the following elements:

•

Two Histograms.

Details variables that contain missing data that prevented Mahalanobis distance from being calculated for certain subjects (Flag = 1) or variables that were dropped from analysis based on the option Remove variables from analysis with a missing data percentage of at least:. By default, variables with 5% or more of missing data are not used in the calculation of Mahalanobis Distance. Data are presented either as counts (left) or percentages (right) reflect the number of values that are missing for each variable. Opening the data table shows the percentage of missing data for each test.

•

One Data Filter.

Enables you to subset histograms based on date characteristics. Additional terms can be added from the data table using the AND and OR buttons of the filter.

Action Buttons

Action buttons, provide you with an easy way to drill down into your data. The following action buttons are generated by this report:

•

Profile Subjects: Select subjects and click to generate the patient profiles. See Profile Subjects for additional information.

•

Show Subjects: Select subjects and click to open the ADSL (or DM if ADSL is unavailable) of selected subjects.

•

Create Subject Filter: Select subjects and click to create a subject filter. Follow-up analyses are subset to these subjects if the Select saved subject Filter is applied in the dialog.

•

Variable Contributions to Distance: Select subjects and click to create Pareto plots for each subject that explain how much each variable contributes to a patient being an outlier, with a total possible of 100% for each subject. Analysis of the Pareto plots can enable you to view how selected subjects are extreme for the selected covariates.

•

Select Percentage of Subjects Exceeding Threshold: Click to select a proportion of patients based on exceeding a threshold.

General

•

Click to view the associated data tables. Refer to

Output includes one summary data set (named csass_sum_XXX², by default) containing one record per subject with pre-dosing data, one data set of all pairwise distances within the covariate subgroups (named csass_alldist_XXX, by default), one data set containing minimum pairwise distances for each covariate subgroup (named csass_mindist_XXX), by default), one data set per covariate subgroup containing pairwise distances (named csass_p_Y_XXX, by default, where Y is indexed 1 to the number of covariate subgroups) and one data set per covariate subgroup containing the distance matrix of subjects within the covariate subgroup (named csass_Y_XXX, by default, where Y is indexed 1 to the number of covariate subgroups).

Variable names for Findings data are concatenated with the abbreviation of the Findings test and the visit number (V). For example, DIABP_V2 is the diastolic blood pressure at visit 2. If there are multiple measurements at Visit 2, then it is the average. If there are multiple time points on a single visit, a time number is appended. For example, DIABP_V2_T1 would be the diastolic blood pressure at time point 1 at visit 2 (or the average, if multiple measurements are taken); DIABP_V2_T2 would be the diastolic blood pressure at time point 2 at visit 2.

•

Click to generate a standardized pdf- or rtf-formatted report containing the plots and charts of selected sections.

•

Click to take notes, and store them in a central location. Refer to Add Notes for more information.

•

Click to read user-generated notes. Refer to View Notes for more information.

•

Click the Options arrow to reopen the completed report dialog used to generate this output.

•

Click the gray border to the left of the Options tab to open a dynamic report navigator that lists all of the reports in the review. Refer to Report Navigator for more information.

Report Options

General

Demographics

The Include Age and Include Sex are used to include age and sex (when these are not matching criteria) in the analysis for computing distance.

Findings

You can opt to Include findings domains to analyze and summarize findings by visit and time point number. If you decide to include finding results, you must specify which Findings Tests to include and whether to Analyze findings using: using original or standard units. Check the Remove unscheduled visits option to included findings from scheduled visits only; leave the option unchecked to include findings from all visits.

Interventions

You can opt to Include intervention domains to analyze the frequency of each intervention. By default, all intervention domains are included. However, you can restrict the analysis to domains specified using the Subset of Domains to Analyze for Interventions option.

Events

You can opt to Include event domains to analyze the frequency of each event. By default, all event domains are included. However, you can restrict the analysis to domains specified using the Subset of Domains to Analyze for Events option. You can use the following options to exclude events:

The Include events or interventions experienced by at least this percent of patients: option includes all events that are experienced by the specified threshold percentage of subjects; events not meeting this threshold are excluded.

Checking the Remove all variables with missing values option excludes all variables with one or more missing values. The Remove variables from analysis with a missing data percentage of at least: option enables you to be more permissive with regards to missing data. When the Remove all variables with missing values option is not checked, you can specify a threshold value for exclusion; only those variables whose percentage of missing values exceeds this threshold are excluded. All other variables are included in the analysis.

Site Summary

Use the Summarize sites with at least this many subjects: option to specify the number of subjects a site must have in the selected population in order to include it in the summary.

Filtering the Data:

Filters enable you to restrict the analysis to a specific subset of subjects based on values within variables. You can also filter based on population flags (Safety is selected by default) within the study data.

See Select the analysis population, Select saved subject Filter³, Additional Filter to Include Subjects, and Subset of Visits to Analyze for Findings for more information.

1

Missing data is show in the tabular data pane that can be viewed by clicking the drill down option. A Missing Flag value of 1 indicates variables with a missing rate of <5%. A Missing Flag value of 2 indicates variables with a missing rate of >5%, based on the default dialog option of 5%.

2

The _XXX designation is used to designate a one- to three-digit number that is added sequentially to prevent overwriting of existing data sets.

3

Subject-specific filters must be created using the Create Subject Filter report prior to your analysis.