Main Methods

Click on a button corresponding to a predictive modeling main method. All processes require a wide data set. (See Tall and Wide Data Sets.) For a more thorough introduction to predictive modeling and these processes, see Introduction to Predictive Modeling.

Refer to the table below for key features and general guidance on these processes. You are encouraged to explore multiple processes and use the individual process links for a more detailed explanation of each.

Tip: When in doubt, there is no harm in trying several predictive modeling methods on your data. The Predictive Modeling Review enables you to standardize model parameters and specifications across methods. Additional tools for this purpose are available in the Model Comparisons submenu.

| Process | Uses SAS PROC(s) | Permits dependent variables of type | Particularly appropriate for data with these characteristics | Classification boundary shape for binary dependent variable | Other classification and process characteristics |
| --- | --- | --- | --- | --- | --- |
| Predictive Modeling Review | | | | | |
| Distance Scoring | DISTANCE | Nominal; Binary; Ordinal; Continuous | | Variable; depends on the distance metric | Nonparametric discriminant method. Tip: Diagonal Linear Discriminant Analysis can be performed via the Euclidean Distance Metric. |
| General Linear Model Selection | GLMSELECT | Binary; Ordinal; Continuous | | Linear | Flexible; many model selection methods available; many inclusion and stopping criteria available |
| Partial Least Squares | PLS | Binary; Ordinal; Continuous | More variables than observations (wide data sets); multicollinearity among predictor variables exists | Linear or quadratic | Linear regression model; simultaneously models variability in both dependent and predictor variables |
| Partition Trees | GENESELECT | Nominal; Binary; Ordinal; Continuous | Can be represented as a hierarchy of partitions | Step function | Simple tree-based rule sets from optimal splitting relationships between dependent and predictor variables are used |
| Quantile Regression Selection | QUANTREG | Binary; Ordinal; Continuous | The median or particular quantiles of the dependent variables are better measures of central tendency than the mean | Linear or quadratic | Flexible; many model selection methods available; many inclusion and stopping criteria available; model robustness; data robustness to outliers |
| Radial Basis Machine | GLIMMIX | Binary; Ordinal; Continuous | More variables than observations (wide data sets) | Any shape | Dimensions of calculations are based on the number of observations, rather than the number of variables |
| Ridge Regression | MIXED | Binary; Ordinal; Continuous | More variables than observations (wide data sets); multicollinearity among predictor variables exists; continuous dependent variable | Linear or quadratic | Computes Best Linear Unbiased Predictions (BLUPs) of the responses based on a mixed model; shrinks (regresses) estimates toward a common mean |
| Discriminant Analysis | STEPDISC; DISCRIM | Nominal; Binary; Ordinal | Can be represented by a multivariate normal distribution with known classes; fewer variables than observations | Linear, parabolic, or S-shaped | Based on Fisher discriminant analysis |
| K Nearest Neighbors | DISCRIM | Nominal; Binary; Ordinal | Fewer variables than observations | Any shape | Nonparametric discriminant method; predictions based on the set of k training observations that are closest in feature space distance (instance-based learning) |
| Logistic Regression | LOGISTIC | Nominal; Binary; Ordinal | Fewer variables than observations | S-shaped | Data fit to a logistic curve using a logit link function. Caution: This process can take a long time to run, depending on the number of predictor variables and the speed of your machine. |
| Life Regression | LIFEREG | Binary; Time-to-event; Censor | Time-to-event data; data that follows one of the time-to-event distributions (Weibull, for example) | | Fits parametric models to time-to-event data; predictor reduction methods can be used to trim a large set of predictors |
| Proportional Hazards Regression | PHREG | Binary; Ordinal; Continuous | Survival data with time-to-event variable (censor variable optional); fewer variables than observations | Exponential family | Uses a Cox proportional hazards model; many model selection methods available. Caution: This process can be computationally intensive for large data sets. |
| Genomic BLUP | MIXED; HPMIXED | Binary; Continuous | Continuous dependent variable | Linear or quadratic | Computes Best Linear Unbiased Predictions (BLUPs) of the responses based on a mixed model; shrinks (regresses) estimates toward a common mean |
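
The K Nearest Neighbors entry above describes predictions based on the k training observations closest in feature space. The following Python sketch illustrates that idea only; it is not how the K Nearest Neighbors process or PROC DISCRIM is implemented. The function name `knn_predict`, the Euclidean metric, the majority-vote rule, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training observation
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest observations (stable sort preserves input order on ties)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those k neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two predictors and a binary dependent variable
train_X = [(1.0, 1.2), (0.8, 0.9), (1.1, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["control", "control", "control", "case", "case", "case"]

print(knn_predict(train_X, train_y, (1.0, 1.0)))
print(knn_predict(train_X, train_y, (3.0, 3.0)))
```

Because each prediction depends only on local neighbors, the resulting classification boundary can take any shape, which matches the table entry for this process.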

Predictive Modeling Review

Click to set up a predictive modeling review, which compares how well different models, applied to one or more dependent variables, make predictions under the same conditions. The models can be compared using cross validation, test sets, or learning curves.
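
To illustrate what comparing models "under the same conditions" with cross validation can look like, here is a minimal Python sketch. It is not the Predictive Modeling Review itself. The two toy models, the helper names (`k_fold_indices`, `cross_validated_rmse`), the RMSE criterion, and the simulated data are all hypothetical. The key point is that both models are scored on identical held-out folds.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices once and split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def mean_model(train_x, train_y):
    """Baseline: always predict the mean of the training responses."""
    mu = statistics.fmean(train_y)
    return lambda x: mu

def linear_model(train_x, train_y):
    """Ordinary least squares with a single predictor."""
    mx, my = statistics.fmean(train_x), statistics.fmean(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def cross_validated_rmse(fit, xs, ys, folds):
    """Root mean squared error pooled over all held-out folds."""
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        tr_x = [x for i, x in enumerate(xs) if i not in held]
        tr_y = [y for i, y in enumerate(ys) if i not in held]
        predict = fit(tr_x, tr_y)  # fit on training rows only
        sq_errors += [(predict(xs[i]) - ys[i]) ** 2 for i in held_out]
    return statistics.fmean(sq_errors) ** 0.5

# Hypothetical simulated data: a linear trend plus Gaussian noise
rng = random.Random(1)
xs = [i / 2 for i in range(20)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# The same folds are reused for every model so the comparison is fair
folds = k_fold_indices(len(xs), k=5)
results = {
    "mean": cross_validated_rmse(mean_model, xs, ys, folds),
    "linear": cross_validated_rmse(linear_model, xs, ys, folds),
}
for name, rmse in results.items():
    print(f"{name}: RMSE = {rmse:.3f}")
```

Fixing the fold assignment before fitting any model is the design choice that matters here: differences in the reported error then reflect the models rather than the partitioning.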

See Predictive Modeling for other subcategories.