Introduction to Predictive Modeling

What is predictive modeling?

Predictive modeling, also called predictive analytics, uses data and statistical algorithms to predict what is likely to happen next, given the current process and environment. It sits between descriptive analytics (what happened) and prescriptive analytics (what to do about it) on the analytics spectrum.

What are some widely used predictive modeling methods?

Predictive models fall into two general categories: supervised and unsupervised. With supervised methods, you are interested in predicting values of an output variable based on a collection of input variables. Methods for supervised learning include multiple linear regression, logistic regression, decision trees, neural networks, and others.
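
As a rough illustration of the supervised case, the sketch below fits the simplest such model, a straight line with a single input, by ordinary least squares in plain Python. The data and the names temperature, percent_yield, and fit_simple_linear_regression are made up for illustration; multiple linear regression extends the same idea to many inputs.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) that minimize squared error for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least squares estimates for one input variable.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: one input setting and the observed output for five batches.
temperature = [1.0, 2.0, 3.0, 4.0, 5.0]
percent_yield = [3.0, 5.0, 7.0, 9.0, 11.0]  # constructed as 1 + 2 * temperature

intercept, slope = fit_simple_linear_regression(temperature, percent_yield)

# Use the fitted line to predict the output at a new input value.
predicted = intercept + slope * 6.0
print(intercept, slope, predicted)
```

The key point is that the output values are known when the model is fit, so the fit can be judged by how closely predictions match them.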

With unsupervised methods, you study a collection of variables with no known or observed response variables. Methods for unsupervised learning include principal components analysis, cluster analysis, factor analysis, and others.
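
For contrast, here is a minimal sketch of one unsupervised method, K-means clustering, again in plain Python. The points, the starting centers, and the function names are assumptions chosen for illustration; practical implementations usually pick starting centers at random and rerun the algorithm several times.

```python
def assign(points, centers):
    """Group each 2-D point with its nearest center (squared Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def mean_point(cluster):
    """Return the coordinate-wise mean of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def k_means(points, centers, iterations=10):
    """Alternate between assigning points and moving each center to its cluster mean."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [mean_point(cl) if cl else c for cl, c in zip(clusters, centers)]
    return centers, assign(points, centers)


# Hypothetical observations with no response variable: two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```

No response variable appears anywhere in the code: the algorithm only looks for structure among the observed variables.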

Below are some of the more frequently used predictive models:

Supervised learning, continuous response

Multiple linear regression
Penalized regression
Decision trees
Neural networks
Support vector regression
K nearest neighbors

Supervised learning, categorical response

Logistic regression
Penalized logistic regression
Decision trees
Neural networks
Support vector machines
K nearest neighbors
Discriminant analysis
Naïve Bayes classification

Unsupervised learning

Principal components analysis
Hierarchical clustering
K-means clustering
Association analysis/market basket analysis
Factor analysis

Example data set for predictive modeling

Other pages in this section discuss predictive modeling techniques. The data they use are described here.

Let’s say that you own a process that recovers a chemical substance from a substrate. You would like to find process settings that maximize the yield (a continuous response variable) and maximize the quality (a categorical response variable) of the substance recovered. You measure many process variables for each batch, both continuous and categorical.

If you have JMP on your computer, you can download the JMP data set Recovery.jmp for your own analysis as you go through these predictive modeling pages. (If you don’t have access to JMP, a free trial is available for download.)

Figure 1: Distributions of two responses: Percent Recovered and Quality Level.

Figure 2: Distributions of 24 predictor variables in Recovery data.

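You can fit a main effects model (one term per predictor, with no interactions) to the response Percent Recovered. Holding out a validation set gives an honest assessment, because the model is judged on data that were not used to fit it.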

Figure 3: Prediction Profiler for main effects model fit to the Percent Recovered response of the Recovery data. Cross sections of the multidimensional fitted surface are displayed.
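
The same workflow can be imitated outside JMP. The following is a minimal sketch, assuming simulated data with two predictors in place of Recovery.jmp and a hand-written least squares solver; every name, coefficient, and split size in it is hypothetical. It fits a main effects model on a training portion, measures error on a held-out validation portion, and then computes one cross section of the fitted surface in the spirit of the profiler.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_main_effects(rows, y):
    """Least squares fit of y = b0 + b1*x1 + ... + bp*xp, with no interaction terms."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)  # normal equations: (X'X) b = X'y


def predict(coefs, row):
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


def rmse(coefs, rows, y):
    """Root mean squared prediction error."""
    return (sum((predict(coefs, r) - yi) ** 2 for r, yi in zip(rows, y)) / len(y)) ** 0.5


# Simulated batches (NOT the Recovery data): two continuous predictors plus noise.
rng = random.Random(0)
rows = [(rng.uniform(0, 10), rng.uniform(0, 5)) for _ in range(200)]
y = [50 + 2.0 * x1 - 3.0 * x2 + rng.gauss(0, 1) for x1, x2 in rows]

# Rows were generated in random order, so the last 25% can serve as the validation set.
train_rows, valid_rows = rows[:150], rows[150:]
train_y, valid_y = y[:150], y[150:]

coefs = fit_main_effects(train_rows, train_y)
train_rmse = rmse(coefs, train_rows, train_y)
valid_rmse = rmse(coefs, valid_rows, valid_y)  # error on data the fit never saw

# Cross section of the fitted surface: vary x1 while holding x2 fixed at 2.5.
profile = [predict(coefs, (x1, 2.5)) for x1 in (0.0, 5.0, 10.0)]
print(coefs, train_rmse, valid_rmse, profile)
```

Because the model has only main effects, each cross section is a straight line whose slope equals the coefficient of the predictor being varied.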

Figure 4: Prediction Profiler for logistic model fit to the Quality Level response of the Recovery data, predicting the probability that Quality Level = Best.
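
To show what "predicting the probability" means for a categorical response, here is a small sketch of a logistic model with a single predictor, fit by gradient ascent on the log-likelihood. The data are invented, with label 1 standing in for Quality Level = Best; the learning rate and step count are arbitrary choices, and statistical software typically uses faster Newton-type iterations.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, labels, learning_rate=0.1, steps=5000):
    """Fit P(label = 1) = sigmoid(b0 + b1 * x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        # Residuals: observed label minus currently predicted probability.
        errors = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, labels)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * xi for e, xi in zip(errors, x)) / n
    return b0, b1


# Hypothetical batches: one process setting and whether quality was "Best" (1) or not (0).
setting = [1, 2, 3, 4, 5, 6, 7, 8]
is_best = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(setting, is_best)

# Predicted probability of "Best" at a low and a high value of the setting.
prob_low = sigmoid(b0 + b1 * 1)
prob_high = sigmoid(b0 + b1 * 8)
print(b0, b1, prob_low, prob_high)
```

Tracing the predicted probability across the range of a predictor gives an S-shaped curve of the kind the profiler displays for a logistic model.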
