Overview of Neural Networks

Think of a neural network as a function of a set of derived inputs, called hidden nodes. The hidden nodes are nonlinear functions of the original inputs. You can specify up to two layers of hidden nodes, with each layer containing as many hidden nodes as you want.

Neural Network Diagram shows a two-layer neural network with three X variables and one Y variable. In this example, the first layer has two nodes, and each node is a function of all three nodes in the second layer. The second layer has three nodes, and all nodes are a function of the three X variables. The predicted Y variable is a function of both nodes in the first layer.

Neural Network Diagram

The functions applied at the nodes of the hidden layers are called activation functions. The activation function is a transformation of a linear combination of the X variables. For more details about the activation functions, see Hidden Layer Structure.

The function applied at the response is a linear combination (for continuous responses), or a logistic transformation (for nominal or ordinal responses).
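The structure described above can be sketched as a forward pass. This is only an illustration of the diagram's 3-3-2-1 layout for a continuous response, not JMP's implementation; the weight values, the `params` dictionary, and the function names are hypothetical. Note that, following this topic's naming, the "second" layer is the one fed by the X variables and the "first" layer is the one that feeds Y.

```python
import math

def dense(inputs, weights, biases):
    """Return one linear combination per node: bias + sum(weight * input)."""
    return [b + sum(w * v for w, v in zip(row, inputs))
            for row, b in zip(weights, biases)]

def predict(x, params):
    # Second layer (fed by the X variables): three TanH nodes.
    h2 = [math.tanh(z) for z in dense(x, params["W2"], params["b2"])]
    # First layer (feeds Y): two TanH nodes, each a function of all
    # three second-layer nodes.
    h1 = [math.tanh(z) for z in dense(h2, params["W1"], params["b1"])]
    # Continuous response: a plain linear combination of the first layer.
    return dense(h1, params["Wy"], params["by"])[0]

# Hypothetical parameter values, chosen only so the example runs.
params = {
    "W2": [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.1]],
    "b2": [0.0, 0.1, -0.1],
    "W1": [[0.6, -0.4, 0.2], [-0.5, 0.3, 0.7]],
    "b1": [0.05, -0.05],
    "Wy": [[1.5, -0.8]],
    "by": [0.1],
}

y_hat = predict([1.0, -0.5, 2.0], params)
print(y_hat)
```

For a nominal or ordinal response, the last step would apply a logistic transformation to the linear combination instead of returning it directly.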

The main advantage of a neural network model is that it can efficiently model different response surfaces. Given enough hidden nodes and layers, any surface can be approximated to any accuracy. The main disadvantage of a neural network model is that the results are not easily interpretable, since there are intermediate layers rather than a direct path from the X variables to the Y variables, as in the case of regular regression.

Launch the Neural Platform

Most features described in this topic are available only in JMP Pro and are marked with the JMP Pro icon.

To launch the Neural platform, select Analyze > Modeling > Neural.

Launching the Neural platform is a two-step process. First, enter your variables on the Neural launch window. Second, specify your options in the Model Launch.

The Neural Launch Window

Use the Neural launch window to specify the X and Y variables and a validation column, and to enable missing value coding.

The Neural Launch Window

Description of the Neural Launch Window
Y, Response	Choose the response variable. When multiple responses are specified, the models for the responses share all parameters in the hidden layers (those parameters not connected to the responses).
X, Factor	Choose the input variables.
Freq	Choose a frequency variable.
Validation	Choose a validation column. For more information, see Validation Method. If you click the Validation button with no columns selected in the Select Columns list, you can add a validation column to your data table. For more information about the Make Validation Column utility, see Basic Analysis.
By	Choose a variable to create separate models for each level of the variable.
Missing Value Coding	Check this box to enable informative coding of missing values. This coding allows estimation of a predictive model despite the presence of missing values. It is useful in situations where missing data are informative. If this option is not checked, rows with missing values are ignored. For a continuous variable, missing values are replaced by the mean of the variable. Also, a missing value indicator, named <colname> Is Missing, is created and included in the model. If a variable is transformed using the Transform Covariates fitting option on the Model Launch window, missing values are replaced by the mean of the transformed variable. For a categorical variable, missing values are treated as a separate level of that variable.
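The informative missing value coding described above can be sketched as follows. This is a minimal illustration, assuming missing values are represented by `None`; the function names and the `"Missing"` level label are hypothetical and are not part of JMP.

```python
def code_missing_continuous(values):
    """Replace missing values with the column mean and build an indicator."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    # Plays the role of the "<colname> Is Missing" indicator column.
    is_missing = [1 if v is None else 0 for v in values]
    return filled, is_missing

def code_missing_categorical(values, missing_label="Missing"):
    """Treat missing values as their own level of the variable."""
    return [missing_label if v is None else v for v in values]

filled, is_missing = code_missing_continuous([1.0, None, 3.0])
levels = code_missing_categorical(["a", None, "b"])
print(filled, is_missing, levels)
```

Both the filled column and the indicator would then enter the model, so a row with a missing value still contributes to the fit.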

The Model Launch

Use the Model Launch dialog to specify the validation method, the structure of the hidden layer, whether to use gradient boosting, and other fitting options.

The Model Launch Dialog

Description of the Model Launch Dialog
Validation Method	Select the method that you want to use for model validation. For details, see Validation Method.
Hidden Layer Structure or Hidden Nodes	Specify the number of hidden nodes of each type in each layer. For details, see Hidden Layer Structure. Note: The standard edition of JMP uses only the TanH activation function, and can fit only neural networks with one hidden layer.
Boosting	Specify options for gradient boosting. For details, see Boosting.
Fitting Options	Specify options for variable transformation and model fitting. For details, see Fitting Options.
Go	Fits the neural network model and shows the model reports.

After you click Go to fit a model, you can reopen the Model Launch Dialog and change the settings to fit another model.

Validation Method

Neural networks are very flexible models and have a tendency to overfit data. When that happens, the model predicts the fitted data very well, but predicts future observations poorly. To mitigate overfitting, the Neural platform does the following:

•	applies a penalty on the model parameters

•	uses an independent data set to assess the predictive power of the model

Validation is the process of using part of a data set to estimate model parameters, and using the other part to assess the predictive ability of the model.

•	The training set is the part used to estimate the model parameters.

•	The validation set is the part used to estimate the optimal value of the penalty and to assess, or validate, the predictive ability of the model.

•	The test set is a final, independent assessment of the model’s predictive ability. The test set is available only when using a validation column. See Validation Methods.

The training, validation, and test sets are created by subsetting the original data into parts. Validation Methods describes several methods for subsetting a data set.

Validation Methods

Excluded Rows

Uses row states to subset the data. Rows that are unexcluded are used as the training set, and excluded rows are used as the validation set.

For more information about using row states and how to exclude rows, see Using JMP.

Holdback

Randomly divides the original data into the training and validation sets. You can specify the proportion of the original data to use as the validation set (holdback).
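A holdback split can be sketched as below. This is an illustrative sketch only; the function name, the `seed` argument, and the default proportion are assumptions and do not describe how JMP draws its random split.

```python
import random

def holdback_split(n_rows, holdback=1/3, seed=None):
    """Randomly assign row indices to training and validation sets."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    n_valid = round(n_rows * holdback)   # rows held back for validation
    valid = sorted(rows[:n_valid])
    train = sorted(rows[n_valid:])
    return train, valid

train_rows, valid_rows = holdback_split(10, holdback=0.3, seed=1)
print(train_rows, valid_rows)
```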

KFold

Divides the original data into K subsets. In turn, each of the K sets is used to validate the model fit on the rest of the data, fitting a total of K models. The model giving the best validation statistic is chosen as the final model.

This method is best for small data sets, because it makes efficient use of limited amounts of data.
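The KFold procedure above can be sketched as a generator of training and validation index sets. The function name and the random assignment of rows to folds are assumptions made for illustration.

```python
import random

def kfold_splits(n_rows, k, seed=None):
    """Yield (training rows, validation rows) once for each of the K folds."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]   # K roughly equal subsets
    for i, valid in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, valid

for train_rows, valid_rows in kfold_splits(6, 3, seed=1):
    print(sorted(train_rows), sorted(valid_rows))
```

One model would be fit per yielded pair, and the model with the best validation statistic kept as the final model.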

Validation Column

Uses the column’s values to divide the data into parts. The column is assigned using the Validation role on the Neural launch window. See The Neural Launch Window.

The column’s values determine how the data is split, and what method is used for validation:

•	If the column has three unique values, then:

‒	the smallest value is used for the Training set.

‒	the middle value is used for the Validation set.

‒	the largest value is used for the Test set.

•	If the column has two unique values, then only Training and Validation sets are used.

•	If the column has more than three unique values, then KFold validation is performed.
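The rules in the list above can be sketched as a mapping from column values to roles. The function name and the "Fold n" labels are hypothetical. For the two-value case, the sketch assumes that the smaller value marks the training set, which the text above does not state explicitly.

```python
def roles_from_validation_column(column):
    """Map each unique value of a validation column to its role."""
    levels = sorted(set(column))
    if len(levels) == 3:
        names = ["Training", "Validation", "Test"]
    elif len(levels) == 2:
        # Assumption: the smaller value is the training set.
        names = ["Training", "Validation"]
    elif len(levels) > 3:
        # More than three values: each value defines one fold for KFold.
        names = [f"Fold {i + 1}" for i in range(len(levels))]
    else:
        raise ValueError("a validation column needs at least two unique values")
    return dict(zip(levels, names))

print(roles_from_validation_column([0, 1, 2, 0, 1]))
print(roles_from_validation_column([1, 2, 3, 4, 5]))
```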

Hidden Layer Structure

Note: The standard edition of JMP uses only the TanH activation function, and can fit only neural networks with one hidden layer.

The Neural platform can fit one- or two-layer neural networks. Increasing the number of nodes in the first layer, or adding a second layer, makes the neural network more flexible. You can add an unlimited number of nodes to either layer. In a two-layer network, the second layer nodes are functions of the X variables, the first layer nodes are functions of the second layer nodes, and the Y variables are functions of the first layer nodes.

The functions applied at the nodes of the hidden layers are called activation functions. An activation function is a transformation of a linear combination of the X variables. Activation Functions describes the three types of activation functions.

Activation Functions
TanH	The hyperbolic tangent function is a sigmoid function. TanH transforms values to be between -1 and 1, and is the centered and scaled version of the logistic function. The hyperbolic tangent function is $\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$, where x is a linear combination of the X variables.
Linear	The identity function. The linear combination of the X variables is not transformed. The Linear activation function is most often used in conjunction with one of the nonlinear activation functions. In this case, the Linear activation function is placed in the second layer, and the nonlinear activation functions are placed in the first layer. This is useful if you want to first reduce the dimensionality of the X variables, and then have a nonlinear model for the Y variables. For a continuous Y variable, if only Linear activation functions are used, the model for the Y variable reduces to a linear combination of the X variables. For a nominal or ordinal Y variable, the model reduces to a logistic regression.
Gaussian	The Gaussian function. Use this option for radial basis function behavior, or when the response surface is Gaussian (normal) in shape. The Gaussian function is $e^{-x^2}$, where x is a linear combination of the X variables.
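The three activation functions can be sketched in a few lines. This sketch assumes the closed forms (e^(2x) - 1)/(e^(2x) + 1) for TanH and e^(-x^2) for Gaussian; the function names are illustrative. It also checks numerically that TanH is the centered and scaled logistic function, tanh(x) = 2·logistic(2x) - 1.

```python
import math

def tanh_activation(x):
    """Hyperbolic tangent written out explicitly; values lie in (-1, 1)."""
    return (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)

def linear_activation(x):
    """Identity: the linear combination is passed through unchanged."""
    return x

def gaussian_activation(x):
    """Bell-shaped response that peaks at x = 0 (assumed form e^(-x^2))."""
    return math.exp(-x * x)

def logistic(x):
    return 1 / (1 + math.exp(-x))

x = 0.7
print(tanh_activation(x), 2 * logistic(2 * x) - 1)  # the two values agree
print(linear_activation(x), gaussian_activation(x))
```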

Boosting

Boosting is the process of building a large additive neural network model by fitting a sequence of smaller models. Each of the smaller models is fit on the scaled residuals of the previous model. The models are combined to form the larger final model. The process uses validation to assess how many component models to fit, not exceeding the specified number of models.

Boosting is often faster than fitting a single large model. However, the base model should be a single-layer model with one or two nodes. With a larger base model, the benefit of faster fitting can be lost if a large number of models is specified.

Use the Boosting panel in the Model Launch to specify the number of component models and the learning rate. Use the Hidden Layer Structure panel in the Model Launch to specify the structure of the base model.

The learning rate r must satisfy 0 < r ≤ 1. Learning rates close to 1 result in faster convergence on a final model, but also have a higher tendency to overfit data. Use learning rates close to 1 when a small Number of Models is specified.

As an example of how boosting works, suppose you specify a base model consisting of one layer and two nodes, with the number of models equal to eight. The first step is to fit a one-layer, two-node model. The predicted values from that model are scaled by the learning rate, then subtracted from the actual values to form a scaled residual. The next step is to fit a different one-layer, two-node model on the scaled residuals of the previous model. This process continues until eight models are fit, or until the addition of another model fails to improve the validation statistic. The component models are combined to form the final, large model. In this example, if six models are fit before stopping, the final model consists of one layer and 2 × 6 = 12 nodes.
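The boosting loop in this example can be sketched as follows. To keep the sketch short, the base learner is a stand-in ordinary least-squares line rather than a two-node neural network, and the validation statistic is assumed to be the sum of squared errors on the validation set. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Stand-in base learner: a least-squares line, not a neural network."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def predict(models, rate, x):
    """Final additive model: sum of component predictions scaled by the rate."""
    return sum(rate * m(x) for m in models)

def boost(train, valid, n_models=8, rate=0.5, fit=fit_line):
    xs, ys = train
    vx, vy = valid
    residuals = list(ys)
    models, best_err = [], float("inf")
    for _ in range(n_models):
        model = fit(xs, residuals)
        candidate = models + [model]
        err = sum((y - predict(candidate, rate, x)) ** 2 for x, y in zip(vx, vy))
        if err >= best_err:
            break  # another model no longer improves the validation statistic
        models, best_err = candidate, err
        # Scale the predictions by the learning rate, then subtract them.
        residuals = [r - rate * model(x) for x, r in zip(xs, residuals)]
    return models

train = ([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0])
valid = ([0.5, 2.5], [2.0, 6.0])
models = boost(train, valid, n_models=8, rate=0.5)
print(len(models), predict(models, 0.5, 2.0))
```

With a learning rate below 1, each component model removes only part of the remaining residual, so the combined prediction approaches the target gradually as models are added.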

Fitting Options

Fitting Options describes the model fitting options that you can specify.

Fitting Options
Transform Covariates	Transforms all continuous variables to near normality using either the Johnson Su or Johnson Sb distribution. Transforming the continuous variables helps to mitigate the negative effects of outliers or heavily skewed distributions. See the Save Transformed Covariates option in Model Options.
Robust Fit	Trains the model using least absolute deviations instead of least squares. This option is useful if you want to minimize the impact of response outliers. This option is available only for continuous responses.
Penalty Method	Choose the penalty method. To mitigate the tendency neural networks have to overfit data, the fitting process incorporates a penalty on the likelihood. See Penalty Method.
Number of Tours	Specify the number of times to restart the fitting process, with each iteration using different random starting points for the parameter estimates. The iteration with the best validation statistic is chosen as the final model.
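The Number of Tours option amounts to a random-restart loop, which can be sketched as below. The callables `fit_from_start` and `validation_stat` are hypothetical placeholders supplied by the caller, the starting values are drawn from a standard normal distribution purely for illustration, and a larger validation statistic is assumed to be better.

```python
import random

def best_of_tours(fit_from_start, validation_stat, n_tours, n_params, seed=None):
    """Refit from several random starts and keep the best-validating model."""
    rng = random.Random(seed)
    best_stat, best_model = None, None
    for _ in range(n_tours):
        start = [rng.gauss(0.0, 1.0) for _ in range(n_params)]
        model = fit_from_start(start)      # one tour: fit from this start
        stat = validation_stat(model)
        if best_stat is None or stat > best_stat:
            best_stat, best_model = stat, model
    return best_model

# Toy usage: the "fit" returns its starting value, and values closer
# to zero score better.
def toy_fit(start):
    return start[0]

def toy_stat(model):
    return -abs(model)

winner = best_of_tours(toy_fit, toy_stat, n_tours=5, n_params=1, seed=0)
print(winner)
```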

Penalty Method

The penalty is $\lambda\, p(\beta_i)$, where λ is the penalty parameter, and p( ) is a function of the parameter estimates, called the penalty function. Validation is used to find the optimal value of the penalty parameter.

Descriptions of Penalty Methods
Method	Penalty Function	Description
Squared	$\sum \beta_i^2$	Use this method if you think that most of your X variables are contributing to the predictive ability of the model.
Absolute	$\sum |\beta_i|$	Use either this method or Weight Decay if you have a large number of X variables, and you think that a few of them contribute more than others to the predictive ability of the model.
Weight Decay	$\sum \frac{\beta_i^2}{1 + \beta_i^2}$	Use either this method or Absolute if you have a large number of X variables, and you think that a few of them contribute more than others to the predictive ability of the model.
NoPenalty	none	Does not use a penalty. You can use this option if you have a large amount of data and you want the fitting process to go quickly. However, this option can lead to models with lower predictive performance than models that use a penalty.
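The penalty methods can be compared with a short sketch. It assumes the usual forms of the penalty functions (sum of squares, sum of absolute values, and sum of β²/(1 + β²) for weight decay) and assumes that the penalty is added to a negative log-likelihood. The function names and the example numbers are illustrative only.

```python
def squared(betas):
    return sum(b * b for b in betas)

def absolute(betas):
    return sum(abs(b) for b in betas)

def weight_decay(betas):
    # Each term is bounded by 1, so large estimates are penalized less
    # severely than under the squared penalty.
    return sum(b * b / (1 + b * b) for b in betas)

def no_penalty(betas):
    return 0.0

def penalized_objective(neg_log_likelihood, betas, lam, penalty):
    """Quantity minimized during fitting: fit term plus lambda * p(beta)."""
    return neg_log_likelihood + lam * penalty(betas)

betas = [1.0, -2.0]   # hypothetical parameter estimates
for p in (squared, absolute, weight_decay, no_penalty):
    print(p.__name__, penalized_objective(10.0, betas, lam=0.1, penalty=p))
```

In practice, λ would be chosen by refitting over a range of values and keeping the one with the best validation statistic.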