Neural Networks
What is a neural network?
An artificial neural network is a very flexible model that can handle complex relationships between inputs and outputs. Neural networks are modeled after the functioning of neurons in the human brain. In a biological neuron, input signals from many other neurons are processed to produce an output signal. Humans have billions of neurons in interconnected neural networks. The information from these networks is combined, enabling us to learn from past experiences and generalize our knowledge. The first artificial neural networks were developed to study and mimic the human brain’s ability to learn from past experience.
What are some of the advantages and disadvantages of neural networks?
Advantages
- Neural networks can model continuous or categorical responses.
- A neural network is a very flexible model, so it can achieve very low bias; with careful validation, this flexibility need not come at the cost of high variance.
- Neural networks don’t have the usual statistical assumptions of normality or independence.
- Neural networks can detect complex nonlinear relationships.
Disadvantages
- Neural networks are a black box – they can be difficult to understand and interpret.
- Large data sets are needed to adequately train a neural network.
- Care must be taken to avoid overfitting.
- Choosing the activation functions, the number of hidden nodes, and the number of hidden layers can be difficult.
A neural network with one hidden layer
An artificial neural network has an input layer, one or more hidden layers, and an output layer. Let’s consider a simple neural network for the Recovery data we introduced in our overview of predictive modeling. We’ll start by building a model using the continuous response, Percent Recovered.
The diagram of the neural network shows its structure. In ordinary linear regression, a linear combination of the x variables predicts the response. In the neural model, there is an extra layer between the predictors and the response. In this example, the hidden layer contains five nodes.
The input to each hidden node is a linear combination of the predictors. In the example above, each of the first four hidden nodes applies the hyperbolic tangent (TanH) function to its linear combination of predictors, and the last hidden node applies a linear function to its linear combination.
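In symbols, with hats denoting estimated parameters (the coefficient names here are illustrative, since the diagram’s labels are not reproduced), the kth TanH node computes

$H_k = \tanh\left(\hat{c}_k + \hat{b}_{1k} x_1 + \hat{b}_{2k} x_2 + \cdots + \hat{b}_{pk} x_p\right)$

where $x_1, \dots, x_p$ are the predictors.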
You’ll notice that the TanH function has an S-shape, similar to the logistic function used in logistic regression models. It is nonlinear and flexible. In this example, the TanH function is applied in four nodes.
The linear function is simple – the output is a linear function of the input, just like in regression: $H(x) = a + b \cdot x$. It is often useful in neural networks when the data are collected over time.
The parameters for each node are found by selecting some values at random, then running an iterative algorithm that changes the values of the parameters to optimize an objective function.
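A minimal sketch of this idea in Python, fitting a single TanH node by gradient descent on squared error (the data, learning rate, and one-node setup are illustrative assumptions, not the actual algorithm used for the full network):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.tanh(0.5 + 2.0 * x)         # hypothetical "true" relationship

a, b = rng.normal(size=2)          # start from random parameter values
lr = 0.1                           # step size for the iterative updates
for _ in range(500):
    pred = np.tanh(a + b * x)      # current output of the single TanH node
    resid = pred - y
    grad = resid * (1 - pred**2)   # chain rule through tanh (constant factor folded into lr)
    a -= lr * grad.mean()          # move parameters downhill on squared error
    b -= lr * (grad * x).mean()
```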
The output of the neural network is a linear combination of the hidden nodes. Each hidden node is a nonlinear function of a linear combination of the predictors. The prediction of the continuous response, then, is a linear combination of the H functions.
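For the five-node example above, the predicted continuous response is

$\hat{y} = \hat{w}_0 + \hat{w}_1 H_1 + \hat{w}_2 H_2 + \hat{w}_3 H_3 + \hat{w}_4 H_4 + \hat{w}_5 H_5$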
What about a model for the categorical response, Quality Level? The same neural structure can be used. However, this model will predict the probability of membership in each Quality Level class. To compute these probabilities, a logistic function is applied to the outputs from the hidden layer.
$\text{logit}(\hat{\pi}) = \log\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = \hat{w}_0 + \hat{w}_1 H_1 + \hat{w}_2 H_2 + \hat{w}_3 H_3 + \hat{w}_4 H_4 + \hat{w}_5 H_5$
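Inverting the logit gives the predicted probability itself:

$\hat{\pi} = \frac{1}{1 + e^{-(\hat{w}_0 + \hat{w}_1 H_1 + \hat{w}_2 H_2 + \hat{w}_3 H_3 + \hat{w}_4 H_4 + \hat{w}_5 H_5)}}$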
A neural network with two hidden layers
A second hidden layer can be useful to model discontinuous behavior in the response. Consider a response that is strongly dependent on a single categorical predictor. When changing from one level of the predictor to another, there’s not a smooth ramp up in the response; rather, there’s a discontinuous jump. That’s a great case for using a second hidden layer.
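In code, adding a second hidden layer is typically just one more entry in the layer specification. A minimal scikit-learn sketch, with arbitrary layer sizes:

```python
from sklearn.neural_network import MLPRegressor

# Two hidden layers of five nodes each; the sizes are arbitrary choices
# for illustration, not values taken from the example in the text.
two_layer_net = MLPRegressor(hidden_layer_sizes=(5, 5), activation="tanh")
```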
What are the key differences between neural network and deep learning models?
Neural networks are often components of deep learning models. However, neural models for traditional predictive modeling and deep learning models are different. Here are some key differences.
In traditional neural models, activation functions such as TanH and linear functions are used for statistical modeling, and Gaussian functions can be used for function estimation. In deep learning models, other activation functions, such as variants of the rectified linear unit (ReLU), are used to minimize neuron saturation.
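A small numerical illustration of saturation, assuming NumPy for the arithmetic:

```python
import numpy as np

x = np.linspace(-6, 6, 13)

tanh = np.tanh(x)            # S-shaped; flattens out near -1 and +1
relu = np.maximum(0.0, x)    # rectified linear; keeps growing for x > 0

# Beyond about |x| = 3, tanh is nearly constant (saturated), so its
# gradient is close to zero; ReLU's slope stays 1 for all positive x.
```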
Typically, only one or two hidden layers are used in traditional predictive modeling using neural networks. In deep learning, the models can contain many hidden layers.
In deep learning, techniques beyond simple holdout validation are used to control overfitting, including L1 or L2 regularization, dropout, and batch normalization.
Deep learning models are powerful predictive models. They often require the use of a GPU instead of a CPU to perform calculations.
Example of a neural network with a continuous response
Let’s fit a neural network model with one hidden layer containing four TanH nodes and one linear node to the continuous response Percent Recovered.
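A rough sketch of a comparable fit in Python with scikit-learn follows. The data are simulated stand-ins (the Recovery data are not reproduced here), and scikit-learn applies one activation function to an entire hidden layer, so a five-node all-TanH layer only approximates the 4-TanH-plus-1-linear structure; the fit statistics discussed next come from the original analysis, not from this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

# Simulated stand-in for the Recovery data: three hypothetical predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = np.tanh(X @ [1.5, -2.0, 0.7]) + 0.1 * rng.normal(size=600)

# Hold out a validation set for comparing models; a test set would be
# reserved separately for final evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

net = MLPRegressor(hidden_layer_sizes=(5,), activation="tanh",
                   solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X_train, y_train)

print("Training R-square:  ", r2_score(y_train, net.predict(X_train)))
print("Validation R-square:", r2_score(y_val, net.predict(X_val)))
```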
Remember that the training set is used to fit the model, the validation set is used to compare models, and the test set is used to evaluate real-world performance after the final model has been selected. You’ll notice that R-square on the training set is fairly high (for observational data like these) at about 88%. R-square on the validation set is lower, at about 76%. The fit statistics on the test set can be ignored until a model is selected.
The next step would probably be tuning the neural model – finding the values of the model hyperparameters to optimize the validation fit statistics.
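Continuing directly from the sketch above (it reuses X_train, y_train, X_val, y_val, and the imports defined there), a simple manual search over candidate hidden-layer sizes might look like this; the candidate values are arbitrary:

```python
# Compare validation R-square across a few candidate hidden-layer sizes.
for n_nodes in (3, 5, 8, 12):
    candidate = MLPRegressor(hidden_layer_sizes=(n_nodes,), activation="tanh",
                             solver="lbfgs", max_iter=2000, random_state=0)
    candidate.fit(X_train, y_train)
    print(n_nodes, "nodes:", r2_score(y_val, candidate.predict(X_val)))
```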
Example of a neural network with a categorical response
Let’s fit the same neural network structure – one hidden layer containing four TanH nodes and one linear node – to the categorical response Quality Level.
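A comparable scikit-learn sketch for a binary stand-in response follows (again with simulated data and an all-TanH approximation of the hidden layer; the misclassification rates quoted below come from the original analysis):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Simulated stand-in: a binary quality indicator derived from three predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 3))
y = (np.tanh(X @ [1.5, -2.0, 0.7]) + 0.2 * rng.normal(size=600) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1)

clf = MLPClassifier(hidden_layer_sizes=(5,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=1)
clf.fit(X_train, y_train)

# Misclassification rate = 1 - accuracy on each partition.
print("Training misclassification:  ", 1 - clf.score(X_train, y_train))
print("Validation misclassification:", 1 - clf.score(X_val, y_val))
```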
We see that the misclassification rate on the training set (6%) is about half the misclassification rate on the validation set (12%). The model describes the training data very well but does not perform as well on the validation data. Remember, the validation data were not used to fit the model.
This behavior, where a model fits the training set much better than the validation set, is called overfitting. Neural models are notorious for overfitting. That’s why it is vital to use honest assessment, including a test set, to select and evaluate neural network models.