Neural Networks
What is a neural network?
An artificial neural network is a very flexible model that can handle complex relationships between inputs and outputs. Neural networks are modeled after the functioning of neurons in the human brain. In a biological neuron, input signals from many other neurons are processed to produce an output signal. Humans have billions of neurons in interconnected neural networks. The information from these networks is combined, enabling us to learn from past experiences and generalize our knowledge. The first artificial neural networks were developed to study and mimic the human brain’s ability to learn from past experience.
What are some of the advantages and disadvantages of neural networks?
Advantages
- Neural networks can model continuous or categorical responses.
- A neural network is a very flexible model, so it can achieve very low bias; with careful validation, this flexibility need not come at the cost of high variance.
- Neural networks don’t have the usual statistical assumptions of normality or independence.
- Neural networks can detect complex nonlinear relationships.
Disadvantages
- Neural networks are a black box – they can be difficult to understand and interpret.
- Large data sets are needed to adequately train a neural network.
- Care must be taken to avoid overfitting.
- Choosing the activation functions, the number of hidden nodes, and the number of hidden layers can be difficult.
A neural network with one hidden layer
An artificial neural network has an input layer, one or more hidden layers, and an output layer. Let’s consider a simple neural network for the Recovery data we introduced in our overview of predictive modeling. We’ll start by building a model using the continuous response, Percent Recovered.
The diagram of the neural network shows its structure. In ordinary linear regression, a linear combination of the x variables predicts the response. In the neural model, there is an extra layer between the predictors and the response. In this example, the hidden layer contains five nodes.
The input to each hidden node is a linear combination of the predictors. In the example above, each of the first four hidden nodes applies the hyperbolic tangent (TanH) function to its linear combination of predictors, and the last hidden node applies a linear function to its linear combination.
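In symbols, with hats denoting estimated parameters (the coefficient names here are illustrative, since the diagram’s labels are not reproduced), the kth TanH node computes

$H_k = \tanh\left(\hat{c}_k + \hat{b}_{1k} x_1 + \hat{b}_{2k} x_2 + \cdots + \hat{b}_{pk} x_p\right)$

where $x_1, \dots, x_p$ are the predictors.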
You’ll notice that the TanH function has an S-shape, similar to the logistic function used in logistic regression models. It is nonlinear and flexible. In this example, the TanH function is applied in four nodes.
The linear function is simple – the output is a linear function of the input, just like in regression: $H(x) = a + b \cdot x$. It is often useful in neural networks when the data are collected over time.
The parameters for each node are found by selecting some values at random, then running an iterative algorithm that changes the values of the parameters to optimize an objective function.
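A minimal sketch of this idea in Python, fitting a single TanH node by gradient descent on squared error (the data, learning rate, and one-node setup are illustrative assumptions, not the actual algorithm used for the full network):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.tanh(0.5 + 2.0 * x)         # hypothetical "true" relationship

a, b = rng.normal(size=2)          # start from random parameter values
lr = 0.1                           # step size for the iterative updates
for _ in range(500):
    pred = np.tanh(a + b * x)      # current output of the single TanH node
    resid = pred - y
    grad = resid * (1 - pred**2)   # chain rule through tanh (constant factor folded into lr)
    a -= lr * grad.mean()          # move parameters downhill on squared error
    b -= lr * (grad * x).mean()
```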
The output of the neural network is a linear combination of the hidden nodes. Each hidden node is a nonlinear function of a linear combination of the predictors. The prediction of the continuous response, then, is a linear combination of the H functions.
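For the five-node example above, the predicted continuous response is

$\hat{y} = \hat{w}_0 + \hat{w}_1 H_1 + \hat{w}_2 H_2 + \hat{w}_3 H_3 + \hat{w}_4 H_4 + \hat{w}_5 H_5$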
What about a model for the categorical response, Quality Level? The same neural structure can be used. However, this model will predict the probability of membership in each Quality Level class. To compute these probabilities, a logistic function is applied to the outputs from the hidden layer.
$\text{logit}(\hat{\pi}) = \log\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = \hat{w}_0 + \hat{w}_1 H_1 + \hat{w}_2 H_2 + \hat{w}_3 H_3 + \hat{w}_4 H_4 + \hat{w}_5 H_5$
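Inverting the logit gives the predicted probability itself:

$\hat{\pi} = \frac{1}{1 + e^{-(\hat{w}_0 + \hat{w}_1 H_1 + \hat{w}_2 H_2 + \hat{w}_3 H_3 + \hat{w}_4 H_4 + \hat{w}_5 H_5)}}$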
A neural network with two hidden layers
A second hidden layer can be useful to model discontinuous behavior in the response. Consider a response that is strongly dependent on a single categorical predictor. When changing from one level of the predictor to another, there’s not a smooth ramp up in the response; rather, there’s a discontinuous jump. That’s a great case for using a second hidden layer.
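In code, adding a second hidden layer is typically just one more entry in the layer specification. A minimal scikit-learn sketch, with arbitrary layer sizes:

```python
from sklearn.neural_network import MLPRegressor

# Two hidden layers of five nodes each; the sizes are arbitrary choices
# for illustration, not values taken from the example in the text.
two_layer_net = MLPRegressor(hidden_layer_sizes=(5, 5), activation="tanh")
```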
What are the key differences between neural network and deep learning models?
Neural networks are often components of deep learning models. However, neural models for traditional predictive modeling and deep learning models are different. Here are some key differences.
In traditional neural models, activation functions such as TanH and linear functions are used for statistical modeling, and Gaussian functions can be used for function estimation. In deep learning models, other activation functions, such as variants of the rectified linear unit (ReLU), are used to minimize neuron saturation.
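A small numerical illustration of saturation, assuming NumPy for the arithmetic:

```python
import numpy as np

x = np.linspace(-6, 6, 13)

tanh = np.tanh(x)            # S-shaped; flattens out near -1 and +1
relu = np.maximum(0.0, x)    # rectified linear; keeps growing for x > 0

# Beyond about |x| = 3, tanh is nearly constant (saturated), so its
# gradient is close to zero; ReLU's slope stays 1 for all positive x.
```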
Typically, only one or two hidden layers are used in traditional predictive modeling using neural networks. In deep learning, the models can contain many hidden layers.
In deep learning, techniques beyond simple holdout validation are used to control overfitting, including L1 or L2 regularization, dropout, and batch normalization.
Deep learning models are powerful predictive models. They often require the use of a GPU instead of a CPU to perform calculations.
Example of a neural network with a continuous response
Let’s fit a neural network model with one hidden layer containing four TanH nodes and one linear node to the continuous response Percent Recovered.
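A rough sketch of a comparable fit in Python with scikit-learn follows. The data are simulated stand-ins (the Recovery data are not reproduced here), and scikit-learn applies one activation function to an entire hidden layer, so a five-node all-TanH layer only approximates the 4-TanH-plus-1-linear structure; the fit statistics discussed next come from the original analysis, not from this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

# Simulated stand-in for the Recovery data: three hypothetical predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = np.tanh(X @ [1.5, -2.0, 0.7]) + 0.1 * rng.normal(size=600)

# Hold out a validation set for comparing models; a test set would be
# reserved separately for final evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

net = MLPRegressor(hidden_layer_sizes=(5,), activation="tanh",
                   solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X_train, y_train)

print("Training R-square:  ", r2_score(y_train, net.predict(X_train)))
print("Validation R-square:", r2_score(y_val, net.predict(X_val)))
```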
Remember that the training set is used to fit the model, the validation set is used to compare models, and the test set is used to evaluate real-world performance after the final model has been selected. You’ll notice that R-square on the training set is fairly high (for observational data like these) at about 88%. R-square on the validation set is lower, at about 76%. The fit statistics on the test set can be ignored until a model is selected.
The next step would probably be tuning the neural model – finding the values of the model hyperparameters to optimize the validation fit statistics.
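Continuing directly from the sketch above (it reuses X_train, y_train, X_val, y_val, and the imports defined there), a simple manual search over candidate hidden-layer sizes might look like this; the candidate values are arbitrary:

```python
# Compare validation R-square across a few candidate hidden-layer sizes.
for n_nodes in (3, 5, 8, 12):
    candidate = MLPRegressor(hidden_layer_sizes=(n_nodes,), activation="tanh",
                             solver="lbfgs", max_iter=2000, random_state=0)
    candidate.fit(X_train, y_train)
    print(n_nodes, "nodes:", r2_score(y_val, candidate.predict(X_val)))
```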
Example of a neural network with a categorical response
Let’s fit the same neural network structure – one hidden layer containing four TanH nodes and one linear node – to the categorical response Quality Level.
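A comparable scikit-learn sketch for a binary stand-in response follows (again with simulated data and an all-TanH approximation of the hidden layer; the misclassification rates quoted below come from the original analysis):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Simulated stand-in: a binary quality indicator derived from three predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 3))
y = (np.tanh(X @ [1.5, -2.0, 0.7]) + 0.2 * rng.normal(size=600) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1)

clf = MLPClassifier(hidden_layer_sizes=(5,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=1)
clf.fit(X_train, y_train)

# Misclassification rate = 1 - accuracy on each partition.
print("Training misclassification:  ", 1 - clf.score(X_train, y_train))
print("Validation misclassification:", 1 - clf.score(X_val, y_val))
```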
We see that the misclassification rate on the training set (6%) is about half the misclassification rate on the validation set (12%). The model describes the training data very well but does not perform as well on the validation data. Remember, the validation data were not used to fit the model.
This behavior, where a model fits the training set much better than the validation set, is called overfitting. Neural models are notorious for overfitting. That’s why it is vital to use honest assessment, including a test set, to select and evaluate neural network models.