An artificial neural network (ANN) is a computational model inspired by the information processing functionality of the brain. But how does the brain compute?
Generally, the central elements of computation are processing, transmission, and storage. Within the brain the neuron is the central computing element. Neurons receive signals and produce responses. The transmission of information at the neural level involves electrical signals – so-called action potentials – based broadly on ions and semi-permeable membranes, and chemical signals at the synapses. In the brain the storage of information corresponds to learning, which occurs at the synapses. These synapses are at the interface between neurons and regulate the transmission of information from neuron to neuron.
An ANN broadly corresponds to the processing paradigm of biological neural networks, with the nodes of the ANN being the central computing elements, similar to neurons. In fact, ANNs are nothing but networks of primitive functions where the chain of function compositions transforms an input to an output. The composition of the computational model is contained implicitly in the interconnections of the nodes and is referred to as the network function. Each node comprises a primitive function transforming its input into an output.
Typically, each input \(x_i\) of a node has an associated weight \(w_i\) by which it is multiplied. The node integrates all its inputs – usually by adding the weighted inputs – and then evaluates its primitive function \(f\). The primitive function \(f\) computed in the node can be any function, but common choices are differentiable functions such as the sigmoid function. Models of ANNs differ mainly in their choice of the primitive function and the topology of the network, and more rarely in the timing of the evaluation of the primitive function. In feed-forward ANNs the network is composed of distinct layers where each neuron only receives input from neurons of the previous layer. Accordingly, a feed-forward network has a distinct input and output layer, with the intermediate layers being referred to as hidden layers.
(A second class of ANNs are recurrent networks where connections between nodes form directed cycles.)
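The node and layer structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`node_output`, `layer_output`, `feed_forward`), the 2-2-1 topology, and the weight values are hypothetical and not taken from the text, and the sigmoid is used as one possible primitive function.

```python
import math

def sigmoid(e):
    """Logistic sigmoid, one common differentiable primitive function."""
    return 1.0 / (1.0 + math.exp(-e))

def node_output(weights, inputs, f=sigmoid):
    """Multiply each input by its weight, integrate by summation, apply f."""
    e = sum(w * x for w, x in zip(weights, inputs))
    return f(e)

def layer_output(weight_rows, inputs, f=sigmoid):
    """One layer: each node receives all outputs of the previous layer."""
    return [node_output(row, inputs, f) for row in weight_rows]

def feed_forward(layers, inputs, f=sigmoid):
    """Evaluate the network function by passing the signal layer by layer."""
    signal = inputs
    for weight_rows in layers:
        signal = layer_output(weight_rows, signal, f)
    return signal

# Hypothetical 2-2-1 network; the weights are arbitrary illustration values.
hidden = [[0.5, -0.5], [1.0, 1.0]]   # two hidden nodes, two inputs each
output = [[1.0, -1.0]]               # one output node, two inputs
y = feed_forward([hidden, output], [1.0, 1.0])
```

Here the network function is simply the nested composition of the node functions, which is what makes the topology (the list of weight rows per layer) part of the model.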
The network function of an ANN can be understood as a universal function approximator. However, unlike a Taylor or Fourier series, the function to be approximated is given not explicitly but implicitly, through a representative set of input-output examples. It is the task of the learning algorithm to adjust the parameters of the ANN so that they reflect the input-output examples and extrapolate to new input patterns as well as possible. The learning algorithm is an adaptive method by which the network self-organises to reflect the function to be approximated. The computational effort relates directly to the number of parameters, and therefore to the topology of the network, and increases substantially for more complicated ANNs. It was not until the proposal of back-propagation as a learning algorithm [Werbos, 1974] that the application of ANNs gained momentum, and it has been the most widely used algorithm for neural network learning ever since.
The back-propagation algorithm uses gradient descent on the error function of an ANN in weight space. Thus, the weights of an ANN which minimise its error function are considered to be the solution of the learning problem. As a precondition for gradient descent, the error function of an ANN needs to be continuous and differentiable. Since the ANN is simply the composition of its primitive functions, the error function is differentiable if the network's primitive functions are themselves differentiable.
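As a small illustration of the differentiability requirement, the sketch below takes the sigmoid as the primitive function and compares its closed-form derivative, \(f(e)(1 - f(e))\), with a central finite difference. The helper names and the step size `h` are assumptions made for this example only.

```python
import math

def sigmoid(e):
    """Logistic sigmoid primitive function."""
    return 1.0 / (1.0 + math.exp(-e))

def sigmoid_derivative(e):
    """Closed-form derivative of the sigmoid: f(e) * (1 - f(e))."""
    s = sigmoid(e)
    return s * (1.0 - s)

def central_difference(f, e, h=1e-6):
    """Numerical derivative, used here only as a cross-check."""
    return (f(e + h) - f(e - h)) / (2.0 * h)

# The analytic and numerical derivatives should agree closely.
for e in (-2.0, 0.0, 3.0):
    analytic = sigmoid_derivative(e)
    numeric = central_difference(sigmoid, e)
    assert abs(analytic - numeric) < 1e-8
```

Because the derivative can be written in terms of the node's own output, it is cheap to compute and store during the forward evaluation.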
In the back-propagation algorithm the weights of an ANN are initialised randomly. Next, the gradient of the error function is computed recursively and the weights of the ANN are adjusted accordingly using gradient descent. Because an ANN is a long chain of function compositions, the chain rule plays a central role in calculating the gradient of the network's error function. The back-propagation algorithm implements the chain rule for the recursive calculation of the gradient of the error function in weight space in a very efficient manner.
Learning in an ANN with back-propagation consists of two stages: in the first stage – the feed-forward step – the information progresses from the input layer through the network towards the output layer.
Each node of the network evaluates its primitive function \(f_j(e)\) and emits the result \(y_j\) to the connected nodes in the subsequent layer. Additionally, each node calculates and stores the derivative of its primitive function \(df_j(e)/de\).
The second stage – the back-propagation step – consists of reversing the flow of information through the network: a unit input propagates from the output layer towards the input layer, with the activation of each neuron now being the back-propagation term \(\delta_j\).
At each node the back-propagation term \(\delta_j\) is multiplied by the derivative of the node's primitive function stored during the preceding feed-forward step, which gives the gradient term \((d f_j(e)/de) \delta_j\) in weight space.
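To make the back-propagation term concrete, here is a sketch of the recursion under assumptions that the text does not state explicitly: a squared error \(E = \tfrac{1}{2}\sum_k (t_k - y_k)^2\) with target values \(t_k\), and a summed node input \(e_j = \sum_i w_{i,j}\, y_i\). The symbols \(E\), \(t_k\), and \(e_j\) are introduced here for illustration only. With these assumptions, applying the chain rule layer by layer gives

$$
\delta_k = t_k - y_k \quad \text{(output node } k\text{)}, \qquad
\delta_i = \sum_j w_{i,j} \frac{d f_j(e_j)}{d e_j} \delta_j \quad \text{(hidden node } i\text{)},
$$

$$
-\frac{\partial E}{\partial w_{i,j}} = y_i \frac{d f_j(e_j)}{d e_j} \delta_j .
$$

In this convention \(\delta_j\) is the negative derivative of the error with respect to the output \(y_j\), so the expression on the right is the descent direction for the weight \(w_{i,j}\).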
Finally, the weights are updated using gradient descent as given by
$$
w'_{i,j} = w_{i,j} + \alpha y_{i} \frac{d f_j(e)}{de} \delta_j
$$
with \(\alpha\) being the learning rate and \(w_{i,j}\) being the weight of the feed-forward connection from neuron \(i\) in the previous layer to neuron \(j\) in the subsequent layer.
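The two stages and the weight update can be put together in a short Python sketch. It is not a reference implementation. It assumes a sigmoid primitive function, a squared error with a hypothetical target value, no bias terms, and an arbitrary 2-3-1 topology. The names `forward` and `backward` are illustrative, and the nested list `W[j][i]` stores the weight \(w_{i,j}\).

```python
import math
import random

def sigmoid(e):
    return 1.0 / (1.0 + math.exp(-e))

def forward(layers, x):
    """Feed-forward step: store each layer's outputs and derivatives."""
    outputs, derivs = [x], []
    for W in layers:                     # W[j][i] holds the weight w_{i,j}
        prev = outputs[-1]
        y = [sigmoid(sum(w * p for w, p in zip(row, prev))) for row in W]
        outputs.append(y)
        derivs.append([v * (1.0 - v) for v in y])   # stored df_j(e)/de
    return outputs, derivs

def backward(layers, outputs, derivs, target, alpha):
    """Back-propagation step followed by the gradient descent update."""
    # Assumed output-layer term: delta_k = t_k - y_k (squared error).
    delta = [t - y for t, y in zip(target, outputs[-1])]
    for l in reversed(range(len(layers))):
        W, prev = layers[l], outputs[l]
        grad = [d * g for d, g in zip(derivs[l], delta)]   # f'_j * delta_j
        # Propagate delta to the previous layer using the old weights.
        delta = [sum(W[j][i] * grad[j] for j in range(len(W)))
                 for i in range(len(prev))]
        # w'_{i,j} = w_{i,j} + alpha * y_i * f'_j * delta_j
        for j, row in enumerate(W):
            for i in range(len(row)):
                row[i] += alpha * prev[i] * grad[j]

rng = random.Random(0)
sizes = [2, 3, 1]                        # hypothetical 2-3-1 network
layers = [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
           for _ in range(n_out)]
          for n_in, n_out in zip(sizes, sizes[1:])]
x, target = [1.0, 0.5], [0.9]            # one made-up training example

def error(layers):
    y = forward(layers, x)[0][-1]
    return 0.5 * sum((t - v) ** 2 for t, v in zip(target, y))

error_before = error(layers)
for _ in range(200):
    outputs, derivs = forward(layers, x)
    backward(layers, outputs, derivs, target, alpha=0.5)
error_after = error(layers)
```

Repeating the two steps on the single example drives the error down. A practical training loop would iterate over many input-output examples and usually include bias weights, both of which are omitted here for brevity.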
[Werbos, 1974] Beyond regression: New tools for prediction and analysis in the behavioural sciences, Ph.D. Thesis, Harvard University (1974).
[Gurney, 1997] An introduction to neural networks, UCL Press (1997).
[Montavon, 1998] Neural Networks: Tricks of the Trade, Springer (1998).