In a Multi-Layer Perceptron we tune the network by adjusting its weights as training proceeds. To do this efficiently we need to calculate the partial derivative of the error with respect to each weight.

Here's how.

# Backpropagation Algorithm

The backpropagation algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs.

It's called backpropagation because we're going backwards from the output to the hidden nodes, efficiently calculating (and then reusing) partial derivatives to adjust the weights.

The weight updates are not local (each one depends on error terms passed back from later layers), so this probably can't actually happen in the brain.

## Partial Derivatives (Usefulness of Backpropagation)

Partial derivatives measure how sensitive the error is to each weight.
They tell you which weights make the most difference to the output (i.e. which ones are most responsible for it going wrong), and so which weights to change. We make the biggest changes to the weights with the largest partial derivatives.

Partial derivatives can be calculated efficiently by backpropagating deltas through the network.
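To make "sensitivity" concrete, here is a minimal sketch for a single sigmoid unit with squared error. It estimates each partial derivative by nudging one weight and compares that with the analytic formula. The function names and the example weights, inputs and target are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def unit_error(w, x, t):
    """Squared error of a single sigmoid unit with weights w on inputs x."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (t - o) ** 2

def numeric_partial(w, x, t, i, eps=1e-6):
    """Central-difference estimate of dE/dw[i]: nudge one weight, watch the error."""
    up = list(w)
    up[i] += eps
    down = list(w)
    down[i] -= eps
    return (unit_error(up, x, t) - unit_error(down, x, t)) / (2 * eps)

def analytic_partial(w, x, t, i):
    """dE/dw[i] = -(t - o) * o * (1 - o) * x[i] for a sigmoid unit."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(t - o) * o * (1 - o) * x[i]

# Illustrative values only.
w = [0.5, -0.3]
x = [1.0, 2.0]
t = 1.0
for i in range(len(w)):
    print(i, numeric_partial(w, x, t, i), analytic_partial(w, x, t, i))
```

In this example the second weight has the larger partial derivative (its input is larger), so it is the one that gets changed the most.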

## Code

The input from unit i into unit j is denoted x[j,i], and the weight from unit i to unit j is denoted w[j,i].

    for each network weight w[i,j]:
        w[i,j] = small random number;

    while (!satisfied) {
        for each training example:
            input it to the network and compute the outputs
            // error term for each output unit k (t[k] is the target output)
            for each output k we calculate the error term:
                δ[k] = o[k](1 − o[k])(t[k] − o[k]);
            // error term for each hidden unit h, reusing the output error terms
            for each hidden unit h we calculate the error term:
                δ[h] = o[h](1 − o[h]) Sum over k∈outputs of w[k,h]δ[k];
            for each network weight w[i,j]:
                ∆w[i,j] = ηδ[i]x[i,j]; // slides said error[j], but textbook disagrees.
                w[i,j] = w[i,j] + ∆w[i,j];
    }
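The pseudocode above can be sketched in plain Python roughly as follows. This is a minimal illustration, not a reference implementation: it assumes one hidden layer, sigmoid units everywhere, and a constant input of 1 standing in for a bias (none of which is spelled out above). All function names are made up.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def init_weights(n_in, n_hidden, n_out, rng):
    """Small random weights; w[j][i] is the weight from unit i into unit j."""
    w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden)] for _ in range(n_out)]
    return w_hidden, w_out

def forward(w_hidden, w_out, x):
    """Propagate input x forward; return hidden and output activations."""
    o_hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    o_out = [sigmoid(sum(w * oh for w, oh in zip(row, o_hidden))) for row in w_out]
    return o_hidden, o_out

def squared_error(w_hidden, w_out, x, t):
    _, o_out = forward(w_hidden, w_out, x)
    return 0.5 * sum((tk - o) ** 2 for tk, o in zip(t, o_out))

def backprop_step(w_hidden, w_out, x, t, eta):
    """One gradient descent update on a single training example (x, t)."""
    o_hidden, o_out = forward(w_hidden, w_out, x)
    # Output error terms: delta[k] = o[k](1 - o[k])(t[k] - o[k])
    d_out = [o * (1 - o) * (tk - o) for o, tk in zip(o_out, t)]
    # Hidden error terms: delta[h] = o[h](1 - o[h]) * sum_k w[k][h] * delta[k]
    # (computed before any weight is changed, so the old weights are used)
    d_hidden = [
        oh * (1 - oh) * sum(w_out[k][h] * d_out[k] for k in range(len(w_out)))
        for h, oh in enumerate(o_hidden)
    ]
    # Weight updates: w[j][i] += eta * delta[j] * (input from i into j)
    for k, row in enumerate(w_out):
        for h in range(len(row)):
            row[h] += eta * d_out[k] * o_hidden[h]
    for h, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * d_hidden[h] * x[i]

# Illustrative usage: the last input is the assumed constant bias of 1.
rng = random.Random(0)
w_hidden, w_out = init_weights(n_in=3, n_hidden=2, n_out=1, rng=rng)
x, t = [1.0, 0.0, 1.0], [1.0]
before = squared_error(w_hidden, w_out, x, t)
backprop_step(w_hidden, w_out, x, t, eta=0.5)
after = squared_error(w_hidden, w_out, x, t)
print(before, after)
```

A full training run would just call `backprop_step` for every training example, over and over, until the error is small enough.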


## Speed

Training is very slow: it can take thousands of iterations.
Using the network after training is very fast.

# Cross Entropy

The least squares error function is a poor fit for classification, where the target is 0 or 1. We can instead use the cross entropy error function. The mathematical theory behind this is maximum likelihood.

We use the following error formula (rather than a sum of squares), where t is the target and z is the network output:

(1)
\begin{equation} E = -t \log(z) - (1 - t)\log(1-z) \end{equation}
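A quick sketch of how this behaves compared with squared error for a single output. The function names and the sample outputs are made up for illustration, and z is assumed to be strictly between 0 and 1.

```python
import math

def cross_entropy(t, z):
    """E = -t log(z) - (1 - t) log(1 - z), for target t in {0, 1} and 0 < z < 1."""
    return -t * math.log(z) - (1 - t) * math.log(1 - z)

def squared_error(t, z):
    return 0.5 * (z - t) ** 2

# Target is 1; the outputs go from confidently right to confidently wrong.
for z in (0.9, 0.5, 0.01):
    print(z, cross_entropy(1, z), squared_error(1, z))
```

Cross entropy keeps growing as the output gets more confidently wrong, while the squared error of a single output can never exceed 0.5.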

## Maximum Likelihood

H is a class of hypotheses, and h is any hypothesis within it.
$P(D|h)$ = probability of the dataset D being generated under hypothesis h ∈ H. Viewed as a function of h, this is called the likelihood.
$\log P(D|h)$ is called the log likelihood.

The Maximum Likelihood Principle is just choosing the h which maximises the likelihood. Because the log is monotonically increasing, the h with the highest log likelihood also has the highest probability of generating the data.
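To see how this connects to cross entropy, here is a sketch of the derivation. It assumes (this isn't stated above) that the examples are independent and that the network output $z_i$ is read as the probability that target $t_i$ is 1:

\begin{align} \log P(D|h) = \log \prod_i z_i^{t_i}(1 - z_i)^{1 - t_i} = \sum_i \left[ t_i \log(z_i) + (1 - t_i)\log(1 - z_i) \right] \end{align}

Each term in the sum is the negative of the cross entropy error for one example, so under these assumptions maximising the likelihood is the same as minimising the summed cross entropy error.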

## Example - Cross Entropy with Least Squares Line Fitting

# Weight Decay

Weights tend to get larger and larger during training, until the units become saturated. Weight decay penalises large weights and so keeps them smaller.

This is based on our assumption that small weights are more likely to be correct than large weights (i.e. their prior probability is higher):

(2)
\begin{align} P(w) = \frac{1}{Z}e^{-\frac{\lambda}{2}\sum_j w_j^2} \end{align}

Z is a normalising constant, and λ is a constant we need to choose (either empirically or from experience).

Error then becomes:

(3)
\begin{align} E = \frac{1}{2}\sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2 \end{align}

The mathematical theory behind this is Bayes' Theorem.
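In terms of the weight update, the penalty term adds λ times each weight to its partial derivative, so every step also pulls the weight towards zero. A minimal sketch, where the function name `decayed_update` and all the numbers are made up for illustration:

```python
def decayed_update(w, grad, eta, lam):
    """Gradient descent step on E = data_error + (lam / 2) * sum(w_j ** 2).

    grad holds the partial derivatives of the data error only; the penalty
    term contributes an extra lam * w_j to each one.
    """
    return [wj - eta * (gj + lam * wj) for wj, gj in zip(w, grad)]

# With a zero data gradient, each step shrinks every weight by (1 - eta * lam).
w = [2.0, -4.0]
for _ in range(3):
    w = decayed_update(w, grad=[0.0, 0.0], eta=0.1, lam=0.5)
print(w)
```

This shrinking is where the name "weight decay" comes from: weights that the data doesn't actively support fade away.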

## Bayes' Rule

If H is a class of hypotheses and h ∈ H:
$P(D|h)$ = probability of data D being generated under hypothesis h.
$P(h|D)$ = probability that h is correct, given that data D were observed.

Bayes' Theorem:

(4)
\begin{align} P(h|D) = \frac{P(D|h)P(h)}{P(D)} \end{align}

P(h) is called the prior.
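As a sketch of how this leads to the weight decay error: assume (not stated above) Gaussian noise on the targets, so that $P(D|w) \propto e^{-\frac{1}{2}\sum_i (z_i - t_i)^2}$, and take the prior over the weights to be $P(w) = \frac{1}{Z}e^{-\frac{\lambda}{2}\sum_j w_j^2}$. Taking the negative log of Bayes' Theorem with h = w gives:

\begin{align} -\log P(w|D) = \frac{1}{2}\sum_i (z_i - t_i)^2 + \frac{\lambda}{2}\sum_j w_j^2 + \text{const} \end{align}

The constant collects the terms that don't depend on w (log Z, log P(D) and the likelihood's normaliser). So, under these assumptions, choosing the most probable weights given the data is the same as minimising the weight decay error.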

# Weight Momentum

Including a momentum term, weighted by α (alpha), can help gradient descent minimise the error over the training examples.

(5)
\begin{align} \Delta w_{i,j}(n) = \eta\delta_i x_{i,j} + \alpha\Delta w_{i,j}(n - 1) \end{align}

Or globally:

(6)
\begin{align} \delta w \leftarrow \alpha \delta w + (1 - \alpha) \frac{\partial E}{\partial w} \end{align}
(7)
\begin{align} w \leftarrow w - \eta \delta w \end{align}

tl;dr momentum helps dampen sideways oscillations (not useful) but amplifies downhill motion (by up to $\frac{1}{1-\alpha}$ with update (5), when the gradient stays constant).
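Both effects can be seen in a small sketch of update (5) for a single weight. Here `base` stands in for the plain gradient step ηδx. The function name and the numbers are made up for illustration.

```python
def momentum_steps(base_steps, alpha):
    """Return the steps delta(n) = base(n) + alpha * delta(n - 1)."""
    delta, out = 0.0, []
    for base in base_steps:
        delta = base + alpha * delta
        out.append(delta)
    return out

# Downhill: the plain gradient step is the same (0.1) every iteration.
downhill = momentum_steps([0.1] * 200, alpha=0.9)

# Sideways: the plain gradient step flips sign every iteration.
sideways = momentum_steps([0.1 if n % 2 == 0 else -0.1 for n in range(200)], alpha=0.9)

print(downhill[0], downhill[-1])        # step size grows towards 0.1 / (1 - 0.9)
print(abs(sideways[0]), abs(sideways[-1]))  # step size shrinks towards 0.1 / (1 + 0.9)
```

With α = 0.9 the steady downhill step ends up about ten times larger than the plain step, while the oscillating step ends up at roughly half its original size.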

## Convergence

The effect of α is to add momentum that tends to keep the ball rolling in the same direction from one iteration to the next. This can sometimes have the effect of keeping the ball rolling through small local minima in the error surface, or along flat regions in the surface where the ball would stop if there were no momentum. It also has the effect of gradually increasing the step size of the search in regions where the gradient is unchanging, thereby speeding convergence.

# Conjugate or Natural Gradients (vs Linear Descent)

We can change the function that we try to minimise. Instead of looking for the steepest gradient, we could compute a matrix of second derivatives (the Hessian) and then look for the minimum of a quadratic function used to approximate the landscape.

(8)
\begin{align} \frac{\partial^2 E}{\partial w_i \partial w_j} \end{align}

We could also use methods from information geometry to find a 'natural' re-scaling of the partial derivatives.
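The Hessian idea can be sketched with a plain Newton-style step, which jumps straight to the minimum of the local quadratic approximation. To be clear, this is neither conjugate gradients nor natural gradients, just the simplest use of second derivatives. The two-weight error surface and the function names below are made up for illustration.

```python
def newton_step_2d(w, grad, hessian):
    """Return w - H^{-1} grad for a 2x2 Hessian: the minimum of the quadratic model."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    # Solve H * step = grad using the closed-form 2x2 inverse.
    step0 = (d * grad[0] - b * grad[1]) / det
    step1 = (-c * grad[0] + a * grad[1]) / det
    return [w[0] - step0, w[1] - step1]

# Hypothetical error surface E(w) = 0.5 * (10 * w0**2 + w1**2): a long narrow valley.
def grad_E(w):
    return [10.0 * w[0], 1.0 * w[1]]

hessian = [[10.0, 0.0], [0.0, 1.0]]  # second derivatives of E, constant here
w = [1.0, 1.0]

newton = newton_step_2d(w, grad_E(w), hessian)
steepest = [wi - 0.1 * gi for wi, gi in zip(w, grad_E(w))]  # one step with eta = 0.1
print(newton, steepest)
```

Because this surface is exactly quadratic, the Newton step lands on the minimum in one go, whereas the steepest descent step barely moves along the shallow direction.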

# General Training Tips

• Rescale inputs and outputs to be from 0 to 1 or -1 to 1 (to match the range of the sigmoid function)
• Initialise weights to be very small random values
• Use batch learning (mentioned up in perceptrons)
• Prevent overfitting
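
As an example of the first tip, rescaling can be as simple as a linear min-max mapping. A minimal sketch, where the function name `rescale` and the sample values are made up for illustration:

```python
def rescale(values, lo=0.0, hi=1.0):
    """Linearly map values so that the smallest becomes lo and the largest hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]  # constant feature: no spread to rescale
    scale = (hi - lo) / (v_max - v_min)
    return [lo + (v - v_min) * scale for v in values]

unit_range = rescale([10.0, 15.0, 20.0])             # range 0 to 1
signed_range = rescale([10.0, 15.0, 20.0], -1.0, 1.0)  # range -1 to 1
print(unit_range, signed_range)
```

The minimum and maximum should be taken from the training data and then reused for any new inputs, so that everything is scaled the same way.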