Perceptron (ANN)

# Perceptron

The perceptron is a type of artificial neural networks. It can be seen as the simplest kind of feedforward neural network: a linear classifier.

Originally, a (discontinuous) step function was used for the transfer function:

(1)
\begin{align} \text{outcome}(x_1,...,x_n) = \left \{ \frac{1\text{ if }w_0+w_1x_1 + ··· +w_nx_n > 0}{ −1 \text{ otherwise}} \right \} \end{align}

or in simpler vector notation:

(2)
\begin{align} \text{outcome}(\vec{x}) = \left \{ \frac{\text{1 if }\vec{w} . \vec{x} \text{ > 0}}{\text{−1 otherwise}} \right \} \end{align}

## Perceptron Training Rule for Linear Models

(3)
\begin{align} w_i \leftarrow w_i + \Delta w_i \end{align}

where

(4)
\begin{align} \Delta w_i = \eta (t-o)x_i \end{align}

where

• $t = c(\vec{x})$ is the target value
• o is the perceptron output
• $\eta$ is a small constant (e.g. .1) called the learning rate

It will converge if $\eta$ is sufficiently small and the training data is linearly separable.

## Gradient Descent for Linear Models

The perceptron uses Gradient Descent to adapt the weight parameters to a function in order to minimise the squared error ($E[\vec{w}]$.

Gradient is the change in the Error (and we're looking for the steepest gradient down the error curve to reduce it maximally)

(5)
\begin{align} \nabla E[\vec{w}] \equiv \left[ \frac{\delta E}{\delta w_0}, \frac{\delta E}{\delta w_1}, ..., \frac{\delta E}{\delta w_n} \right] \end{align}

### Linear Unit/Gradient Descent Training Rule

Training Rule (the change in a weight is the individual partial derivative multiplied by a small constant).

(6)
\begin{align} \Delta w_i = - \eta \frac{\delta E}{\delta w_i} \end{align}
(7)
\begin{align} \frac{\delta E}{\delta w_i} = \sum_d (t_d - o_d)(-x_{i,d}) \end{align}

^how much we need to adjust each weight by is the sum over all the training instances of the distance between the expected output and actual ouptut, multiplied by the negative x of that particular instance.

Gradient Descent Algorithm (training_examples, η) {
//Each training example is a pair [[$<vec{x}, t>$]]
//where the first is the vector of input values
//and the t is the target output value.

Set each w[i] to be a small random value
while (termination != met) {
set each ∆w[i] = 0;
for each training example:
o = output(instance's vector);
for each weight:
// Work out how we need to adjust the weight
∆w[i]  = ∆w[i] +η(t−o)x[i];
// And then do so
w[i] = w[i] + ∆w[i];
}


The difference between Batch Mode and Incremental Mode is that Batch Mode takes the overall gradient of the whole training data and modifies the weight vector based on that, whereas Incremental Mode computes the gradient of each individual training instance and modifies the weight after each instance.

Repeat until done:

1. Compute $\nabla E_{D}[\vec{w}]$
2. $\vec{w} = \vec{w} - \eta \nabla E_D [\vec{w}]$

Repeat until done:
For each training example d in D:

1. Compute $\nabla E_{D}[\vec{w}]$
2. $\vec{w} = \vec{w} - \eta \nabla E_D [\vec{w}]$

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if the η is small enough.

# Multi-Layer Perceptron

The multilayer perceptron is a feedforward artificial neural network. It's an improvement upon the perceptron in that it can distinguish data that is not linearly separable.

(These are networks of Sigmoid Units)

The sigmoid function is essentially the perceptron without the allocation to either 1 or -1 - it can be in between. It's the sideways s-shaped graph that is often used to describe learning over time.

The sigmoid function:

(8)
\begin{align} \sigma(x) = \frac{1}{1+ e^{-x}} \end{align}

This is very similar to the hyperbolic tangent function:

(9)
\begin{align} \sigma(x) = 2 \frac{1}{1+ e^{-2x}} - 1 \end{align}

## The Gradient Descent Error for a Sigmoid Unit

Using the property below we can derive graident descent rules to train a sigmoid unit (or a multilayer network of sigmoid units with backpropagation).

(10)
\begin{align} \frac{d\sigma(x)}{dx} = \sigma(x)(1-sigma(x)) \end{align}

There's a bunch of complicated maths behind the error in this, but it boils down to:

(11)
\begin{align} \frac{\delta E}{\delta w_i} = -\sum_{d \in D} (t_d - o_d)o_d(1 - o_d)x_{i,d} \end{align}

### Global vs Local Minima

One major difference in the case of multilayer networks is that the error surface can have multiple local minima, in contrast to the single-minimum parabolic error surface shown above in Gradient Descent. Unfortunately, this means that gradient descent is guaranteed only to converge toward some local minimum, and not necessarily the global minimum error. Often produces good results anyway.