# Perceptron

The perceptron is a type of artificial neural networks. It can be seen as the simplest kind of feedforward neural network: a linear classifier.

Originally, a (discontinuous) step function was used for the transfer function:

(1)or in simpler vector notation:

(2)## Perceptron Training Rule for Linear Models

(3)where

(4)where

- $t = c(\vec{x})$ is the target value
*o*is the perceptron output- $\eta$ is a small constant (e.g. .1) called the
*learning rate*

It will converge if $\eta$ is sufficiently small and the training data is linearly separable.

## Gradient Descent for Linear Models

The perceptron uses Gradient Descent to adapt the weight parameters to a function in order to minimise the squared error ($E[\vec{w}]$.

Gradient is the change in the Error (and we're looking for the steepest gradient down the error curve to reduce it maximally)

(5)### Linear Unit/Gradient Descent Training Rule

Training Rule (the change in a weight is the individual partial derivative multiplied by a small constant).

(6)^how much we need to adjust each weight by is the sum over all the training instances of the distance between the expected output and actual ouptut, multiplied by the negative x of that particular instance.

### Gradient Descent Algorithm

Gradient Descent Algorithm (training_examples, η) { //Each training example is a pair [[$<vec{x}, t>$]] //where the first is the vector of input values //and the t is the target output value. Set each w[i] to be a small random value while (termination != met) { set each ∆w[i] = 0; for each training example: o = output(instance's vector); for each weight: // Work out how we need to adjust the weight ∆w[i] = ∆w[i] +η(t−o)x[i]; // And then do so w[i] = w[i] + ∆w[i]; }

### Stochastic/Incremental Gradient Descent

The difference between Batch Mode and Incremental Mode is that Batch Mode takes the overall gradient of the whole training data and modifies the weight vector based on that, whereas Incremental Mode computes the gradient of each individual training instance and modifies the weight after each instance.

#### Batch Mode Gradient Descent

Repeat until done:

- Compute $\nabla E_{D}[\vec{w}]$
- $\vec{w} = \vec{w} - \eta \nabla E_D [\vec{w}]$

#### Incremental Mode Gradient Descent

Repeat until done:

For each training example d in D:

- Compute $\nabla E_{D}[\vec{w}]$
- $\vec{w} = \vec{w} - \eta \nabla E_D [\vec{w}]$

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if the η is small enough.

# Multi-Layer Perceptron

The multilayer perceptron is a feedforward artificial neural network. It's an improvement upon the perceptron in that it can distinguish data that is not linearly separable.

(These are networks of Sigmoid Units)

The sigmoid function is essentially the perceptron without the allocation to either 1 or -1 - it can be in between. It's the sideways s-shaped graph that is often used to describe learning over time.

The sigmoid function:

(8)This is very similar to the hyperbolic tangent function:

(9)## The Gradient Descent Error for a Sigmoid Unit

Using the property below we can derive graident descent rules to train a sigmoid unit (or a multilayer network of sigmoid units with *backpropagation*).

There's a bunch of complicated maths behind the error in this, but it boils down to:

(11)### Global vs Local Minima

One major difference in the case of multilayer networks is that the error surface can have multiple local minima, in contrast to the single-minimum parabolic error surface shown above in Gradient Descent. Unfortunately, this means that gradient descent is guaranteed only to converge toward some local minimum, and not necessarily the global minimum error. Often produces good results anyway.

## Weight Adjustment Calculation Techniques

E.g. Backpropagation - going backwards from the output to the hidden nodes, efficiently calculating (and then reusing) partial derivatives to adjust the weights.

# Perceptron Training vs Linear Unit Training vs Sigmoid Function

(12)Perceptron will succeed if the training examples are linearly separable and there's a sufficiently small η (If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it).

Linear Unit training rule will succeed (converge to the hypothesis with the minimum squared error) if there's a sufficiently small η (even with noise and inseparable data).