Overfitting of Data (AI, ML)

We say a hypothesis 'overfits' the data if we can find a different hypothesis with more training error but less error over the entire distribution of data.

So, let h and h' be two hypotheses. We say that h overfits the training data if

(1)
\begin{align} error_{train}(h) < error_{train}(h') \text{ and } error_{alldata}(h) > error_{alldata}(h') \end{align}

I.e. we don't like hypotheses that seem really accurate on the training data but turn out to be much less accurate on unseen (test) data.
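
As a minimal illustration of this definition (the error values below are invented for the example, not taken from any dataset):

```python
# h overfits relative to h' if h has lower training error but
# higher error over the whole distribution of data.
def overfits(h_train, h_all, h2_train, h2_all):
    return h_train < h2_train and h_all > h2_all

# h looks better on the training data (5% vs 10% error) but is
# worse on all data (25% vs 12% error), so h overfits.
print(overfits(0.05, 0.25, 0.10, 0.12))  # True
```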

# Avoiding Overfitting

The way to combat overfitting is to simplify the model so that it does not fit the training data too precisely. In artificial neural networks we can remove hidden nodes/connections, limit training time, and use a validation set (sketched below).
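
As a runnable sketch of the validation-set idea in a simpler setting (polynomial curve fitting on synthetic data, rather than a neural network):

```python
# Pick model complexity with a validation set: high-degree polynomials
# fit the training points better but validate worse.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(0, 0.2, 40)          # noisy target
x_tr, y_tr, x_va, y_va = x[:30], y[:30], x[30:], y[30:]

best_deg, best_err = None, float("inf")
for degree in range(1, 10):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    val_err = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    if val_err < best_err:
        best_deg, best_err = degree, val_err

print(best_deg, best_err)  # a moderate degree generalises best
```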

In decision trees, we can use pruning to avoid overfitting in this way.

## Laplace Error

According to Ockham’s Razor, we may wish to prune off branches that do not provide much benefit in classifying the items.

When a node becomes a leaf, all items will be assigned to the majority class
at that node. We can estimate the error rate on the (unseen) test items
using the Laplace error:

(2)
\begin{align} E = 1 - \frac{n+1}{N + k} \end{align}

N = total number of (training) items at the node
n = number of (training) items in the majority class
k = number of classes

If the weighted average Laplace error of the children (the "backed-up" error) exceeds that of the parent node, we prune off the children.
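
As a minimal sketch, equation (2) translates directly into code; exact fractions are used so the results match the hand calculation in the example below:

```python
# A minimal sketch of equation (2): Laplace error from the class
# counts at a node.
from fractions import Fraction

def laplace_error(counts):
    N = sum(counts)   # total (training) items at the node
    n = max(counts)   # items in the majority class
    k = len(counts)   # number of classes
    return 1 - Fraction(n + 1, N + k)

print(laplace_error([7, 3]))  # 1/3
```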

### Example

A parent node with class counts [7,3] splits into two children with counts [5,2] and [2,1]:

[7,3]
[5,2] [2,1]

Laplace error = 1 - (n+1)/(N+k)
n = #items in majority class (i.e. 7 at the parent)
N = #total items at the node (i.e. 10 at the parent)
k = #classes (i.e. 2)

therefore

error(parent) = 1 - 8/12 = 1/3
error(left) = 1 - 6/9 = 1/3
error(right) = 1 - 3/5 = 2/5

Backed-up error = (7/10)(1/3) + (3/10)(2/5)
= 7/30 + 6/50
= (35+18)/150 = 53/150 > 1/3

Since the backed-up (subtree) error exceeds the parent's error, we prune off the children.
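
The same calculation, reusing the laplace_error sketch above:

```python
# Reproducing the example: a parent [7,3] with children [5,2] and [2,1].
parent, children = [7, 3], [[5, 2], [2, 1]]

parent_err = laplace_error(parent)   # 1/3
N = sum(parent)                      # 10 items reach the parent
backed_up = sum(Fraction(sum(c), N) * laplace_error(c) for c in children)

print(parent_err)              # 1/3
print(backed_up)               # 53/150
print(backed_up > parent_err)  # True, so prune off the children
```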

## Types of Pruning
