Overfitting of Data (AI, ML)


We say a hypothesis h 'overfits' the training data if there exists another hypothesis h' with higher training error than h, but lower error over the whole data distribution.

Formally, let h and h' be hypotheses. Then h overfits the training data if:

\begin{align} error_{train}(h) < error_{train}(h') \text{ and } error_{alldata}(h) > error_{alldata}(h') \end{align}

I.e. we don't want hypotheses that appear highly accurate on the training data but turn out to be inaccurate on unseen (test) data.
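The definition above can be expressed as a simple predicate. This is a minimal sketch; the function name and the example error rates are hypothetical, chosen only to illustrate the comparison:

```python
def overfits(train_err_h, true_err_h, train_err_h2, true_err_h2):
    """h overfits relative to h' if it looks better on the training data
    but is worse on the full data distribution."""
    return train_err_h < train_err_h2 and true_err_h > true_err_h2

# Hypothetical rates: h memorises the training set (2% train error, 25% true
# error), while h' generalises better (10% train error, 15% true error).
print(overfits(0.02, 0.25, 0.10, 0.15))  # → True: h overfits
```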


Avoiding Overfitting

The way to combat overfitting is to simplify the model so that it does not fit the training data too precisely. In artificial neural networks, for example, we can remove hidden nodes or connections, limit training time, or use a validation set.

In decision trees, we can use pruning to simplify the model in the same way.

Laplace Error

According to Ockham’s Razor, we may wish to prune off branches that do
not provide much benefit in classifying the items.

When a node becomes a leaf, all items will be assigned to the majority class
at that node. We can estimate the error rate on the (unseen) test items
using the Laplace error:

\begin{align} E = 1 - \frac{n+1}{N + k} \end{align}

N = total number of (training) items at the node
n = number of (training) items in the majority class
k = total number of classes (possible values of the target attribute)
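The formula is a one-liner in code. A minimal sketch, using `fractions.Fraction` for exact arithmetic (the function name is an assumption, not from the source):

```python
from fractions import Fraction

def laplace_error(n, N, k):
    """Laplace error estimate at a node: E = 1 - (n+1)/(N+k)."""
    return 1 - Fraction(n + 1, N + k)

# Parent node from the worked example below: N=10 items, n=7 in the
# majority class, k=2 classes.
print(laplace_error(7, 10, 2))  # → 1/3
```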

If the weighted average of the children's Laplace errors (the backed-up error) exceeds that of the parent node,
we prune off the children.


Worked example: a parent node with class counts [7,3] splits into two children, [5,2] and [2,1].

          [7,3]
         /     \
     [5,2]     [2,1]

Laplace error = 1 - (n+1)/(N+k)
n = #items in majority class (i.e. 7)
N = #total items (i.e. 10)
k = #classes (i.e. 2)

error(all) = 1 - 8/12 = 1/3
error(left) = 1 - 6/9 = 1/3
error(right) = 1 - 3/5 = 2/5

Backed-up error = (7/10)(1/3) + (3/10)(2/5)
               = 7/30 + 3/25
               = (35 + 18)/150 = 53/150 > 1/3, so prune (i.e. subtree error > parent error)
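The whole pruning decision for this example can be checked numerically. A sketch assuming the `laplace_error` helper defined from the formula above; variable names are illustrative:

```python
from fractions import Fraction

def laplace_error(n, N, k):
    """Laplace error estimate at a node: E = 1 - (n+1)/(N+k)."""
    return 1 - Fraction(n + 1, N + k)

# Class counts [majority, minority] from the example:
# parent [7,3] splits into children [5,2] and [2,1], with k=2 classes.
parent = laplace_error(7, 10, 2)   # 1/3
left   = laplace_error(5, 7, 2)    # 1/3
right  = laplace_error(2, 3, 2)    # 2/5

# Backed-up error: children's errors weighted by their share of the items.
backed_up = Fraction(7, 10) * left + Fraction(3, 10) * right
print(backed_up)           # → 53/150
print(backed_up > parent)  # → True, so prune the children
```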

Types of Pruning