Pruning

We can use pruning to avoid overfitting the training data.

Laplace Error

We can use this error as the base error formula for our pruning functions.

When a node becomes a leaf, all items will be assigned to the majority class
at that node. We can estimate the error rate on the (unseen) test items
using the Laplace error:

(1)
\begin{align} E = 1 - \frac{n+1}{N + k} \end{align}

Where:

  • N = total number of (training) items at the node
  • n = number of (training) items in the majority class
  • k = total number of classes

If the average Laplace error of the children exceeds that of the parent node,
we prune off the children.
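
A minimal sketch of this check in Python. The counts are made up, and the children's errors are combined as a weighted average (weighted by the number of items at each child), which is one reasonable reading of "average" here:

```python
def laplace_error(n_majority, n_total, k_classes):
    """Laplace error E = 1 - (n + 1) / (N + k) for a node treated as a leaf."""
    return 1 - (n_majority + 1) / (n_total + k_classes)

# Hypothetical parent node: 20 items, 14 in the majority class, 2 classes.
parent = laplace_error(14, 20, 2)                        # 1 - 15/22 ≈ 0.32

# Its two children, as (n, N, k) with illustrative counts.
children = [(8, 12, 2), (5, 8, 2)]
child_avg = (sum(laplace_error(n, N, k) * N for n, N, k in children)
             / sum(N for _, N, _ in children))           # ≈ 0.37

if child_avg > parent:
    print("prune the children off")                      # the split doesn't help
```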

Pre-Pruning

Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node.

We use a chi-squared test to identify attributes with a statistically significant association to the class, then run the usual information-gain procedure using only those attributes.
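
A sketch of such a significance filter in Python, using scipy.stats.chi2_contingency. The attribute names, the counts, and the 0.05 significance threshold are illustrative assumptions rather than values from these notes:

```python
from scipy.stats import chi2_contingency

def significant_attributes(tables, alpha=0.05):
    """Keep attributes whose class association at this node passes a
    chi-squared test.  `tables` maps attribute name -> contingency table
    (rows = attribute values, columns = class counts)."""
    keep = []
    for attr, table in tables.items():
        chi2, p_value, dof, expected = chi2_contingency(table)
        if p_value < alpha:                      # statistically significant
            keep.append(attr)
    return keep

# Hypothetical node: 'outlook' looks associated with the class, 'id_code' doesn't.
tables = {
    "outlook": [[9, 1], [2, 8]],    # counts of (class yes, class no) per value
    "id_code": [[5, 5], [6, 4]],
}
print(significant_attributes(tables))   # ['outlook']; information gain then uses only these
```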

Early Stopping

Pre-pruning can suffer from early stopping: tree growth is halted too soon, before useful attribute interactions have had a chance to appear.

An example is the XOR/parity problem, where the target structure is only visible in the fully expanded tree. Since no individual attribute shows a significant association with the outcome class on its own, pre-pruning will not even expand the root node, so the fully expanded tree is never reached.

However, these kinds of problems are uncommon, and pre-pruning is much faster than post-pruning.
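
A small Python illustration of the XOR case: each attribute on its own has zero information gain, so a pre-pruned tree never gets past the root, even though the full two-level tree classifies the data perfectly:

```python
from math import log2

def entropy(labels):
    counts = [labels.count(c) for c in set(labels)]
    return -sum(n / len(labels) * log2(n / len(labels)) for n in counts)

def info_gain(rows, attr):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# XOR data: class = a XOR b.
xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(info_gain(xor, 0), info_gain(xor, 1))   # 0.0 0.0 -- neither split looks useful
```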

Post-Pruning

Grow the full tree, then remove the sub-trees that are overfitting. This avoids the problem of 'early stopping'.

Growing the full tree lets us see all the attribute interactions within it, but some of those sub-trees will only reflect chance regularities in the training data, and it is not obvious from the tree alone which ones they are.

The pruning operations applied afterwards are:

  • Subtree replacement
    • Bottom-up: once all of a node's sub-trees have been considered, the node may be replaced by a leaf predicting the dominant (majority) outcome
  • Subtree raising
    • Similar in spirit to tree rotations; slower and more complex than subtree replacement

Possible strategies for these are:

  • Error estimation
  • Significance testing
  • MDL principle

Reduced-Error Pruning or Minimal-Error Pruning

When post-pruning we need some way of deciding what to prune. Reduced-error pruning works as follows:

  • Split the data into a training set and a validation set
  • Repeat until further pruning is harmful:
    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
    2. Greedily remove the one that most improves the validation set accuracy

This leads to the smallest version of the most accurate subtree, but reduces the effective size of the training set.
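
A sketch of the procedure in Python, under an assumed minimal tree representation (an internal node tests one attribute and stores the majority class seen in training; a node with no children is a leaf). It is illustrative only, not a faithful reproduction of any particular implementation:

```python
class Node:
    """Minimal decision-tree node: leaves predict `majority`; internal
    nodes test `attr` and route on the attribute's value."""
    def __init__(self, attr=None, children=None, majority=None):
        self.attr = attr
        self.children = children or {}      # attribute value -> Node
        self.majority = majority            # majority class at this node

    def is_leaf(self):
        return not self.children

def predict(node, x):
    while not node.is_leaf():
        child = node.children.get(x.get(node.attr))
        if child is None:                   # unseen value: fall back to majority
            return node.majority
        node = child
    return node.majority

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if node.is_leaf():
        return []
    return [node] + [n for c in node.children.values() for n in internal_nodes(c)]

def reduced_error_prune(tree, validation):
    """Greedily replace sub-trees with leaves while validation accuracy
    does not drop (ties broken arbitrarily)."""
    while True:
        base = accuracy(tree, validation)
        best_gain, best_node = 0.0, None
        for node in internal_nodes(tree):
            saved = node.children
            node.children = {}                          # tentatively prune
            gain = accuracy(tree, validation) - base
            node.children = saved                       # restore
            if gain >= best_gain:
                best_gain, best_node = gain, node
        if best_node is None:                           # all prunes are harmful
            return tree
        best_node.children = {}                         # commit the best prune

# Toy usage: a split on 'a' that doesn't help on the validation data,
# so the stump collapses to a single majority-class leaf.
tree = Node(attr="a", majority="yes",
            children={0: Node(majority="yes"), 1: Node(majority="no")})
val = [({"a": 0}, "yes"), ({"a": 1}, "yes")]
pruned = reduced_error_prune(tree, val)
print(pruned.is_leaf(), pruned.majority)    # True yes
```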

Error-Based Pruning

C4.5, which uses error-based pruning, is the successor to ID3. It post-prunes using only the training set; the pruned tree can also be converted to rules (see Rule Post-Pruning below).

The goal is to estimate the error on unseen data using only training-set data. If replacing a sub-tree with a leaf does not increase the estimated error, we prune it.

Method

We use a standard Bernoulli-process-based method to derive an upper confidence limit on the error from the training data, with a default confidence level of 25%.
Note: this is statistically motivated rather than statistically valid, but it works well in practice.

Error Estimate

The error estimate for a subtree is the sum of the error estimates of all its leaves, weighted by the number of instances at each leaf.

The error estimate for a node/leaf is:

(2)
\begin{align} e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}} \end{align}

Where:

  • e = error estimate for the node
  • f = observed error rate on the training data at the node
  • N = number of instances covered by the node
  • z = standard normal deviate corresponding to the confidence level c

As c goes down, z goes up. We calculate z by the following:

(3)
\begin{align} \Pr\left[\frac{f - q}{\sqrt{q(1-q)/N}} > z\right] = c \end{align}

Where q = the true probability of error at the node

The probability that this standardised quantity exceeds z is equal to c, the confidence level.
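
Equation (2) as a small Python helper, using the z = 0.69 value quoted in the example below for the default 25% confidence level (the one-tailed normal quantile, scipy.stats.norm.ppf(0.75), gives ≈ 0.67):

```python
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on a node's error rate -- equation (2)."""
    return (f + z**2 / (2 * N)
            + z * sqrt(f / N - f**2 / N + z**2 / (4 * N**2))) / (1 + z**2 / N)

# A leaf that misclassifies 2 of its 6 training instances:
print(round(pessimistic_error(2 / 6, 6), 2))   # 0.47
```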

[Image: ExampleTree.png]

Example

In the tree above, at the root of the subtree being considered for pruning (health plan contribution), the figures are:

  • N = 14
  • z = 0.69 (this is the value with a standard c=25%)
  • f = 5/14 (≈ 0.36): of the 14 training instances at this node, 9 are classified correctly and 5 are errors
  • e = 0.46 (by the formula).

At the leaf nodes the figures are:

  • none: N = 6, f = 0.33, e = 0.47
  • half: N = 2, f = 0.5, e = 0.72
  • all: N = 6, f = 0.33, e = 0.47

We weight the leaf estimates by their N's before averaging, i.e. in the ratio 6 : 2 : 6 (equivalently 3 : 1 : 3): (3 × 0.47 + 1 × 0.72 + 3 × 0.47) / (3 + 1 + 3) ≈ 0.51, which is the combined error estimate for the subtree.

Since the parent's own estimate (0.46) is lower than the combined estimate of its leaves (0.51), the subtree makes things worse, and we prune it away.
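
The arithmetic above can be checked with the helper from the previous snippet (repeated so this one is self-contained):

```python
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    return (f + z**2 / (2 * N)
            + z * sqrt(f / N - f**2 / N + z**2 / (4 * N**2))) / (1 + z**2 / N)

# Leaves of the subtree as (N, f), in the order none / half / all.
leaves = [(6, 2 / 6), (2, 1 / 2), (6, 2 / 6)]
leaf_errors = [pessimistic_error(f, N) for N, f in leaves]
combined = sum(e * N for e, (N, _) in zip(leaf_errors, leaves)) / 14
parent = pessimistic_error(5 / 14, 14)

print([round(e, 2) for e in leaf_errors])   # [0.47, 0.72, 0.47]
print(round(combined, 2))                   # 0.51
print(round(parent, 2))                     # 0.45 here (the notes quote 0.46)
print(combined > parent)                    # True -> prune the subtree
```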

Rule Post-Pruning

Used in C4.5

Explanation

Walk the tree and AND together the tests along each root-to-leaf path to create IF ... THEN rules; the rule set as a whole ORs these paths together.

E.g. IF (outlook == sunny) && (humidity == high) THEN enjoysport = No.
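
A sketch of this conversion in Python, using an assumed nested-dict encoding of the tree (leaves are plain class labels); this is not C4.5's own representation:

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one rule: a conjunction (AND) of
    (attribute, value) tests with the leaf's class as the conclusion."""
    if not isinstance(tree, dict):                       # leaf: a class label
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["children"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attr"], value),))
    return rules

tree = {"attr": "outlook",
        "children": {"sunny": {"attr": "humidity",
                               "children": {"high": "No", "normal": "Yes"}},
                     "overcast": "Yes"}}
for conds, label in tree_to_rules(tree):
    print("IF " + " AND ".join(f"{a} == {v}" for a, v in conds) + f" THEN {label}")
# IF outlook == sunny AND humidity == high THEN No
# IF outlook == sunny AND humidity == normal THEN Yes
# IF outlook == overcast THEN Yes
```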

Pros

  • Simpler classifiers - deleting conditions simplifies/generalises the rules
  • People find it easier to use rules than trees

Cons

  • It doesn't scale well
  • It's slow for large trees/datasets

Algorithm

The goal is to remove rules that are not useful in terms of accuracy, by finding a subset of rules that maximises an MDL criterion. There is a tradeoff between accuracy and complexity.

  1. Convert the tree to a set of equivalent rules
  2. Prune each rule independently of others
  3. Sort the final rules into the desired sequence for use

Each rule is simplified greedily: we repeatedly drop the condition whose removal gives the lowest estimated error, as sketched below.
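
A sketch of that greedy simplification; estimate_error is a hypothetical callback standing in for whatever error estimate is in use (e.g. the pessimistic estimate of equation 2 applied to the instances the rule covers):

```python
def simplify_rule(conditions, label, estimate_error):
    """Greedily drop conditions while the estimated error does not get worse.
    `estimate_error(conditions, label)` is an assumed helper returning the
    error estimate for the rule IF conditions THEN label."""
    best = list(conditions)
    best_error = estimate_error(best, label)
    improved = True
    while improved and best:
        improved = False
        # Error of the rule with each single condition removed.
        candidates = [(estimate_error(best[:i] + best[i + 1:], label), i)
                      for i in range(len(best))]
        error, i = min(candidates)
        if error <= best_error:         # dropping this condition doesn't hurt
            del best[i]
            best_error = error
            improved = True
    return best, best_error
```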