Evaluating Hypotheses

In theory, there is no difference between theory and practice.
But, in practice, there is.

# Estimating Hypothesis Accuracy

To estimate this we ask:

• How well does a hypothesis generalize beyond the training set? (We need to estimate the off-training-set error.)
• What is the probable error in this estimate?
• If one hypothesis is more accurate than another on a data set, how probable is this difference in general?

# Two Definitions of Error

These definitions are algorithm-independent: they apply to any hypothesis, regardless of which algorithm produced it, which is what lets us use them to compare algorithms.

The difference between sample error and true error is where the data comes from: the sample S or the entire instance space D.

The sample error of h with respect to target function f and sample data S is the proportion of examples in S that h misclassifies.
Here h(x) is what the hypothesis/algorithm predicts and f(x) is the correct answer.

(1)
\begin{align} error_S(h) = \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x)) \end{align}

$\delta(y)$ is 1 if y is true and 0 if y is false (i.e. 1 if $f(x) \neq h(x)$, 0 otherwise).
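Equation (1) is a direct count. As a minimal sketch (the concept `f` and hypothesis `h` below are toy examples, not from the notes):

```python
def sample_error(h, f, S):
    """Fraction of examples in S that hypothesis h misclassifies (Eq. 1)."""
    return sum(1 for x in S if h(x) != f(x)) / len(S)

# Toy target and hypothesis, purely for illustration:
f = lambda x: x >= 5   # true concept: positive for x >= 5
h = lambda x: x >= 6   # learned hypothesis: misclassifies only x == 5
S = list(range(10))
print(sample_error(h, f, S))  # 1 of 10 examples misclassified -> 0.1
```

The sum counts each example where the prediction disagrees with the target, exactly the role of $\delta$ in Eq. (1).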

The true error of hypothesis h with respect to target function f and distribution D is the proportion of the whole instance space that h misclassifies, i.e. the probability that h will misclassify an instance drawn at random according to D.

(2)
\begin{align} error_D(h) = Pr_{x \in D}[f(x) \neq h(x)] \end{align}

## The Question: How well does error_S(h) estimate error_D(h)?

To answer this, frame it as an experiment:

1. Choose a sample S of size n according to the distribution D
2. Measure error_S(h)

error_S(h) is the outcome of this experiment and, when S is drawn independently of h, an unbiased estimator of error_D(h). Given an observed error_S(h), what can we conclude about error_D(h)?
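The experiment above can be simulated to see the "unbiased estimator" claim in action. In this sketch (the uniform distribution, threshold concepts, and all constants are assumptions for illustration), the true error is known to be 0.1, and the average of many repeated error_S(h) measurements converges to it:

```python
import random

random.seed(0)

f = lambda x: x >= 0.5   # target concept on D = Uniform[0, 1)
h = lambda x: x >= 0.6   # hypothesis; error_D(h) = P(0.5 <= x < 0.6) = 0.1

def one_experiment(n):
    S = [random.random() for _ in range(n)]        # 1. draw S of size n from D
    return sum(1 for x in S if h(x) != f(x)) / n   # 2. measure error_S(h)

# Averaging over many repeated experiments approaches error_D(h) = 0.1
estimates = [one_experiment(40) for _ in range(2000)]
print(sum(estimates) / len(estimates))
```

Any single run of `one_experiment(40)` can land well away from 0.1 (that is the variance discussed below), but the mean across runs does not systematically over- or under-shoot.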

## Problems Estimating Error

### Errors in Error (Estimation Bias)

This is a second definition of bias, distinct from inductive bias.
Estimation bias: if S is the training set, error_S(h) is optimistically biased. The hypothesis was fit to S (and has probably overfit it), so the measured error is lower than the true error.

(3)
\begin{equation} bias = E[error_S(h)] - error_D(h) \end{equation}

For an unbiased estimate, h and S must be chosen independently (S cannot be the training set).

As hypothesis complexity increases, the accuracy (1 - error) on the training set climbs dramatically above the accuracy on the test set. Up to a certain complexity the two stay close, but past that threshold they diverge sharply.

### Variance

Variance: even with an S selected to give an unbiased estimate, error_S(h)
may still vary from error_D(h).

Estimation bias is a numerical quantity [comes from statistics].
Inductive bias is a set of assertions [comes from concept learning].

### Estimating error_D(h): Confidence Intervals

If S contains n examples drawn independently of h, and n >= 30, then:

With N% probability, error_D(h) lies in the interval:

(4)
\begin{align} error_S(h) \pm z_N\sqrt{\frac{error_S(h)(1- error_S(h))}{n}} \end{align}

The z_N values come from standard statistical tables (e.g. z_95 = 1.96 for a 95% interval).

## Example

Hypothesis h misclassifies 12 of the 40 examples in S.

(5)
\begin{align} error_S(h) = \frac{12}{40} = .30 \end{align}

error_D(h):
Given no other information, our best estimate is 0.30 (the value of error_S(h)).

But treating S as one of many possible samples of 40 examples, the confidence interval (4) tells us that with 95% probability error_D(h) lies within $.30 \pm .14$.
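Plugging the example's numbers into interval (4), with z_95 = 1.96, reproduces the quoted figure:

```python
from math import sqrt

n, errors = 40, 12
e = errors / n                           # error_S(h) = 0.30
z = 1.96                                 # z_N for a 95% interval
half_width = z * sqrt(e * (1 - e) / n)   # the +/- term in Eq. (4)
print(f"{e:.2f} +/- {half_width:.2f}")   # 0.30 +/- 0.14
```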

# Sign Test

In statistics we sometimes prefer to use non-parametric tests. These:

• Make few assumptions
• Are more robust
• May lose some information
• Used as a first step in comparisons, or as a complementary measure.
• e.g., use a sign test when comparing runtime of two learning algorithms to bypass effects of large magnitude differences

## Example: Comparing 2 Learning Algorithms with Sign Tests

Comparing error:

(6)
\begin{align} e_i^L = \text{error of algorithm } L \in \{1,2\} \text{ on test set } i \in \{1, \ldots, N\} \end{align}

I.e. there are two Algorithms and N test sets.

(7)
\begin{align} x_i = \left\{ \begin{array}{lr} 1 & : e^1_i < e^2_i \\ 0 & : \text{otherwise} \end{array} \right. \end{align}
(8)
\begin{align} X = \sum^N_{i = 1} x_i \end{align}

We test X against a binomial distribution with p = 0.5 (the null hypothesis that neither algorithm is better).
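A sketch of that test, assuming a two-sided exact binomial tail (the function name and the 9-of-10 example are hypothetical, not from the notes):

```python
from math import comb

def sign_test_p(wins, n):
    """Two-sided sign test: probability of a result at least as extreme
    as `wins` out of `n` under X ~ Binomial(n, 0.5)."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical comparison: algorithm 1 beats algorithm 2 on 9 of 10 test sets
print(sign_test_p(9, 10))  # approx. 0.021 -> significant at the 5% level
```

Note that only the sign of each per-test-set difference enters the test, which is exactly why it bypasses large magnitude differences (e.g. in runtime comparisons).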

page revision: 5, last edited: 25 Apr 2012 06:14