Evaluating Hypotheses

In theory, there is no difference between theory and practice.

But, in practice, there is.

# Estimating Hypothesis Accuracy

To estimate hypothesis accuracy we need to answer the following questions:

- How well does a hypothesis generalize beyond the training set?
- We need to estimate the off-training-set error.
- What is the probable error in this estimate?
- If one hypothesis is more accurate than another on a data set, how probable is this difference in general?

# Two Definitions of Error

These error definitions are algorithm-independent: they measure a hypothesis *h*, even though ultimately we're trying to evaluate learning algorithms.

The difference between sample error and true error is where the data comes from: a finite sample *S*, or the entire distribution *D*?

The **sample error** of *h* with respect to target function *f* and sample *S* is the proportion of examples in *S* that *h* misclassifies.

h(x) = what the hypothesis predicts, f(x) = the correct answer.

$\delta(y)$ is 1 if *y* is true and 0 if *y* is false (i.e. it is 1 when $f(x) \neq h(x)$).
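Written as a formula, with *n* the number of examples in *S* and $\delta$ as defined above:

$$\mathrm{error}_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$$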

The **true error** of hypothesis *h* with respect to target function *f* and distribution *D* is the proportion of examples that *h* misclassifies, i.e. the probability that *h* will misclassify an instance drawn at random from *D*.
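Written as a probability, following the definition above:

$$\mathrm{error}_D(h) \equiv \Pr_{x \in D}\bigl[f(x) \neq h(x)\bigr]$$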

## The Question: How well does error_S(h) estimate error_D(h)?

To answer this, treat it as an experiment:

- Choose a sample *S* of size *n* according to the distribution *D*
- Measure error_S(h)

error_S(h) is then the result of an experiment, and an unbiased estimator for error_D(h). Given the observed error_S(h), what can we conclude about error_D(h)?

## Problems Estimating Error

### Errors in Error (Estimation Bias)

Note: this is a second, distinct definition of "bias" (see the comparison with inductive bias below).

Estimation bias: if *S* is the training set, error_S(h) is optimistically biased. The hypothesis has probably overfit *S*, so the measured error tends to be lower than the true error.

For an unbiased estimate, h and S must be chosen independently (S cannot be the training set).

As model complexity increases, accuracy (1 - error) on the training set becomes dramatically higher than on the test set. Up to a certain level of complexity the two stay close, but past that threshold they split dramatically, as the sketch below illustrates.
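A minimal sketch of this effect, assuming scikit-learn is available (the dataset and the depths are illustrative, not from the notes): training accuracy keeps climbing with tree depth while held-out accuracy flattens or drops.

```python
# Illustrative only: training accuracy vs. test accuracy as model
# complexity (decision-tree depth) grows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, 5, 10, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 2),   # accuracy on the training set
          round(tree.score(X_test, y_test), 2))     # accuracy on the held-out set
```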

### Variance

Variance: even with an *S* selected to give an unbiased estimate, error_S(h) may still vary from error_D(h).
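A minimal simulation sketch (the numbers are assumed, not from the notes): even when every test example is drawn independently of *h*, the measured error_S(h) scatters around error_D(h) from one sample to the next.

```python
# Illustrative only: repeated samples of size n give different error_S(h)
# values, all fluctuating around the assumed error_D(h).
import random

random.seed(0)
true_error = 0.30   # assumed error_D(h) for this simulation
n = 40              # sample size

estimates = []
for _ in range(5):
    # Each test example is misclassified with probability error_D(h).
    mistakes = sum(random.random() < true_error for _ in range(n))
    estimates.append(mistakes / n)

print(estimates)  # values scattered around 0.30
```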

Estimation bias is a numerical quantity [comes from statistics].

Inductive bias is a set of assertions [comes from concept learning].

### Calculating error_D(h) - Confidence Intervals

If *S* contains *n* examples, drawn independently of *h* and of each other, and *n* >= 30, then:

With approximately N% probability, error_D(h) lies in the interval:

$$\mathrm{error}_S(h) \pm z_N \sqrt{\frac{\mathrm{error}_S(h)\,(1 - \mathrm{error}_S(h))}{n}}$$

The $z_N$ values come from statistical tables (e.g. $z_N = 1.96$ for N = 95).

## Example

Hypothesis *h* misclassifies 12 of the 40 examples in *S*.

What is error_D(h)?

Given no other information, our best estimate is error_S(h) = 12/40 = 0.30.

But for repeated samples of 40 examples, we can use the confidence interval above to calculate that with approximately 95% probability error_D(h) lies within $.30 \pm .14$.
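A minimal sketch of this calculation in Python; the only value assumed beyond the example is $z_{95} = 1.96$ from the statistical tables mentioned above.

```python
# 95% confidence interval for error_D(h), given 12 mistakes on n = 40 examples.
from math import sqrt

n = 40
error_s = 12 / n                 # sample error = 0.30
z_95 = 1.96                      # z_N value for 95% confidence

margin = z_95 * sqrt(error_s * (1 - error_s) / n)
print(f"error_D(h) in {error_s:.2f} +/- {margin:.2f}")   # 0.30 +/- 0.14
```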

# Sign Test

In statistics we sometimes prefer to use non-parametric tests. These:

- Make few assumptions
- Are more robust
- May lose some information
- Used as a first step in comparisons, or as a complementary measure.
- e.g., use a sign test when comparing the runtimes of two learning algorithms, to sidestep the effect of large differences in magnitude

## Example: Comparing 2 Learning Algorithms with Sign Tests

Comparing error: run the two algorithms on *N* independent test sets, and let *X* be the number of test sets on which algorithm A has lower error than algorithm B.

We test *X* against a binomial distribution (the null hypothesis of no difference) with *p = 0.5*.
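A minimal sketch of this test, with made-up counts for illustration (the 12 vs. 3 split is not from the notes):

```python
# Sign test: under the null hypothesis that the two algorithms perform
# equally well, the number of test sets won by algorithm A follows a
# Binomial(N, 0.5) distribution (ties are dropped first).
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test p-value with null p = 0.5."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least as lopsided as the one observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Example: A beats B on 12 of 15 test sets.
print(sign_test_p_value(wins_a=12, wins_b=3))  # ~0.035, significant at the 5% level
```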