Instance Based Learning (AKA Rote Learning)
  • outline the key ideas in locally-weighted regression
  • describe the framework of case-based reasoning
  • contrast lazy versus eager learning

Intro to Rote Learning or Instance-Based Learning

Rote learning is the simplest form of learning - a new (query) instance is matched against the most similar instances already in the training set and given their classification.

We hence say that its inductive bias is the assumption that the classification of an instance will be the same as other examples close by, according to the distance function.

It's lazy learning - it never computes a hypothesis function, only classifies each query as it goes.

Distance Function

To compare the similarity of two instances we work out the distance between them.

There are generalised distance functions that can be applied to both discrete and continuous attributes.

There are a couple of different cases:

  • Discrete attributes: Distance is 1 if they're different, 0 if they're equal
  • One numeric (continuous) attribute: Distance is the difference between the two attribute values involved
  • Several numeric attributes: Euclidean distance is normally used with normalized attributes

Distance between two instances (vectors of attribute-values) would then be:

(1)
\begin{align} d(x_i, x_j) = \sqrt{\sum^m_{r=1} (a_r(x_i) - a_r(x_j))^2} \end{align}
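A minimal sketch (not part of the notes) of such a generalised distance function, assuming instances are equal-length sequences of attribute values and that numeric attributes have already been normalized:

```python
import math

def attribute_distance(a, b):
    """Distance between two attribute values: 0/1 for discrete values,
    absolute difference for numeric (assumed already normalized) ones."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)              # numeric: difference of the values
    return 0.0 if a == b else 1.0      # discrete: equal -> 0, different -> 1

def euclidean_distance(x_i, x_j):
    """Equation (1): Euclidean distance over all attributes of two instances."""
    return math.sqrt(sum(attribute_distance(a, b) ** 2
                         for a, b in zip(x_i, x_j)))

# e.g. euclidean_distance([0.3, "red", 0.5], [0.8, "blue", 0.5])
```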

We can also use other distance functions, like the Manhattan distance.

We also need to take into account the relative importance of each attribute, and potentially weight them.

Normalization

We normalize attributes so that each one contributes on the same $[0, 1]$ scale to the distance; otherwise attributes with large numeric ranges would dominate the calculation.

(2)
\begin{align} a_r \leftarrow \frac{v_r - \min v_r}{\max v_r - \min v_r} \end{align}

With missing values it's common policy to assume they're maximally distant.
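A sketch of equation (2) and the missing-value policy, assuming missing values are represented as None and numeric attributes are scaled to [0, 1]; the "maximally distant" rule below follows the common convention rather than a formula given in the notes:

```python
def min_max_normalize(column):
    """Rescale a list of numeric values to [0, 1] as in equation (2),
    leaving missing values (None) untouched."""
    present = [v for v in column if v is not None]
    lo, hi = min(present), max(present)
    span = (hi - lo) or 1.0            # constant attribute: avoid dividing by zero
    return [None if v is None else (v - lo) / span for v in column]

def normalized_attribute_distance(a, b):
    """Distance on a [0, 1]-scaled attribute, treating missing values
    as maximally distant."""
    if a is None and b is None:
        return 1.0
    if a is None or b is None:
        v = a if b is None else b
        return max(v, 1.0 - v)         # the furthest the missing value could be
    return abs(a - b)
```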

Nearest Neighbour Algorithm

The nearest-neighbour algorithm locates the training example most similar to the query and estimates the query's classification from it.

(3)
\begin{align} \text{find the nearest neighbour } x_n \text{ of } x_q \text{, then } \hat{f}(x_q) \leftarrow f(x_n) \end{align}

Here $\hat{f}(x)$ denotes our estimated hypothesis function.
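A minimal sketch of this, assuming training data stored as (instance, label) pairs and a distance function such as the Euclidean one sketched earlier:

```python
def nearest_neighbour_classify(x_q, training_data, distance):
    """Find the training instance x_n closest to the query x_q and
    return its label, i.e. the estimate of f(x_q) is f(x_n)."""
    _, label = min(training_data, key=lambda pair: distance(pair[0], x_q))
    return label
```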

k-Nearest Neighbours

The k-Nearest neighbour takes input from the closest k instances - if the target is discrete-valued it takes a vote, if it's real-valued it takes a mean.

(4)
\begin{align} \hat{f}(x_q) \leftarrow \frac{\sum^k_{i=1} f(x_i)}{k} \end{align}

For a discrete-valued target, each of the k closest instances casts a vote for its own value, and the query is assigned the value with the most votes:

(5)
\begin{align} \hat{f}(x_q) \leftarrow \text{argmax}_{v \in V} \sum^k_{i=1} \delta(v, f(x_i)) \end{align}

$\delta(a, b) = 1$ if $a = b$, else $0$.
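A sketch of both cases (the names here are illustrative, and `distance` is assumed to be a function like the one sketched earlier):

```python
from collections import Counter

def k_nearest(x_q, training_data, distance, k):
    """The k (instance, target) pairs closest to the query x_q."""
    return sorted(training_data, key=lambda pair: distance(pair[0], x_q))[:k]

def knn_classify(x_q, training_data, distance, k):
    """Discrete-valued target: majority vote, equation (5)."""
    votes = Counter(label for _, label in k_nearest(x_q, training_data, distance, k))
    return votes.most_common(1)[0][0]

def knn_regress(x_q, training_data, distance, k):
    """Real-valued target: mean of the k nearest values, equation (4)."""
    neighbours = k_nearest(x_q, training_data, distance, k)
    return sum(value for _, value in neighbours) / len(neighbours)
```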

Hypothesis Space for Nearest Neighbour

Consider k-Nearest Neighbour in the case where the instances are points in a two-dimensional space and the target function is boolean-valued.

[Figure: decision regions induced by the nearest-neighbour classifier over a two-dimensional training set.]

When to consider Nearest Neighbour

  • When instances map to points in $\mathbb{R}^n$
  • There are fewer than 20 attributes per instance (that matter)
  • There's lots of training data
  • There's no requirement for an "explanatory" model to be learned

Advantages of Nearest Neighbour

  • Statisticians have been using it since the early 1950s, so there's a lot of experience with it
  • It can actually be very accurate (as the number of training examples $n \rightarrow \infty$ and $k$ grows with $k/n \rightarrow 0$, the error rate approaches the minimum achievable, i.e. Bayes-optimal, error)
  • Training is very fast (just add all the instances, that's it)
  • It can learn quite complex target functions
  • As it makes no generalisations it doesn't lose any information

Disadvantages of Nearest Neighbour

  • It's slow to query (it has to scan all the training data)
  • Assumes all attributes are equally important, unless fixed with weighting or attribute selection
  • It doesn't deal well with noisy instances. We can remove them from the dataset, but only if we know which they are.

Terminology from Stat Pattern Recognition

  • Regression = approximating a real-valued function
  • Residual = the error in approximating the target function: $\hat{f}(x) - f(x)$
  • Kernel Function = the function of the distance used to determine the weight of each training example. i.e. the function K where $w_i = K(d(x_i, x_q))$

Distance-Weighted k-Nearest Neighbour

If we're weighting neighbours we need a function to figure out the weight:

(6)
\begin{align} w_i = \frac{1}{d(x_q, x_i)^2} \end{align}

The classification algorithm then becomes:

(7)
\begin{align} \hat{f}(x_q) \leftarrow \text{argmax}_{v \in V} \sum^k_{i=1} w_i \, \delta(v, f(x_i)) \end{align}

Or for real-valued target functions:

(8)
\begin{align} \hat{f}(x_q) \leftarrow \frac{\sum^k_{i=1} w_i f(x_i)}{\sum^k_{i=1} w_i} \end{align}

The denominator normalizes the contributions so that the weights sum to one.

With these weights we can consider using all the training examples rather than just the nearest k (i.e. set k = n). This is called Shepard's method.
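A sketch of distance-weighted k-NN following equations (6)-(8); the small epsilon guarding against a zero distance is an implementation assumption (the usual convention is simply to return f(x_i) if the query coincides with a training instance):

```python
from collections import defaultdict

def inverse_square_weight(d, eps=1e-12):
    """Equation (6): w_i = 1 / d(x_q, x_i)^2, with eps guarding d = 0."""
    return 1.0 / (d * d + eps)

def weighted_knn_classify(x_q, training_data, distance, k):
    """Equation (7): distance-weighted vote among the k nearest neighbours."""
    neighbours = sorted(training_data, key=lambda p: distance(p[0], x_q))[:k]
    scores = defaultdict(float)
    for x, label in neighbours:
        scores[label] += inverse_square_weight(distance(x, x_q))
    return max(scores, key=scores.get)

def weighted_knn_regress(x_q, training_data, distance, k):
    """Equation (8): weighted mean of the k nearest values.
    With k = len(training_data) this is Shepard's method."""
    neighbours = sorted(training_data, key=lambda p: distance(p[0], x_q))[:k]
    weights = [inverse_square_weight(distance(x, x_q)) for x, _ in neighbours]
    return sum(w * v for w, (_, v) in zip(weights, neighbours)) / sum(weights)
```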

Evaluation

Lazy learners do not construct an explicit model, which makes it hard to evaluate the output of the learning process.
With 1-Nearest Neighbour the training set error is always zero, since every training instance is its own nearest neighbour. With k-Nearest Neighbour, overfitting may be hard to detect.

We can use LOOCV (leave-one-out cross-validation): leave out one example, predict it from the rest, and repeat for every example. The error is the mean over all the predictions.
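A sketch of LOOCV used to pick k, assuming a predictor with the signature of the knn_classify sketch above:

```python
def loocv_error(training_data, distance, k, predict):
    """Leave each example out in turn, predict it from the rest, and
    return the fraction of wrong predictions."""
    errors = 0
    for i, (x, label) in enumerate(training_data):
        rest = training_data[:i] + training_data[i + 1:]
        if predict(x, rest, distance, k) != label:
            errors += 1
    return errors / len(training_data)

# e.g. choose k by minimising the LOOCV error:
# best_k = min(range(1, 21), key=lambda k: loocv_error(data, dist, k, knn_classify))
```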

Curse of Dimensionality

The 'curse of dimensionality' is the problem that it's easy to miss the nearest neighbour when there are lots of attributes (dimensions) - a lot of the attributes are often irrelevant.

One approach to handling this is to stretch each axis (attribute) by a weight $z_j$, chosen by cross-validation to minimise prediction error. Setting $z_j$ (the weight of attribute $j$) to 0 eliminates that dimension/attribute altogether.

Methods

  • Attribute Weighting:

Attributes can be weighted separately for each class (which can result in instances being left unclassified, or being classified into more than one class).

  • Get Euclidean Distance with Weights:

$\sqrt{\sum_r w_r (a_r(x_q) - a_r(x))^2}$

  • Update weights based on the nearest neighbour (a sketch follows this list):

If the predicted class is correct the weight is increased; if incorrect, it is decreased.
If $|a_r(x_q) - a_r(x)|$ is small the change is large; if it is large, the change is small.
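A rough sketch of the weighted distance and the nearest-neighbour weight update; the step-size schedule is an illustrative assumption (the notes only say the change should be larger when the attribute values nearly agree), and attributes are assumed normalized to [0, 1]:

```python
import math

def weighted_euclidean_distance(x_q, x, weights):
    """sqrt(sum_r w_r * (a_r(x_q) - a_r(x))^2) over numeric attributes."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_q, x)))

def update_weights(weights, x_q, x_n, correct, rate=0.1):
    """After comparing the query x_q with its nearest neighbour x_n:
    increase weights if the prediction was correct, decrease them if not,
    with a larger change where |a_r(x_q) - a_r(x_n)| is small."""
    sign = 1.0 if correct else -1.0
    return [max(0.0, w + sign * rate * (1.0 - abs(a - b)))
            for w, a, b in zip(weights, x_q, x_n)]
```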

Locally Weighted Regression

The nearest-neighbour approaches described in the previous section can be thought of as approximating the target function $f(x)$ at the single query point $x = x_q$. Locally weighted regression is a generalisation of this approach.

Remembering that regression means approximating a real-valued function, locally weighted regression uses the k nearest neighbours of each query point to form a local approximation to $f$, using a linear function of the form:

(9)
\begin{align} \hat{f}(x) = w_0 + w_1 a_1(x) + w_2 a_2(x) + \dots + w_n a_n(x) \end{align}

Explanation of Name

The phrase "locally weighted regression" is called

  • local because the function is approximated based only on data near the query point
  • weighted because the contribution of each training example is weighted by its distance from the query point
  • regression because this is the term used widely in the statistical learning community for the problem of approximating real-valued functions.

Key Ideas

  • Fit linear/quadratic/polynomial function to the k nearest neighbours
  • Produces a piecewise approximation to f.

In statistics it's known as LO(W)ESS.

When we're going from a global approximation to a local approximation we can choose several different errors to minimise:

  1. Squared Error over the k Nearest Neighbours
(10)
\begin{align} E_1(x_q) = \frac{1}{2} \sum_{x \in \text{k-NN of } x_q} (f(x) - \hat{f}(x))^2 \end{align}
  2. Distance-Weighted Squared Error over the entire training set D
(11)
\begin{align} E_2(x_q) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 \, K(d(x_q,x)) \end{align}

where K is the kernel function.

  3. Combination of 1 and 2
(12)
\begin{align} E_3(x_q) = \frac{1}{2} \sum_{x \in \text{k-NN of } x_q} (f(x) - \hat{f}(x))^2 \, K(d(x_q,x)) \end{align}

3 gives the local training rule:

(13)
\begin{align} \Delta w_r = \eta \sum_{x \in \text{k-NN of } x_q} K(d(x_q, x)) \, (f(x) - \hat{f}(x)) \, a_r(x) \end{align}

In words: the weight change for attribute $r$ is the sum, over the k nearest neighbours, of the kernel weight $K(d(x_q, x))$ multiplied by the prediction error $(f(x) - \hat{f}(x))$ and by the attribute value $a_r(x)$, scaled by the learning rate $\eta$.

So the training rule has the same shape as the error criterion: the $\frac{1}{2}$ is replaced with the learning rate $\eta$ and one factor of $(f(x) - \hat{f}(x))$ is replaced with $a_r(x)$, exactly as in the global gradient-descent rule.
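A sketch of locally weighted regression using the training rule of equation (13); the Gaussian kernel, learning rate and epoch count are assumptions (the notes leave the kernel and the optimisation schedule open):

```python
import math

def gaussian_kernel(d, width=1.0):
    """One possible kernel K(d); any decreasing function of distance works."""
    return math.exp(-(d * d) / (2.0 * width * width))

def locally_weighted_fit(x_q, neighbours, distance, eta=0.01, epochs=200):
    """Fit the local linear model w_0 + sum_r w_r * a_r(x) to the k nearest
    neighbours of x_q with the batch update of equation (13). `neighbours`
    is a list of (instance, target) pairs; instances are sequences of
    (already normalized) numeric attribute values."""
    n = len(neighbours[0][0])
    w = [0.0] * (n + 1)                         # w[0] is the intercept w_0

    def predict(x):
        return w[0] + sum(wr * ar for wr, ar in zip(w[1:], x))

    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, fx in neighbours:
            kd = gaussian_kernel(distance(x_q, x))
            err = fx - predict(x)
            delta[0] += eta * kd * err          # a_0(x) = 1 for the intercept
            for r in range(n):
                delta[r + 1] += eta * kd * err * x[r]
        w = [wi + di for wi, di in zip(w, delta)]   # apply equation (13)

    return predict        # local approximation, reliable only near x_q
```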

Edited NN

Edited NN classifiers discard some of the training instances before making predictions:

  • Saves memory
  • Speeds up classification

Instance-Based Learning 2 (IB2) is an incremental NN learner that only incorporates misclassified instances into the classifier; the drawback is that noisy instances tend to be misclassified and so end up incorporated.
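A sketch of the IB2 idea (names are illustrative), using 1-NN over the instances kept so far:

```python
def ib2(training_stream, distance):
    """Keep only the training instances the current classifier gets wrong.
    Correctly classified instances are discarded (saving memory and query
    time), but noisy instances tend to be misclassified and so get kept."""
    kept = []
    for x, label in training_stream:
        if kept:
            _, predicted = min(kept, key=lambda pair: distance(pair[0], x))
            if predicted == label:
                continue                 # already classified correctly: skip
        kept.append((x, label))          # first instance or misclassified
    return kept
```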

Dealing with Noise

Using a larger k we can deal with noise somewhat better (more samples = more robust).

We can use cross-validation to tune this, but it's slow. So how about IB3:

IB3 stores classification performance information within each instance and only uses the instance in prediction if its performance is above a threshold.

  • Computes a confidence interval for the instance's success rate and for the accuracy of its default class.
  • If the lower limit of the first interval is above the upper limit of the second, accept the instance.
  • If the upper limit of the first interval is below the lower limit of the second, reject it.
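The acceptance test can be sketched roughly as follows; the confidence-interval formula (a simple normal approximation) and the z value are assumptions, since the notes don't specify which interval IB3 uses:

```python
import math

def confidence_interval(successes, trials, z=1.28):
    """Rough normal-approximation interval for a success rate
    (z = 1.28, roughly an 80% two-sided interval, is an illustrative choice)."""
    if trials == 0:
        return 0.0, 1.0
    p = successes / trials
    margin = z * math.sqrt(p * (1.0 - p) / trials)
    return max(0.0, p - margin), min(1.0, p + margin)

def ib3_status(instance_record, class_record):
    """'accept' if the instance's interval lies wholly above the default
    class accuracy's interval, 'reject' if wholly below, else undecided.
    Each record is a (successes, trials) pair."""
    inst_lo, inst_hi = confidence_interval(*instance_record)
    cls_lo, cls_hi = confidence_interval(*class_record)
    if inst_lo > cls_hi:
        return "accept"
    if inst_hi < cls_lo:
        return "reject"
    return "undecided"
```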