Linear Regression In ML

Regression - assigning weights to different attributes, summing them up and determining a numeric value to ascribe to the instance.

Problem of Linear Regression

Linear Regression as a term means both the 'hypothesis representation' (the formula below) and the 'learning method' (the way of learning the weights below).

LR assumes that the expected output value of an input, E[y|x], is linear (i.e. Out(x) = wx), and that given the data we can estimate w.

The problem here is that regardless of whether the data exhibits linear or non-linear tendencies, we'll find a line of best fit (the one with the least mean squared difference, as shown below).

Representation of Linear Regression Model

The key idea behind linear regression is to calculate the weights from the training data to come up with a predicted value for each instance.

E.g. for the first training instance $x^{(1)}$ the predicted value is:

(1)
\begin{align} w_0x_0^{(1)} + w_1x_1^{(1)} + ... + w_nx_n^{(1)} = \sum_{i = 0}^{n}w_ix_i^{(1)} \end{align}
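
As a small sketch (the values here are made up, not from the notes), the predicted value for one instance is just a dot product between the weight vector and the attribute vector, with x_0 typically fixed to 1 so that w_0 acts as an intercept:

```python
import numpy as np

# Hypothetical example values; x_0 is fixed to 1 so that w_0 acts as the intercept.
w = np.array([0.5, 2.0, -1.0])   # weights w_0, w_1, w_2
x = np.array([1.0, 3.0, 4.0])    # attributes x_0 (bias), x_1, x_2 of one instance

prediction = np.dot(w, x)        # sum_{i=0}^{n} w_i * x_i
print(prediction)                # 0.5 + 6.0 - 4.0 = 2.5
```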

Least Mean Squares Approach to Fitting Linear Regression

The difference between the predicted and actual values is the error, and the aim is to pick weights that minimise the sum of the squared error over all the training data.

Squared Error:

(2)
\begin{align} \sum_{k=1}^m \left( y^{(k)} - \sum_{i=0}^nw_ix_i^{(k)} \right) ^2 \end{align}

(i.e. the sum, over all training instances, of the squared difference between the actual and predicted outcomes).
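
As a minimal illustrative sketch (data and candidate weights are made up), the squared-error objective can be computed directly from the predicted and actual outputs:

```python
import numpy as np

# Hypothetical training data: m instances, with the first column as the bias attribute x_0 = 1.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([4.1, 6.2, 9.8])    # actual outcomes y^(k)
w = np.array([0.0, 2.0])         # candidate weights w_0, w_1

predictions = X @ w              # sum_i w_i x_i^(k) for every instance k
squared_error = np.sum((y - predictions) ** 2)
print(squared_error)
```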

Deriving Co-Efficients (weights)

We can use standard matrix operations to derive the co-efficients, assuming there are more instances than attributes.
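
One common way to do this (a sketch using the standard normal-equation / least-squares solution, not anything specific to these notes) is to solve X^T X w = X^T y for w:

```python
import numpy as np

# Hypothetical data: 4 instances, 2 attributes plus a bias column of ones.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 1.0]])
y = np.array([5.0, 6.5, 11.0, 14.2])

# Solve the normal equations (X^T X) w = X^T y; lstsq is numerically safer than
# explicitly inverting X^T X.
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w)
```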

Ordinary Least Squares (OLS) Regression

OLS Regression is a form of linear regression that minimises the sum of the squared (vertical) distances between each data point and the estimated regression line.

Regression for Classification

Any regression technique can be used for classification:

  • On training: perform a regression for each class, with the output set to 1 for the training instances that belong to the class, and 0 for those that don't (hence generating a set of weights per class).
  • For prediction: Calculate the 'membership' to each class (based on the weights), and predict the class which gives us the highest value.

This is known as multi-response linear regression, and it approximates Boolean class membership with a value between 0 and 1.
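
A minimal sketch of this scheme (the data and helper names are my own, not from the notes): fit one least-squares regression per class against a 0/1 membership target, then predict the class whose regression output is highest:

```python
import numpy as np

def fit_multi_response(X, labels, classes):
    """Fit one least-squares weight vector per class against 0/1 membership targets."""
    weights = {}
    for c in classes:
        target = (labels == c).astype(float)              # 1 for members, 0 otherwise
        w, *_ = np.linalg.lstsq(X, target, rcond=None)
        weights[c] = w
    return weights

def predict(X, weights):
    """Predict the class whose regression output ('membership') is highest."""
    classes = list(weights)
    scores = np.column_stack([X @ weights[c] for c in classes])
    return [classes[i] for i in np.argmax(scores, axis=1)]

# Hypothetical two-class data with a bias column of ones.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]])
labels = np.array(["a", "a", "b", "b"])

weights = fit_multi_response(X, labels, classes=["a", "b"])
print(predict(X, weights))
```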

Logistic Regression

The problem with applying linear regression to classification problems is that it violates the assumption that the errors are statistically independent and normally distributed (the violation being that the observations only ever take the values 0 or 1).

The alternative is logistic regression: the target variable (the probability of the instance being in the class) is transformed, and the weights are estimated using the log-likelihood method.

Transformed target variable:

(3)
\begin{align} \log\left(\frac{Pr[1|a_1,a_2, ... , a_k]}{1-Pr[1|a_1,a_2, ... , a_k]}\right) = w_0x_0 + w_1x_1 + ... + w_nx_n \end{align}

The decision boundary for logistic regression is where P = 0.5 (i.e. where the weighted sum $w_0x_0 + w_1x_1 + ... + w_nx_n$ equals 0).

Because this is a linear equality in the attribute values the boundary is a plane (or hyperplane) in the instance space.
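
A small sketch of what the transformed model means in practice (the weights here are made up): the weighted sum gives the log-odds, the logistic (sigmoid) function turns it back into a probability, and the P = 0.5 boundary is exactly where the weighted sum crosses 0:

```python
import numpy as np

def predict_proba(w, x):
    """Logistic regression probability: sigmoid of the weighted sum (the log-odds)."""
    log_odds = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical weights; x_0 is fixed to 1 as the bias attribute.
w = np.array([-3.0, 1.0])

print(predict_proba(w, np.array([1.0, 3.0])))   # weighted sum = 0  -> probability 0.5 (on the boundary)
print(predict_proba(w, np.array([1.0, 5.0])))   # weighted sum > 0  -> probability > 0.5
print(predict_proba(w, np.array([1.0, 1.0])))   # weighted sum < 0  -> probability < 0.5
```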

Multi-Response Linear Regression

Speaking of hyperplanes, multi-response linear regression defines a separating hyperplane for each class, and the predicted class will be the one for which the following is true:

(4)
\begin{equation} (w_0^{(j)} - w_0^{(k)}) + (w_1^{(j)} - w_1^{(k)})x_1 + ... + (w_m^{(j)} - w_m^{(k)})x_m > 0 \end{equation}

I.e. the class whose weighted sum is largest.

However there are sets of points that cannot be separated by a single hyperplane, which means they cannot be correctly classified with logistic regression, or multi-response linear regression.

Robust Regression

Statistical methods that address the problem of outliers are called robust. There are a few possible ways of making regression more robust:

  • Minimise absolute error, rather than squared error
  • Remove the outliers (the top 10% of points furthest from the regression plane)
  • Minimise the median of squared error, rather than the mean squared error (copes with outliers in both directions); a minimal sketch follows this list
    • Find the narrowest strip covering half the observations (thickness measured vertically)
    • The least median of squares line lies in the centre of this band.
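
A minimal sketch of the least-median-of-squares idea for a single attribute (a crude random search over candidate lines, not an efficient algorithm): score each candidate by the median of its squared residuals instead of the mean, so a few extreme outliers cannot dominate the fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a clean linear trend plus a couple of gross outliers.
x = np.arange(20.0)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, size=20)
y[[3, 15]] = [60.0, -40.0]                       # outliers

def median_of_squares(slope, intercept):
    residuals = y - (slope * x + intercept)
    return np.median(residuals ** 2)

# Crude search: try many random (slope, intercept) pairs and keep the best one.
candidates = rng.uniform(low=[-5.0, -10.0], high=[5.0, 10.0], size=(5000, 2))
best = min(candidates, key=lambda c: median_of_squares(c[0], c[1]))
print(best)   # should land close to slope 2, intercept 1 despite the outliers
```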

Evaluating Numeric Prediction

The most popular way of evaluating numeric prediction is the mean squared error, but we can also use comparisons with independent test sets, cross-validation, or a number of other methods.

Other Possible Numeric Error Measures

Root-Mean Squared Error

(5)
\begin{align} \sqrt{\frac{(p_1 - a_1)^2 + ... + (p_n - a_n)^2}{n}} \end{align}

Mean Absolute Error

The mean absolute error is less sensitive to outliers than the mean squared error.

(6)
\begin{align} \frac{|p_1 - a_1| + ... + |p_n - a_n|}{n} \end{align}

Relative Error

Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).
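
A short sketch computing these measures for some made-up predicted (p) and actual (a) values; the relative error here is taken with respect to the actual values:

```python
import numpy as np

# Hypothetical predicted and actual values.
p = np.array([550.0, 120.0, 80.0])
a = np.array([500.0, 100.0, 90.0])

rmse = np.sqrt(np.mean((p - a) ** 2))            # root-mean-squared error
mae = np.mean(np.abs(p - a))                     # mean absolute error
relative = np.mean(np.abs(p - a) / np.abs(a))    # mean relative error, e.g. 50/500 = 10%

print(rmse, mae, relative)
```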