# Problem of Non-Linear Regression

In Non-Linear regression the target function may have a non-linear dependency between input and output, e.g., the data is not linearly separable.

# Neural Networks

Both of these are on the Artificial Neural Networks page:

- Neural Networks (in terms of Non-Linear Regression)
- Back-Propagation with Sigmoid Function and Hidden Layers

# Regression and Model Tree Approaches for Non-Linear Regression

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees.

## Regression Trees

Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patientâ€™s length of stay in a hospital).

### Differences with Regression and Decision Tres

- Splitting criteria is based on minimising variation within subsets, rather than based on correctly classifying an instance
- Pruning criteria is based just on numeric error measure, rather than either minimising error or getting rid of statistically insignificant branches
- Easier to interpret
- Can evaluate piecewise constant functions

## Model Trees (and M5 Prime)

M5prime is a learner that does so by model trees.

Model trees are like regression trees, but each node has a linear regression function as the decision.

Only a subset of the attributes are used for the linear regression function (attributes occurring in the subtree, and maybe attributes occurring in the path to the root).

They're fast; the overhead for linear regression is not large because of the small subset of attributes used.

### Smoothing

tl;dr smoothing is using the linear regression model to work out the weights and then sticking them in the leaf nodes of model trees.

Smoothing is a naive prediction method, it takes the output value of the linear regression model and uses that at the corresponding leaf node.

We can improve performance by *smoothing* predictions with internal Linear Regression models (so the predicted value becomes the weighted average of all the linear regression models along the path from root to leaf.

This same effect can be achieved by incorporating the internal nodes into the leaf nodes.

#### Formula

(1)- p' = prediction passed up to the parent node
- p = prediction passed up from the the node below
- q = value predicted by the model at this node
- n = number of instances that reach the node below
- k = smoothing constant

### Building the Tree

#### Splitting Criteria

We use the *standard deviation reduction* formula as our splitting criterion (as opposed to information gain in decision trees).

where T[1], T[2], etc are the sets from splits of data at the node.

#### Termination Criteria

Our termination criteria (for building the tree) is:

- when the standard deviation becomes smaller than a certain fraction of the standard deviation for the full training set (e.g. 5%)
- too few instances left to bother building new nodes for them (e.g. < 4)

### Pruning the Tree

Pruning is based on minimising the estimated absolute error of the LR models.

Heuristic estimate:

(3)(n = number of training instances that reach the node, v is the number of paramaters in the linear model).

- Linear Regression models are pruned by greedily removing terms to minimise the estimated error.
- Model trees are allowed for heavy pruning - often a single LR model can replace a whole subtree (like in decision tree postpruning).
- Pruning happens from the bottom up; the error for the LR model at an internal node is compared to the error for its subtree - if the subtree's is greater than we ditch the subtree.

### subTree Error

The error of a subtree is a recursive function that works like:

if (node is a leaf) { return it's error. } else { return (numLeftInstances * subTreeError(left)) + (numRightInstances*subTreeError(right)) / (numLeftInstances + numRightInstances); }

So it basically finds the error of a left leafNode, multiplies it by how many instances there are in it, and adds it to the same number for the rightNode. Then divides it by how many instances there are in the two total, at each level adding the result for the leftNode*leftInstances to the rightNode*rightInstances and dividing by total instances of the node. It stops when it hits the current node.

### Discrete/Nominal Attributes

Nominal attributes (sunny/rainy vs 0.6,0.7,0.8 etc) are converted to binary attributes and treated as numeric.

- Nominal values are sorted using the average class value for each one.
- For k-values k-1 binary attributes are generated
- The i-th attribute is 0 if an instance's value is one of the first i in the ordering, 1 otherwise.

Performing a best binary split with the original attribute is probably equivalent to splitting on one of the new attributes. (?)

#### Example

Outlook: Sunny/Rainy/Overcast -> isSunny: 1/0, isRainy: 1/0, isOvercast: 1/0.