Classification Rule Learning In ML

Users vs Specialists

In Machine Learning, a user may want to see a graph, table, or other simple representation of the data as a model. This helps them quickly see patterns and trends and get a basic understanding and feel for the overall data.

Once they've achieved that, a specialist can move on to richer representations (neural networks, decision trees, etc.) to achieve better predictive accuracy.

Decision Tables

Decision tables (as in, a table model - essentially the form the data was input in) are a simple way for users to see data. To classify an instance we do a simple lookup of its attribute values to find the class value.

This uses rote learning and memorisation - there's no generalisation or actual learning.

We define a schema as a set of attributes. If we take a subset of the attributes as the schema, we can compress the table and still classify new instances.

A body is a multiset (a set with repetitions) of labelled instances (each has a value for all the attributes plus the target attribute).
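
As a toy sketch (all names and data here are illustrative, not from the notes), compressing a body onto a schema and then classifying by lookup might look like this in Python:

```python
from collections import Counter

# Toy body: labelled instances with attributes Outlook and Temp, target Play.
body = [
    ({"Outlook": "Sunny", "Temp": "Hot"}, "No"),
    ({"Outlook": "Sunny", "Temp": "Mild"}, "No"),
    ({"Outlook": "Rainy", "Temp": "Mild"}, "Yes"),
    ({"Outlook": "Rainy", "Temp": "Hot"}, "Yes"),
]

def build_decision_table(body, schema):
    """Compress the body onto the schema: each combination of schema values
    maps to the majority class among the matching instances."""
    groups = {}
    for instance, label in body:
        key = tuple(instance[a] for a in schema)
        groups.setdefault(key, []).append(label)
    return {key: Counter(labels).most_common(1)[0][0] for key, labels in groups.items()}

table = build_decision_table(body, schema=("Outlook",))  # compress to one attribute
print(table)              # {('Sunny',): 'No', ('Rainy',): 'Yes'}
print(table[("Rainy",)])  # classify a new instance by simple lookup -> 'Yes'
```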

Learning Decision Tables

Let:

(1)
\begin{align} i &= 0 \text{ (counter)}\\ A_i &= A \text{ (attribute set)}\\ S_i &= \emptyset \text{ (schema)}\\ E_i &= \text{cross-validation estimate of the error for schema } S_i \end{align}

While $E_i$ is decreasing:

  • Find the best $a \in A_i$ (based on reducing $E_i$)
  • Remove $a$ from $A_i$ and add it to $S_i$
  • Increment $i$
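
A minimal Python sketch of this forward-selection loop, assuming a hypothetical `cross_val_error(schema, data)` helper that builds a decision table on the given schema and returns its cross-validation error estimate:

```python
# Sketch of decision-table learning by forward selection of attributes.
# cross_val_error() is a hypothetical helper supplied by the caller.

def learn_decision_table(attributes, data, cross_val_error):
    schema = set()                               # S_0 = empty schema
    remaining = set(attributes)                  # A_0 = all attributes
    best_error = cross_val_error(schema, data)   # E_0

    while remaining:
        # find the attribute whose addition most reduces the CV error
        a = min(remaining, key=lambda attr: cross_val_error(schema | {attr}, data))
        new_error = cross_val_error(schema | {a}, data)
        if new_error >= best_error:              # E_i stopped decreasing
            break
        schema.add(a)                            # move a from A_i into S_i
        remaining.remove(a)
        best_error = new_error

    return schema, best_error
```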

Cross-Validation

Cross-validation estimates error by partitioning the data into several sets (folds): for each fold we train on the remaining data, measure the error on the held-out fold, and then average those errors. This gives an overall estimate of the error.
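
A minimal k-fold cross-validation sketch, assuming hypothetical `train` (builds a model from labelled instances) and `error_rate` (measures a model's error on a held-out list) helpers supplied by the caller:

```python
def k_fold_cv_error(data, k, train, error_rate):
    """Partition data into k folds; for each fold, train on the rest,
    measure error on the held-out fold, and average the estimates."""
    folds = [data[i::k] for i in range(k)]   # simple round-robin partition
    errors = []
    for i, test_fold in enumerate(folds):
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)
        errors.append(error_rate(model, test_fold))
    return sum(errors) / len(errors)
```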

Methods of Cross-Validation in ML

Stratified Cross Validation

The idea behind stratification is to make the distribution of positive and negative instances in each set/fold/partition as close as possible to the global distribution.
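
A rough sketch of building stratified folds, assuming each instance is an `(attributes, label)` pair: group the instances by class and deal each group round-robin across the folds.

```python
from collections import defaultdict

def stratified_folds(data, k):
    """Deal instances of each class round-robin across k folds so that every
    fold's class distribution approximates the global distribution."""
    by_class = defaultdict(list)
    for instance in data:
        attributes, label = instance
        by_class[label].append(instance)

    folds = [[] for _ in range(k)]
    i = 0
    for label, instances in by_class.items():
        for instance in instances:
            folds[i % k].append(instance)
            i += 1
    return folds
```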

Rule Representations

(2)
\begin{align} \text{Antecedent} &\rightarrow \text{Consequent}\\ &\text{(if Antecedent then Consequent)} \end{align}
  • The Antecedent (pre-condition) is a series of tests or constraints on attributes (like the tests at decision tree nodes).
  • The Consequent (post-condition/conclusion) gives the class value or probability distribution on the class values (like the leaf nodes of a decision tree).

Rules like these (with a single consequent) are classification rules: if the antecedent is true for an instance, the rule asserts the consequent (the class value) for that instance.
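
One possible concrete representation (illustrative, not prescribed by the notes): the antecedent as a dict of attribute tests and the consequent as a class value.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: dict   # e.g. {"Outlook": "Sunny", "Temp": "Hot"} - all tests must hold
    consequent: str    # the class value the rule concludes, e.g. "Yes"

    def covers(self, instance):
        """The antecedent is true for an instance if every test matches."""
        return all(instance.get(attr) == val for attr, val in self.antecedent.items())

# Example: If Outlook = Sunny ^ Temp = Hot then Yes
rule = Rule({"Outlook": "Sunny", "Temp": "Hot"}, "Yes")
print(rule.covers({"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High"}))  # True
```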

Rule Conflicts

Different rules can clash and give rise to conflicts, so we need a policy for handling them. We can either:

  • give no conclusion
  • give one rule priority to override the others.

We also have the problem of some instances not being covered by rules, in which case we can either:

  • give no conclusion
  • give the training set's majority class.

e.g.

  • Rule 1: If Attr1 = Sunny ^ Attr2 = Hot, then Yes
  • Rule 2: If Attr2 = Hot ^ Attr3 = Humid, then No

The instance <Sunny, Hot, Humid> is covered by both rules, so it is classified as both Yes and No.
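
A sketch of classifying with a rule set under the policies above: conflicts resolved by rule priority (list order), uncovered instances falling back to the majority class (all names illustrative):

```python
def covers(antecedent, instance):
    return all(instance.get(a) == v for a, v in antecedent.items())

def classify(instance, rules, default_class):
    """rules is a priority-ordered list of (antecedent, consequent) pairs.
    Conflicts are resolved by priority; uncovered instances get the default."""
    conclusions = [consequent for antecedent, consequent in rules
                   if covers(antecedent, instance)]
    if not conclusions:
        return default_class     # not covered by any rule: majority-class fallback
    return conclusions[0]        # covered by conflicting rules: first one wins

rules = [
    ({"Attr1": "Sunny", "Attr2": "Hot"}, "Yes"),    # Rule 1
    ({"Attr2": "Hot", "Attr3": "Humid"}, "No"),     # Rule 2
]
instance = {"Attr1": "Sunny", "Attr2": "Hot", "Attr3": "Humid"}
print(classify(instance, rules, "Yes"))   # both rules fire; Rule 1's priority wins -> Yes
```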

Rules vs Trees

Trees are essentially if/else if/else rules, with a fixed order of execution.

However, rules are independent pieces of knowledge that can be examined and modified in isolation, whereas trees are hard to split up in that way. Rule sets can also be more compact than trees.

1-Rule (1R) Approach

A very simple rule learner has the form of a one-level decision tree, expressed as a set of rules that all test the same single attribute.

Algorithm

  • For each attribute a:
    • For each value v of a, make a rule as follows:
      • Count how often each class appears, and find the most frequent class, c
      • Set the rule to assign class c whenever a = v
    • Calculate the error rate of a's rule set
  • Choose the attribute whose rule set has the lowest error rate
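
A compact Python sketch of 1R, assuming the data is a list of (instance_dict, class_label) pairs (the representation is illustrative):

```python
from collections import Counter, defaultdict

def one_r(data, attributes):
    """1R: for each attribute, make one rule per value (predict the most
    frequent class for that value); keep the attribute whose rules err least.
    data is a list of (instance_dict, class_label) pairs."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)                  # value -> class frequencies
        for instance, label in data:
            counts[instance[attr]][label] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    attr, rules, errors = best
    return attr, rules, errors / len(data)             # chosen attribute, its rules, error rate
```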

Example

Attributes

  • Outlook (error = 4/14)
    • Sunny -> No
    • Overcast -> Yes
    • Rainy -> Yes
  • Temperature (error = 5/14)
    • Hot -> No
    • Mild -> Yes
    • Cool -> Yes

Hence we would pick the Outlook rules, as they have 10/14 correctly classified instances.

Complications

Missing or numeric values for attributes are more complicated. A usual strategy is to treat 'missing' as a distinct attribute value, and to break numeric attributes into discrete intervals at chosen breakpoints.
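
A rough sketch of one possible breakpoint strategy for a numeric attribute (a simplification of what 1R-style learners do, with a hypothetical `min_bucket` guard against creating one interval per instance):

```python
def discretise(values_and_labels, min_bucket=3):
    """Sort by the numeric value and insert a breakpoint whenever the class
    changes, but only once the current block holds at least min_bucket instances."""
    ordered = sorted(values_and_labels)          # list of (value, label) pairs
    breakpoints, block_size, prev_label = [], 0, None
    for value, label in ordered:
        if prev_label is not None and label != prev_label and block_size >= min_bucket:
            breakpoints.append(value)            # start a new interval here
            block_size = 0
        block_size += 1
        prev_label = label
    return breakpoints

# toy data: (numeric attribute value, class)
data = [(18, "No"), (19, "No"), (21, "Yes"), (22, "Yes"), (24, "Yes"), (27, "No"), (30, "No")]
print(discretise(data))   # [27] - candidate breakpoint along the numeric attribute
```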

Zero-R

This is the same as 1R, but we test zero attributes rather than one. Meaning that we predict the majority class in the whole training set, rather than for each value of each attribute. It's used as a baseline for comparing classifications.

It's the most general classifier - e.g. if the majority of the training instances say we play sport, we classify every new instance as 'play sport'. It's not normally useful on its own, but it's a good baseline to start from.
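
Zero-R in a few lines (same illustrative data representation as above):

```python
from collections import Counter

def zero_r(data):
    """Predict the majority class of the whole training set for every instance."""
    majority_class = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda instance: majority_class   # ignores the instance entirely

predict = zero_r([({"Outlook": "Sunny"}, "Yes"),
                  ({"Outlook": "Rainy"}, "Yes"),
                  ({"Outlook": "Overcast"}, "No")])
print(predict({"Outlook": "Rainy"}))   # "Yes" - the majority class, whatever the input
```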

Learning Disjunctive Sets of Rules

(That is, learning a set of rules whose disjunction describes the class - e.g. five rules, each covering different instances, that together classify something.)
When we're trying to learn rules we can do a couple of things:

  • Learn a decision tree and convert that to rules
    • Slow for large/noisy data sets
  • Sequential Covering Algorithm
    • Essentially:
      • Learn one highly accurate rule (it can cover any number of instances)
      • Remove the instances it correctly classifies
      • Repeat

Sequential Covering Algorithm

SequentialCovering(TargetAttribute, Attributes, Examples, Threshold):

  • Rule = LearnOneRule(TargetAttribute, Attributes, Examples)
  • While Performance(Rule, Examples) > Threshold:
    • LearnedRules += Rule
    • Examples -= CorrectlyClassifiedExamples
    • Rule = LearnOneRule(TargetAttribute, Attributes, Examples)
  • Sort LearnedRules according to their Performance over Examples
  • Return LearnedRules
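
A fairly direct Python transcription of the pseudocode, with `learn_one_rule` and `performance` left as hypothetical helpers supplied by the caller; rules are (antecedent_dict, class_value) pairs and examples are (instance_dict, class_value) pairs:

```python
def covers(antecedent, instance):
    """An antecedent (dict of attribute=value tests) covers an instance if every test matches."""
    return all(instance.get(a) == v for a, v in antecedent.items())

def sequential_covering(target_attribute, attributes, examples, threshold,
                        learn_one_rule, performance):
    """Learn rules one at a time, removing the examples each new rule correctly
    classifies, until no rule beats the threshold."""
    learned_rules = []
    rule = learn_one_rule(target_attribute, attributes, examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        antecedent, consequent = rule
        # drop the examples this rule classifies correctly
        examples = [(inst, label) for inst, label in examples
                    if not (covers(antecedent, inst) and label == consequent)]
        rule = learn_one_rule(target_attribute, attributes, examples)
    # consult better-performing rules first
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules
```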

LearnOneRule Algorithm

Find a single rule that covers some positive examples and no negative examples.
E.g. If Wind=Strong then PlayTennis=yes.

LearnOneRule(TargetAttribute, Attributes, Examples):

  • While there are positive examples:
    • NewAntecedent = the most general rule antecedent possible
    • While there are negative examples:
      • For each class value of the target attribute:
        • NewConstraints = TargetAttribute must take the current class value (e.g. a = +)
          • // We're specialising NewAntecedent by generating possible new constraint literals
        • Generate new candidate constraints, and set BestCandidate to the one maximising the performance of the specialised rule: Performance(SpecialiseAntecedent(NewAntecedent, candidate) -> NewConstraints)
        • NewAntecedent += BestCandidate
        • NewRule = NewAntecedent -> NewConstraints
        • If Performance(NewRule) > Performance(BestRule) then BestRule = NewRule
        • NewRuleNeg = the subset of the negative examples covered by NewAntecedent; keep only these as the negative examples
  • Return BestRule

LearnOneRule is a 'covering' approach - at each stage we find a rule that covers some of the instances.

The Performance(Rule) function is not defined here; a simple choice is to count the negative examples not covered by the antecedent (Negatives - NewRuleNeg). Since the consequent is simply taken to be the most frequent value of the target attribute among the examples the antecedent covers, this is not the best measure of performance.
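
A simplified greedy general-to-specific sketch of LearnOneRule for a single target class (no beam search, and it doesn't iterate over class values like the pseudocode above), using accuracy over the covered examples as the performance measure; all names are illustrative:

```python
def covers(antecedent, instance):
    """An antecedent (dict of attribute tests) covers an instance if every test matches."""
    return all(instance.get(a) == v for a, v in antecedent.items())

def learn_one_rule(target_class, attributes, examples):
    """Start from the most general antecedent (no tests) and repeatedly add the
    attribute=value constraint that most improves the rule, stopping when the
    rule covers no negative examples or no constraint improves it.
    examples is a list of (instance_dict, class_label) pairs."""

    def performance(antecedent):
        # accuracy of 'antecedent -> target_class' over the examples it covers
        covered = [label for inst, label in examples if covers(antecedent, inst)]
        return covered.count(target_class) / len(covered) if covered else 0.0

    antecedent = {}                                            # most general antecedent
    negatives = [(inst, lbl) for inst, lbl in examples if lbl != target_class]

    while any(covers(antecedent, inst) for inst, _ in negatives):
        # candidate specialisations: attribute=value pairs seen in covered examples
        candidates = {(attr, inst[attr])
                      for inst, _ in examples if covers(antecedent, inst)
                      for attr in attributes if attr not in antecedent and attr in inst}
        if not candidates:
            break
        best = max(candidates, key=lambda c: performance({**antecedent, c[0]: c[1]}))
        if performance({**antecedent, best[0]: best[1]}) <= performance(antecedent):
            break                                              # no candidate improves the rule
        antecedent[best[0]] = best[1]

    return antecedent, target_class                            # the rule: antecedent -> target_class
```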

Example

(Figure: screenshot of a worked example - image not included.)

Subtleties

  • It may use beam search (keeping the k best antecedents at each step rather than just one)
  • It generalises to multi-valued target functions
  • We need to choose the evaluation function that guides the search (how will we measure its progress?):
    • Entropy
    • Sample accuracy: $\frac{n_c}{n}$, where $n_c$ is the number of correct rule predictions and $n$ is the total number of predictions
    • m-estimate: $\frac{n_c + mp}{n + m}$ - an approximation to a Bayesian evaluation function: it smooths the sample accuracy towards the prior probability $p$ of the class, with $m$ controlling the weight given to the prior
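
The three measures written out as code, for a rule covering n examples of which n_c have the predicted class; p is the class's prior probability and m a chosen weight:

```python
import math

def sample_accuracy(n_c, n):
    """Relative frequency of correct predictions among covered examples."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """Smoothed accuracy: shrinks n_c/n towards the class prior p; m controls
    how many 'virtual' prior examples are added (m=0 gives plain accuracy)."""
    return (n_c + m * p) / (n + m)

def entropy(class_counts):
    """Entropy of the class distribution over the examples a rule covers
    (lower is better - the covered set is purer)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(sample_accuracy(12, 14))          # 0.857...
print(m_estimate(12, 14, p=0.5, m=2))   # 0.8125
print(entropy([12, 2]))                 # ~0.59 bits
```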

Aspects of Sequential Covering Algorithm

Sequential covering learns rules singly (learn one rule at a time), whereas decision tree induction learns all disjuncts simultaneously (i.e. it considers all of the data at every step, and the rules it has learned can be read off at the end).

Sequential covering chooses between all attribute-value pairs at each specialisation, whereas decision tree induction only chooses between the attributes, to work out what's best to add.

Assuming the final rule set contains n rules with an average of k conditions each, sequential covering makes nk independent selection decisions. This makes it more flexible in choosing attribute-value pairs, and it can work well when there is plenty of data.
In a decision tree, the equivalent decision is picking an attribute at an internal node, but that single choice is shared by all the rules (paths) that pass through the node, so far fewer independent decisions are made.

If a general-to-specific search is chosen, there is a single maximally general starting hypothesis, and a generate-and-test approach can be used to try all possible specialisations at each step.
If a specific-to-general search is chosen, we first have to determine the starting hypotheses (there are many possible maximally specific ones), and the search is typically example-driven, generalising hypotheses from the examples we have.

Entropy, the m-estimate, relative frequency and statistical significance tests can all be used as performance evaluation measures.

Rules and Exceptions

Exceptions can be added to rules quite easily, allowing incremental modification to easily incorporate new data.
These work quite well because people often think in terms of exceptions.

E.g. the grass is green, unless there's been a fire, in which case it's brown.

We can have exception-ception (exceptions within exceptions).
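
A tiny sketch of rules-with-exceptions: each rule carries a list of exception rules that are only consulted when the parent rule fires (names illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ExceptionRule:
    antecedent: dict                                  # attribute tests that must all hold
    consequent: str                                   # conclusion when the rule fires
    exceptions: list = field(default_factory=list)    # nested ExceptionRules

    def conclude(self, instance):
        """Return this rule's conclusion for the instance, unless an exception
        (or an exception's exception, etc.) fires and overrides it."""
        if not all(instance.get(a) == v for a, v in self.antecedent.items()):
            return None                               # rule does not apply
        for exc in self.exceptions:
            overridden = exc.conclude(instance)
            if overridden is not None:
                return overridden
        return self.consequent

# "The grass is green, unless there's been a fire, then brown."
grass = ExceptionRule({}, "green", exceptions=[ExceptionRule({"fire": True}, "brown")])
print(grass.conclude({"fire": False}))   # green
print(grass.conclude({"fire": True}))    # brown
```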

Induct Ripple-Down Rules (RDR)

Consider a random rule R that selects t cases at random from the data set. We then use the hypergeometric distribution to determine the probability that p of these belong to the correct class purely by chance (the less likely that is, the better the actual rule). It works well if the target function suits rules-with-exceptions.
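
A sketch of that probability using SciPy's hypergeometric distribution (reading "p of these" as "at least p"): with N cases in total, P of them in the target class, and a rule covering t cases of which p are in the class, the chance that a random rule selecting t cases would get at least p of the class is:

```python
from scipy.stats import hypergeom

def random_rule_prob(N, P, t, p):
    """Probability that a rule choosing t of the N cases at random would get
    at least p cases of the target class (P = class count in the data set).
    Small values mean the rule is unlikely to be this good by chance."""
    # hypergeom.sf(k, M, n, N) = P(X > k) for population M, n successes, sample N
    return hypergeom.sf(p - 1, N, P, t)

print(random_rule_prob(N=100, P=30, t=10, p=8))   # very small: rule looks significant
```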


Questions When Devising the Classification Rule Learning Program

  • Sequential or simultaneous covering of data?
  • General → specific, or specific → general?
    • Hence Generate-and-test, or example-driven?
  • Whether and how to post-prune?
  • What statistical evaluation function to use?