Association Rule Learning In ML

Data Mining for Associations is all about figuring out the links between different actions. If in a supermarket people buy milk when they buy cheese and bread 90% of the time, then we would make the following rule:
(buy(bread) ^ buy(cheese) -> buy(milk)).

The confidence factor of the above association rule would be 90%, as explained below.

Association Rules

Association rules are similar to classification rules, except that:

  • Any attribute (rather than a value of a specific attribute we're looking at) can appear in the consequent
  • There can be >= 1 there.

This means we can generate very many association rules , and hence use restrictions on the coverage and accuracy of rules, plus syntactic restrictions, to dramatically reduce the number of “interesting” and “potentially useful” rules generated.

Let Z be a set of items (e.g. set of products at a store, or the set of attribute-value pairs that describe a dataset).


Each item is an attribute-value constraint (boolean yes/no for does it exist in the itemset).


An itemset I is some subset of the items in Z


A transaction T is a pair of transaction identifiers and itemsets.

\begin{equation} T = (tid, I) \end{equation}

Transactions support (verb) all itemsets of I.

i.e. if I = "Bread, Milk, Cheese" and X = "Bread, Milk" then X is a subset of I and hence T = (3, "Bread, Milk, Cheese") supports X.

Transaction Database

A transaction database D is just a set of transactions (each with a unique transaction identifier).


The cover of an itemset in a database is the set of transaction identifiers of th transactions that support the itemset.

E.g. if X = "Bread, Milk" then the cover of (X,D) is the set of all transaction identifiers for transactions containing "Bread, Milk".

\begin{align} cover(X,D) := {tid| (tid, I) \in D, X \subseteq I} \end{align}


The support (noun) or coverage of an itemset is the number of transactions in the cover of the itemset.

I.e. Support(X,D) = numItems in cover(X,D).

So it's the number of transactions in the dataset that contain the itemset, and it's often expressed as the ratio of the number of cases that contain X out of the total dataset. (I.e it's fucking probability).

If the support(X,D) is higher than thresholdSupport (some minimum set each time) then we call the itemset frequent or large.

It's essentially the number of transactions/instances which the rule correctly predicts.

\begin{align} Support(X \rightarrow Y) = P(X) * P(Y) \end{align}

e.g. if probability of X = 4/10, and probability of Y = 2/10 then Support (X -> Y) = 0.08


The confidence or accuracy of a rule $X \rightarrow Y \in D$ is the proportion of transactions in the cover of X in D which are also in the cover of X and Y in D.

It's essentially the percentage of transactions/instances that were predicted correctly (out of all predicted instances).

e.g. if we see Milk 5 times out of 100, but Bread and Milk is seen 4 then the confidence is 4/5.

Measuring Association Rules


Like in Probability Theory where the probability of X given Y is the number of X^Y cases over the number of Y cases, the confidence of X -> Y is the support of X^Y over support of X.

\begin{align} P(X|Y) = \frac{Support(X and Y)}{Support(Y)} \end{align}

Hence probability of finding the consequent (X and Y) out of the cases that have the antecedent (Y).


The lift of a rule is just ratio of how many times both X and Y do occur over the expected number of times that X and Y should occur.

\begin{align} \frac{\text{actual number of X and Y}}{\text{average number of X and Y}} \end{align}

So it's the deviation of the support of the rule from what we expect it to be - the higher the better.

\begin{align} lift(X \rightarrow Y) = \frac{support (X \wedge Y)}{support(X) * support(Y)} \end{align}

So {X^Y} means find both X and Y in the same instance, not 'and' or 'or' or the like.

Because in order to have X occur it's P(X), and then Y on top is P(X) * P(Y), hence that's the expected. (Support(X) = P(X)).

Item Sets to Association Rules

We can have lots of different sets:

  • one-item sets (individual items with their attribute-value constraints)
  • two-item sets (pairs of one-item sets - as long as different attributes)
  • three-item sets (etc)

All itemsets have a support on the data set (number of instances that contain the itemset), and any itemsets below the minimum support threshold are discarded in favour of "large"/"frequent" itemsets

We generate association rules with a certain minimum confidence from each item set.


3-item set with support 4 can lead to seven different potential rules:

humidity = normal, windy = false, play = yes

Rule If Rule Then Confidence
If humidity = normal and windy = false then play = yes 4/4
If humidity = normal and play = yes then windy = false 4/6
If windy = false and play = yes then humidity = normal 4/6
If humidity = normal then windy = false and play = yes 4/7
If windy = false then humidity = normal and play = yes 4/8
If play = yes then humidity = normal and windy = false 4/9
If anything then humidity = normal and windy = false and play = yes 4/14

Efficient Generation of Item Sets

Two stage algorithm:

  • Generate item sets with specified minimum support (coverage)
  • Generate association rules with specified minimum confidence (accuracy) from each item set

Key Idea

A k–item set can only have minimum support if all of its k−1-item subsets have minimum support (are “large”).

  • First, generate all 1-item sets by making a pass through the data set and storing all those above minimum support in a hash table
  • Next, generate all 2-item sets from all pairs of 1-item sets from the hash table, calculate the support and discard those below minimum support.
  • Iterate and repeat for k-items until we have no more large itemsets

Apriori Algorithm

// L[n] = {large n-itemsets}
// C[n] = {all n-itemsets}
for (k = 2; L[k-1] != empty; k++) {
   // Generate new k-itemsets
   C[k] = apriori-gen(L[k-1])
   for (t = 0; t < numTinD; t++) {
      //Ct[t] = the itemsets in C[k] that are in the transaction
      Ct[t] = subset(C[k], t)
      for (c = 0; c < itemSets in Ct[t]; c++) {
         // Increment the number of times we see each of the candidate itemsets
   L[k] = {the itemSets in C[k] that have a count greater than minimum support}

About Apriori

Apriori (the best known data mining rule) uses the downward-closure property of support (an antimonotonic heuristic), which just says that for all frequent itemsets, each of their subsets is frequent.

Hence the other way around it is true that if any size k pattern is not frequent in the database its size k+1 super-pattern can't be frequent.

This counteracts the way the powerset grows exponentially by cutting down the number of sets at each layer.


Suppose we have five three-item sets: (ABC), (ABD), (ACD), (ACE) and (BCD) where A is a feature like outlook = sunny.

Let (ABCD) be the union of the first two three-item sets.
It is a candidate four-item set since all of its three-item subsets are large. What are they?

(ABC), (ABD), (ACD) and (BCD).

Assuming the three-item sets are sorted in lexical order, we only have to consider pairs with the same two first items; otherwise the resulting item set would have more than four items.

These are (ABC) and (ABD) plus (ACD) and (ACE), and we can forget about (BCD).
The second pair leads to (ACDE).

BUT all of the three-item subsets of (ACDE) are not large … What are they ?

(ACD) and (ACE) are large, but (ADE) and CDE) are not large. Therefore, we only have one candidate four-item set, (ABCD), which must be checked for actual minimum support on the data set.

Generating Association Rules

Key Idea In Generating Association Rules

If we have a rule $\text {A and B \rightarrow C and D}$ with above minimum support and minimum confidence.

This is a double-consequent rule (two consequents).

Now if we look at the two single-consequent rules:

$A \wedge B \wedge C \rightarrow D$

$A \wedge B \wedge D \rightarrow C$

Then based on the above double-consequent rule we can know that they must have greater than minimum support and confidence.


The single and double rules have the same support (the support is all the variables in the antecedent and consequent, regardless of which one they are in), which is hence greater than minimum.

Confidence = Support (Antecedent and Consequent) / Support (Antecedent).

The single-rules also have a bigger confidence than the double rules, as the double-consequent rule has a smaller (more general) antecedent, which hence matches more things (bigger support), which hence leads to a smaller confidence (given they have the same numerator).


The converse, a single-consequent rule which can form a double-consequent rule, that has less-than-minimum confidence, must obviously not lead to a more-than-minimum confidence, so we don't bother expanding it.

This gives a basis for an efficient algorithm for constructing rules by building up from candidate single-consequent rules to candidate double- consequent rules, etc.


  • Generate all single-consequent rules
  • Discard any below minimum confidence
  • Build up the potential double-consequent rules from the remaining single-consequent rules
  • Repeat
// L[n] = {large n-consequent rules}
// C[n] = {all n-consequent rules}
for (k = 2; L[k-1] != empty; k++) {
   // Generate new k-consequent rules
   C[k] = apriori-gen(L[k-1])
   for (t = 0; t < numNewRules; t++) {
      //Ct[t] = the transactions we're looking at
      if (Ct[t] matches the antecedent) {
         if (Ct[t] also matches the consequent) {
   L[k] = {the consequent rules in C[k] with confidence greater than minimum}
   // confidence =numConsequents/numAntecedents

Other Algorithms

Other algorithms include ECLAT and FP-Growth, which use more complex data structures to reduce search complexity.