Entropy In ML

Information theory defines entropy as a probability-based measure of the uncertainty of a decision: 0 if completely certain, maximal if completely uncertain (1 for a two-way decision with equal probabilities), and normally somewhere in between.

It is hence a measure of how much information we don't know (how uncertain we are about the data), and it can be used to measure how much information an attribute reveals about the target attribute.

Which Attribute is Best?

The attribute with the largest expected reduction in entropy (for the target attribute) is the 'best' attribute to use next. We see this when we talk about information gain.

This is because a large expected reduction in entropy means that knowing the attribute's value removes a lot of uncertainty about the target; the attribute is highly informative.
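As a sketch of this idea (the function and field names here are illustrative, not from these notes), the "expected reduction in entropy" is the standard ID3-style information gain: the entropy of the target before the split, minus the weighted entropy of each subset after splitting on the attribute:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum p_j log2 p_j over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Expected reduction in target entropy from splitting on `attribute`.

    `examples` is a list of dicts; `attribute` and `target` are keys.
    """
    labels = [e[target] for e in examples]
    before = entropy(labels)
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

# Toy data (hypothetical): "windy" perfectly predicts "play", "sunny" tells us nothing.
data = [
    {"windy": "yes", "sunny": "yes", "play": "no"},
    {"windy": "yes", "sunny": "no",  "play": "no"},
    {"windy": "no",  "sunny": "yes", "play": "yes"},
    {"windy": "no",  "sunny": "no",  "play": "yes"},
]
information_gain(data, "windy", "play")   # 1.0 — the 'best' attribute
information_gain(data, "sunny", "play")   # 0.0 — no reduction in uncertainty
```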

Generalised Formula

From information theory we know that the optimal number of bits to encode a symbol with probability p is $-\log_2 p$.

Using this we can see that the smallest number of bits on average per symbol needed will be:

\begin{align} H(x) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_n \log_2 p_n \\ H(x) = - \sum_{j=1}^{n}p_j \log_2 p_j \\ \end{align}

H(x) is known as the entropy of x.

Where x is the attribute, and each $p_j$ is the probability that x takes its j-th value.

We can think of the probability as being:

\begin{align} \frac{|S_j|}{|S|} \end{align}

Where $|S_j|$ is the number of instances where the attribute takes its j-th value, and $|S|$ is the total number of instances.

Hence Entropy becomes:

\begin{align} - \sum_{j=1}^{n}\frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|} \\ \end{align}
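As a minimal sketch (names are illustrative), this count-based entropy can be computed directly from a list of values, with each $p_j = |S_j| / |S|$:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy of a list of attribute values, using p_j = |S_j| / |S|."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A 50/50 split is maximally uncertain; a single-valued set has entropy 0.
entropy(["yes", "no", "yes", "no"])   # 1.0
entropy(["yes", "yes", "yes"])        # 0.0
```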


  • Let S be a set of training examples
  • Let p+ be the proportion of positive examples in S
  • Let p- be the proportion of negative examples in S

We saw earlier the formula for optimally encoding the class of an attribute. Given that here there are just two classes, we can see that the entropy is:

\begin{align} Entropy(S) \equiv -p_+ \log_2 p_+ - p_- \log_2 p_- \end{align}

Where by default (i.e. if the attribute is left out) we are testing against the target attribute, which here takes a yes/no value.

We call this uncertainty "impurity", and a "pure" set of training examples is one where all examples are of the same 'class' (positive or negative).

Entropy is always non-negative: a completely pure set has entropy 0, and we can't ever be more pure than pure.
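A small sketch of the two-class formula (names illustrative), taking $0 \log_2 0 = 0$ by convention so that pure sets come out as exactly 0:

```python
from math import log2

def binary_entropy(p_pos):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0 log 0 taken as 0."""
    p_neg = 1.0 - p_pos
    h = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            h -= p * log2(p)
    return h

binary_entropy(0.5)   # 1.0  (maximally impure)
binary_entropy(1.0)   # 0.0  (pure)
```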

Multi-Stage Decision Property

\begin{align} Entropy([p,q,r]) = Entropy([p,q+r]) + (q+r) \cdot Entropy([\frac{q}{q+r},\frac{r}{q+r}]) \end{align}

Where p + q + r = 1

(Hence if given counts of say p = 2, q = 3, r = 4, we normalise and use Entropy([2/9, 3/9, 4/9]).)
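We can check the multi-stage property numerically for this example (a sketch; H is just the entropy formula from above applied to a list of probabilities):

```python
from math import log2

def H(probs):
    """Entropy of a probability distribution, with 0 log 0 taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

p, q, r = 2/9, 3/9, 4/9

# Decide in one stage, or first "p vs (q or r)" and then "q vs r":
lhs = H([p, q, r])
rhs = H([p, q + r]) + (q + r) * H([q / (q + r), r / (q + r)])
# lhs and rhs agree (up to floating-point error)
```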

Transmitting Bits

A Place Where Entropy is Used

How can we encode possibilities (e.g. X is either A, B, C or D) as binary strings, in order to send a stream of these (e.g. ABCAADABC) using as few bits as possible?

Four Probabilities All Equal

Here we can very simply set:

  • A = 00
  • B = 01
  • C = 10
  • D = 11

This makes it very easy to do random accessing of the xth possibility, and it averages 2 bits per symbol.

Three Probabilities All Equal

We could simply do:

  • A = 00
  • B = 01
  • C = 10

But that's naive. We can actually average only 5/3 ≈ 1.67 bits. How?

  • A = 0
  • B = 10
  • C = 11

This means we can't randomly access the xth possibility, but 1/3 of the time we only use 1 bit, saving us transmission space.
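To check the claimed average (a sketch): the code lengths are 1, 2 and 2 bits, each used with probability 1/3, and the entropy $\log_2 3$ is the lower bound no code can beat:

```python
from math import log2

# Code lengths for A = 0, B = 10, C = 11, each symbol with probability 1/3.
avg_bits = sum((1 / 3) * length for length in (1, 2, 2))  # 5/3 ≈ 1.67
entropy_bound = log2(3)                                   # ≈ 1.585, the theoretical minimum
```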

Four Probabilities All Different

If we have all different probabilities we might split it up as:

  • A (1/2 prob) = 0
  • B (1/4 prob) = 10
  • C (1/8 prob) = 110
  • D (1/8 prob) = 111

and this averages 1.75 bits per symbol.
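We can verify this figure (a sketch): because each code length here equals $-\log_2 p$ exactly, the average code length coincides with the entropy of the distribution:

```python
from math import log2

probs   = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
lengths = {"A": 1,   "B": 2,   "C": 3,   "D": 3}

avg_bits     = sum(p * lengths[s] for s, p in probs.items())  # 1.75
entropy_bits = -sum(p * log2(p) for p in probs.values())      # 1.75
# Each length equals -log2(p), so the code meets the entropy bound exactly.
```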

The Generalised Formula

Using the generalised formula above, we can work out the optimal average number of bits per symbol for any distribution of possibilities: the entropy is the minimum achievable.