Decision Trees In ML


Decision Trees are among the most popular Data Mining tools. They are classifiers: they predict categorical outputs from categorical or real-valued inputs.


  • Easy to understand
  • Easy to implement
  • Easy to use
  • Computationally Cheap


Overfitting is one drawback of decision trees.

Decision Tree Representation

  • Each internal node tests an attribute
  • Each branch corresponds to attribute value
  • Each leaf node assigns a classification

Tree Notation

There are many formats we can use to depict trees. The most common is the basic visual tree diagram. Aside from that we can use other graphical forms, or the indented text-based format used by C4.5.

Each level of the tree tests a different attribute, and each node at that level corresponds to a different value of that attribute.

A node shows the probability of its value (of the attribute tested at that level) leading to a positive or negative outcome (e.g. will having defaulted on a loan lead to a default on a credit card?).


  • We read this top-down and left-to-right
  • The same level of indentation indicates that the nodes are siblings (children at the same depth).
  • Indentation directly underneath a node means that the next line is its child.
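The indented text notation above can be produced mechanically from a nested-dictionary tree. A minimal sketch (the attribute names and tree shape here are hypothetical, purely for illustration):

```python
def print_tree(tree, indent=0):
    """Print a nested-dict decision tree in indented text form:
    deeper indentation means the line is a child of the node above."""
    for attr, branches in tree.items():
        for value, subtree in branches.items():
            line = " " * indent + f"{attr} = {value}"
            if isinstance(subtree, dict):
                print(line)
                print_tree(subtree, indent + 4)   # children are indented further
            else:
                print(line + f": {subtree}")      # a leaf carries the classification

# A tiny hand-built tree (hypothetical attributes and values):
tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Rain": "Yes"}}
print_tree(tree)
```

Reading the output top-down and left-to-right, lines at the same indentation are siblings, exactly as described above.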

When to Use

  • Instances describable by attribute–value pairs
  • Target function is discrete valued
  • Disjunctive hypothesis may be required
  • Possibly noisy training data


  • X OR Y
  • Calendar Scheduling

Top-Down Induction of Decision Trees (TDIDT)

This is an algorithm for generating these trees. It is essentially the ID3 algorithm, and is often referred to as such.


  1. Let A be the "best" decision attribute for the next node (i.e. which attribute should be tested next)
  2. Assign A as the decision attribute for the node
  3. For each value of A, create a new descendant of the node
  4. Sort the training examples to the leaf nodes (e.g. if a training example has weather = sunny then it is sorted into the 'sunny' node; rainy ones are put into the 'rainy' node)
  5. Recurse/iterate over the new children until all training examples are perfectly classified.
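The steps above can be sketched in a few lines, assuming examples are dicts mapping attribute names to values, with one key designated as the target. The "best" attribute here is chosen by information gain (introduced below); this is a minimal illustration, not a production implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from splitting on attr."""
    base = entropy([ex[target] for ex in examples])
    n = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

def id3(examples, attrs, target):
    """TDIDT: pick the best attribute, branch on its values, recurse."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attrs:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best], target)
    return tree
```

Calling `id3(examples, attributes, "play")` on a small weather-style dataset returns a nested dict where internal keys test attributes and leaves hold classifications.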


TDIDT searches a complete hypothesis space. It uses a preference bias, and overfitting is inevitable (hence the tree requires pruning).

There's been decades of research into refining it (e.g. dealing with continuous values and attributes with costs).

Performance of TDIDT can be improved with ensemble methods.

Attributes with Costs

Examining an attribute can have a cost. E.g. a BloodTest may have a cost of $150, or a robot movement may cost 23 seconds.

We want to have a low expected cost when we learn a tree. To do this we need to evaluate gain relative to cost.

We can do this in a few different ways. E.g.

\begin{align} Gain_{Cost}(S,A) = \frac{Gain^2(S,A)}{Cost(A)} \end{align}

where $Gain(S,A)$ is the ordinary information gain of attribute A on examples S.
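A tiny sketch of using this heuristic to pick an attribute. The attribute names, gains, and costs below are made-up illustrative numbers:

```python
def cost_adjusted_gain(gain, cost):
    # Gain^2(S, A) / Cost(A): rewards informative attributes,
    # penalises expensive ones.
    return gain ** 2 / cost

# Hypothetical candidates: (information gain in bits, measurement cost)
candidates = {
    "Temperature": (0.4, 10.0),    # cheap, moderately informative
    "BloodTest":   (0.9, 150.0),   # very informative, but expensive
}

best = max(candidates, key=lambda a: cost_adjusted_gain(*candidates[a]))
# The cheap attribute wins despite its lower raw gain.
```

Note how squaring the gain still lets a sufficiently informative attribute justify a high cost; other weightings trade this off differently.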

Unknown Attribute Values

What do we do if some instances are missing a value for an attribute A?

Here are three approaches:

  • If a node n tests A, assign the instance the most common value of A among the examples at node n
  • Assign the most common value of A among the examples at node n that have the same target value
  • Assign probability $p_i$ to each possible value $v_i$ of A (based on the observed frequencies among the other examples), then distribute the instance fractionally across the branches according to these probabilities.
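The third (probabilistic) approach can be sketched as a fractional split, where each example carries a weight and an example with a missing value is shared across all branches in proportion to $p_i$. This is a minimal illustration, assuming a missing value is represented as `None`:

```python
from collections import Counter

def fractional_split(examples, attr):
    """Split examples on attr. Examples missing attr (value None) are
    distributed across all branches, weighted by the observed value
    frequencies p_i among the examples where attr is known."""
    known = [ex for ex in examples if ex[attr] is not None]
    counts = Counter(ex[attr] for ex in known)
    total = sum(counts.values())
    branches = {v: [] for v in counts}
    for ex in examples:
        if ex[attr] is not None:
            branches[ex[attr]].append((ex, 1.0))        # full weight
        else:
            for v, c in counts.items():
                branches[v].append((ex, c / total))     # weight p_i
    return branches
```

Downstream computations (entropy, gain) then sum weights instead of counting examples.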


Windowing

When the training set is too large for memory we can use ID3's windowing technique.

  1. Let the Window be a subset of the instances
  2. Construct the decision tree from all instances in the window
  3. Use this tree to classify the rest of the instances
  4. If all the instances are correctly classified then stop
  5. Else add selected misclassified instances to the window and repeat the construction.
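The loop above can be sketched as follows. `build_tree` and `classify` are assumed to be supplied by the caller (e.g. the `id3` routine and a tree-walking classifier); this sketch also assumes each instance stores its label under a `"label"` key:

```python
import random

def windowed_train(instances, build_tree, classify, window_size, max_rounds=10):
    """ID3-style windowing: train on a subset (the window), then grow
    the window with misclassified instances until the tree fits all
    instances or the round budget runs out."""
    window = random.sample(instances, window_size)   # step 1: initial window
    for _ in range(max_rounds):
        tree = build_tree(window)                    # step 2: build from window
        wrong = [ex for ex in instances
                 if classify(tree, ex) != ex["label"]]   # step 3: classify rest
        if not wrong:
            return tree                              # step 4: all correct -> stop
        window = window + wrong                      # step 5: grow window, repeat
    return tree
```

Each round only the window needs to be held in memory by the tree builder, which is the point of the technique.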

This is related to ensemble learning.

See ID3 Hypothesis Space Search