Reinforcement Learning

Reinforcement Learning is learning in which the agent is not presented with target outputs, but instead receives a reward signal, which it aims to maximise.

There are many applications for which it is difficult, inappropriate, or even impossible to provide a “training set”.

  • optimal control
    • mobile robots, pole balancing, flying a helicopter
  • resource allocation
    • job shop scheduling, mobile phone channel allocation
  • mix of allocation and control
    • elevator control, backgammon

Reinforcement Learning Framework

  • An agent interacts with its environment.
  • There is a set S of states and a set A of actions.
  • At each time step t, the agent is in some state $s_t$. It chooses an action $a_t$, moves to the next state $s_{t+1} = \delta(s_t, a_t)$ and receives a reward of $r(s_t, a_t)$.
  • The reward function $r$ and transition function $\delta$ may be stochastic (multi-valued)
  • The aim is to find an optimal policy $\pi : S \rightarrow A$ which will maximise the cumulative reward.
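The interaction loop above can be sketched directly. The 3-state environment, `delta`, and `r` below are invented for illustration, not taken from the notes:

```python
# A minimal sketch of the agent-environment loop: states 0..2,
# actions -1 (left) and +1 (right). Hypothetical environment.

def delta(s, a):
    """Deterministic transition function: move left or right, clipped to [0, 2]."""
    return max(0, min(2, s + a))

def r(s, a):
    """Reward function: reaching state 2 pays 1, everything else 0."""
    return 1 if delta(s, a) == 2 else 0

def run_episode(policy, s0=0, steps=5):
    """Follow a policy pi: S -> A and accumulate the reward."""
    s, total = s0, 0
    for t in range(steps):
        a = policy(s)
        total += r(s, a)
        s = delta(s, a)
    return total

always_right = lambda s: +1
print(run_episode(always_right))  # -> 4 (reaches state 2 on step 2, then stays)
```

The policy here is just a function from states to actions, matching the definition $\pi : S \rightarrow A$ above.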

Models of Optimality

To work out if a fast nickel is worth a slow dime (reward now vs bigger reward later), we can use the following methods:

Finite Horizon Reward

Computationally simple.

\begin{align} \sum^h_{i=0}r_{t+i} \end{align}

Average Reward

Hard to deal with because we can't sensibly choose between a small reward soon and a large one very far in the future.

\begin{align} \lim_{h \rightarrow \infty} \frac{1}{h}\sum_{i=0}^{h-1} r_{t+i} \end{align}

Infinite Discounted Reward

Easier for proving theorems.

\begin{align} \sum^{\infty}_{i=0} \gamma ^i r_{t+i}, 0 \leq \gamma < 1 \end{align}

Comparing Models of Optimality
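The three models can be compared numerically on a fixed reward sequence; the values below are illustrative:

```python
# Compare the three models of optimality on one hypothetical reward stream.
rewards = [1, 0, 2, 0, 4, 0, 8]   # r_t, r_{t+1}, ... (illustrative values)

h = 4
finite_horizon = sum(rewards[:h + 1])               # sum_{i=0}^{h} r_{t+i}
average = sum(rewards) / len(rewards)               # (1/h) sum of rewards seen
gamma = 0.9
discounted = sum(gamma**i * r for i, r in enumerate(rewards))

print(finite_horizon, average, discounted)
```

Note how the discounted sum weights the later, larger rewards down by $\gamma^i$, while the finite-horizon sum ignores them entirely once they fall past the horizon.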


Delayed Reinforcement


If you start at state 2, the best policy is to stay there and take the green arrow.
If you start at state 1 with a low discount factor, it is better to stay at 1; otherwise it is better to move to 2.

Exploration Exploitation Tradeoff

Most of the time we should choose what we think is the best action. However, in order to ensure convergence to the optimal strategy, we must occasionally choose something different from our preferred action, e.g.

  • Choose a random action 5% of the time, or
  • Use a Boltzmann distribution to choose the next action:
\begin{align} P(a) = \frac{e^{\frac{V(a)}{T}}}{\sum_{b \in A} e^{\frac{V(b)}{T}}} \end{align}
where T is a temperature: high T makes the choice nearly random, low T nearly greedy.
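Both strategies can be sketched in a few lines; the action-value lists and parameter names here are illustrative:

```python
import math
import random

def boltzmann_probs(values, T):
    """P(a) = exp(V(a)/T) / sum_b exp(V(b)/T); T is the temperature."""
    exps = [math.exp(v / T) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def epsilon_greedy(values, eps=0.05):
    """Pick the best-valued action, but a random one eps of the time."""
    if random.random() < eps:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])
```

As T grows the Boltzmann probabilities approach uniform; as T shrinks toward 0 the choice approaches the greedy one.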

Temporal Difference Learning (TDL)

In TD learning we assume that, over time, our value estimates get better rather than worse. The (discounted) value of the next state, plus the immediate reward, is used as the target value for the current state.

A more sophisticated version, called TD(λ), uses a weighted average of the values of future states. This can sometimes help, but isn't always needed.

The problem with TD learning is having to learn a value for every state. For problems with very large numbers of possible states this becomes infeasible.

TD(0): use $V_{k+1}$ as the training value for $V_k$

$TD(\lambda )$: use $T_k$ as the training value for $V_k$ where:

\begin{align} T_k = (1 - \lambda ) \sum^m_{t = k + 1} \lambda^{t - 1 - k} V_t + \lambda^{m - k} V_{m+1} \end{align}
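As a sanity check, the target formula can be computed directly; this sketch assumes the value estimates $V_t$ are stored in a list (the indices below are illustrative):

```python
def td_lambda_target(V, k, m, lam):
    """T_k = (1 - lam) * sum_{t=k+1}^{m} lam^(t-1-k) * V_t  +  lam^(m-k) * V_{m+1}.

    V is indexed so V[t] is the value estimate of the t-th state."""
    weighted = sum(lam**(t - 1 - k) * V[t] for t in range(k + 1, m + 1))
    return (1 - lam) * weighted + lam**(m - k) * V[m + 1]
```

With λ = 0 this reduces to the TD(0) target $V_{k+1}$; with λ = 1 it reduces to $V_{m+1}$, i.e. training on the final value alone.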


Q-Learning is a variation on TD-Learning where, instead of learning a value function, we learn a 'Q' function.

For each state s in the set of states S, let V*(s) be the maximum discounted reward obtainable from s, and let Q(s, a) be the discounted reward obtainable by first doing action a and then acting optimally. Note that $V^*(s) = \max_a Q(s, a)$.

A policy has to be prepared to start at any state, and hence we have to learn a value for not only every state, but for every action with every state.
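A minimal tabular sketch, using the standard one-step Q-learning update; the 3-state chain environment and all parameter values are invented for illustration:

```python
import random

def q_learning(n_states=3, n_actions=2, episodes=500,
               alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a chain where moving right from the last
    state pays reward 1. Action 0 = left, action 1 = right."""
    random.seed(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]   # one entry per (state, action)
    for _ in range(episodes):
        s = 0
        for _ in range(10):
            # epsilon-greedy action selection (exploration vs exploitation)
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
            reward = 1.0 if s2 == n_states - 1 else 0.0
            # Q-update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s][a] += alpha * (reward + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the table holds a value for every (state, action) pair, which is exactly why tabular methods break down when the state space is large.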


If $0 \leq \gamma < 1$

\begin{align} x = reward_0 + \gamma\, reward_1 + \gamma^2 reward_2 + \gamma^3 reward_3 + \; ... \end{align}

If all rewards are 1:

\begin{align} x = 1 + \gamma + \gamma^2 + \gamma^3 + ... \end{align}
\begin{align} \gamma x = \gamma + \gamma^2 + \gamma^3 + ... \end{align}
\begin{align} x - \gamma x = 1, \quad \text{so} \quad x = \frac{1}{1 - \gamma} \end{align}
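A quick numeric check of this limit, truncating the infinite sum (γ = 0.9 is chosen arbitrarily):

```python
# Truncated version of x = 1 + gamma + gamma^2 + ... for all-ones rewards;
# 1000 terms is plenty, since gamma^1000 is vanishingly small.
gamma = 0.9
x = sum(gamma**i for i in range(1000))
print(x, 1 / (1 - gamma))  # both ~10.0
```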

Example: Backgammon


\begin{align} w \leftarrow w + \eta (T - V) \frac{\partial V}{\partial w} \end{align}

Different learning methods choose the target T differently:

  • We could learn moves from example games
  • We could set T to be the final outcome of the game
  • We could use Temporal Difference Learning
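The update rule above can be sketched for a linear value function $V(s) = w \cdot \phi(s)$ (an assumed form chosen for simplicity; TD-Gammon actually used a multi-layer neural network):

```python
# Sketch of w <- w + eta * (T - V) * dV/dw for linear V(s) = w . phi(s).
# For a linear V, the partial derivative dV/dw_i is just the i-th feature.
def td_weight_update(w, features, T, eta=0.1):
    V = sum(wi * fi for wi, fi in zip(w, features))
    return [wi + eta * (T - V) * fi for wi, fi in zip(w, features)]
```

Whichever way T is chosen, the same gradient step applies; only the target value differs.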

It worked for backgammon because:

  • random dice rolls in Backgammon force self-play to explore a much larger part of the search space than it otherwise would.
  • humans are bad at probabilistic reasoning?
  • an evolutionary algorithm can also produce a surprisingly strong player, but a gradient-based method such as TD-learning is better able to fine-tune the rarely used weights, and exploit the limited non-linear capabilities of the neural network.

Limits of Theoretical Results

  • Delayed reinforcement
    • Reward resulting from an action may not be received until several time steps later, which slows down learning
  • Search space must be finite
    • convergence is slow if the search space is large
    • Relies on visiting every state infinitely often
  • For “real world” problems, we can’t rely on a lookup table
    • Need to have some kind of generalisation (e.g. TD-Gammon)
    • A lookup table only knows about a state if it has actually been visited; when there are astronomically many states (e.g. chess) this is impractical