Uncertainty

Previously we've used logic to represent facts and used planning algorithms to identify desirable actions. Now we reevaluate this approach by introducing uncertain information.

Probability theory provides the basis for our treatment of systems that reason under uncertainty. Also, because actions are no longer certain to achieve goals, agents will need ways of weighing up the desirability of goals and the likelihood of achieving them. For this, we use utility theory. Probability theory and utility theory together constitute decision theory, which allows us to build rational agents for uncertain worlds.

Uncertainty can also arise because of incompleteness and incorrectness in the agent's understanding of the properties of the environment.

Uncertainty Example

Let action A(t)= leave for airport t minutes before flight

Will A(t) get me there on time? Problems:

  • partial observability, noisy sensors
  • uncertainty in action outcomes (flat tire, etc.)
  • immense complexity of modelling and predicting traffic

Hence a purely logical approach either

  1. risks falsehood: “A(25) will get me there on time”, or
  2. leads to conclusions that are too weak for decision making:

“A(25) will get me there on time if there’s no accident on the bridge and it doesn’t rain and my tires remain intact etc etc.”
(A(1440) might be safe but I’d have to stay overnight in the airport …)

Other plans, such as A(50), might increase the agent's belief that it will get to the airport on time, but also increase the likelihood of a long wait.

Methods for Handling Uncertainty

Default/Non-monotonic Logic

Make assumptions, e.g. assume A(t) works unless contradicted by evidence.
The issues are which assumptions are reasonable, and how to handle contradictions.

Probability

Use probability theory (which dates back to Cardano's 1565 analysis of gambling) to work out the probability of A(t) succeeding, given the available evidence.

Using first-order logic fails to cope with uncertain domains such as medical diagnosis because of:

  • Laziness: failure to enumerate all the exceptions, qualifications, etc. (e.g. we don't write down every condition attached to a symptom of a disease)
  • Theoretical Ignorance: no complete theory of the domain exists (e.g. medical science has no complete account of the disease)
  • Practical Ignorance: lack of relevant facts, initial conditions, etc. (e.g. we don't know the patient's specific test results).

Probability allows us to summarise the uncertainty that comes from laziness and ignorance.

What is the 'right' thing to do?

The right thing to do, the rational decision, depends on both the relative importance of various goals and the likelihood that they will be achieved (and to what degree).

e.g. depends on preferences for missing flights vs long waits.

Utility theory is used to represent and infer preferences
Decision theory = utility theory + probability theory

Subjective/Bayesian Probability

These probabilities relate propositions to one’s own state of knowledge. These are not claims of a “probabilistic tendency” in the current situation (but might be learned from past experience of similar situations). Probabilities of propositions change with new evidence:
e.g.

  • P(A(25)|no reported accidents) = 0.06
  • P(A(25)|no reported accidents, 5 a.m.) = 0.15

Probability Basics

Start with a set S - the sample space (e.g. {1, 2, 3, 4, 5, 6} = the 6 possible rolls of a die).

$s \in S$ is a sample point/possible world/atomic event.

A probability space/model is a sample space together with an assignment P(s) for every $s \in S$.

Rules:

  • $0 \leq P(s) \leq 1$
  • $\sum_s P(s) = 1$
    • e.g. $P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$

An event A is any subset of S.

(1)
\begin{align} P(A) = \sum_{s \in A} P(s) \end{align}

e.g.

(2)
\begin{align} P(\text{die roll} < 4) = P(1) + P(2) + P(3) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2} \end{align}
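As a minimal illustrative sketch (not from the original notes), the die example in Python, with events as sets of sample points:

    from fractions import Fraction

    # Sample space for one die roll: each of the 6 sample points gets P(s) = 1/6.
    P = {s: Fraction(1, 6) for s in range(1, 7)}

    def prob(event):
        # P(A) = sum of P(s) over the sample points s in the event A
        return sum(P[s] for s in event)

    print(prob({1, 2, 3}))  # die roll < 4  ->  1/2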

Random Variable Functions

A random variable is a function from sample points to some range (e.g. the Reals or Booleans).

Example

Odd(3) = true.
Odd is the function, 3 is the sample point, and true is the value (in the Boolean range) that Odd assigns to it.

Probability Distribution

P induces a probability distribution for any random variable X:

(3)
\begin{align} P(X = x_i) = \sum_{s:X(s) = x_i} P(s) \end{align}

e.g.

(4)
\begin{align} P(Odd(diceRoll) = true) = P(1) + P(3) + P(5) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2} \end{align}
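A sketch of the same idea in code, treating Odd as a function on sample points (assuming the fair-die space from the earlier sketch):

    from fractions import Fraction

    P = {s: Fraction(1, 6) for s in range(1, 7)}  # fair-die sample space

    def odd(s):
        # a random variable: maps each sample point to a Boolean
        return s % 2 == 1

    def dist(X):
        # P(X = x) = sum of P(s) over the sample points s with X(s) = x
        d = {}
        for s, p in P.items():
            d[X(s)] = d.get(X(s), 0) + p
        return d

    print(dist(odd))  # {True: Fraction(1, 2), False: Fraction(1, 2)}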

Propositions

A proposition is the event (the set of sample points) for which the proposition is true.

Given boolean random variables A and B:

  • event a = set of sample points where A(s) = true
  • event ¬a = set of sample points where A(s)= false
  • event a ∧ b = set of sample points where A(s) = true and B(s) = true

With Boolean variables, a sample point is the same thing as a propositional logic model, and a proposition is equivalent to the disjunction of the atomic events in which it is true.
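In code, these events are just set operations on the sample space (a small sketch, reusing the die example):

    S = {1, 2, 3, 4, 5, 6}                 # sample points of one die roll

    a = {s for s in S if s % 2 == 1}       # event a: Odd(s) = true
    b = {s for s in S if s < 4}            # event b: roll < 4

    not_a   = S - a                        # event ¬a: complement      -> {2, 4, 6}
    a_and_b = a & b                        # event a ∧ b: intersection -> {1, 3}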

Why use probability?

The definitions imply that certain logically related events must have related probabilities. Moreover, an agent who bets according to probabilities that violate these axioms can be forced into a set of bets that loses money regardless of the outcome (de Finetti's argument).

I.e. we need to follow the axioms.

Syntax for Propositions

Random variables are written with a capital letter, e.g.

Cavity = <true, false>, whereas the lowercase proposition cavity is shorthand for Cavity = true.

Propositional/Boolean Random Variables

e.g. Cavity = <true, false>

Discrete (finite/infinite) Random Variables

e.g. Weather = <sunny, rain, cloudy, snow>

Continuous (bounded/unbounded) Random Variables

e.g. Temp = 21.6, or Temp < 22.0

Prior Probability

Prior or unconditional probabilities of propositions are degrees of belief prior to the arrival of any new evidence.

e.g. P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72 correspond to belief prior to arrival of any new evidence.

A probability distribution gives values for all possible assignments:
P(Weather) = <0.72,0.1,0.08,0.1> (normalised so as to sum to 1).

Joint Probability

A joint probability distribution for a set of Random Variables gives the probability of every atomic event (sample point) on those Random Variables.

Full joint distribution for Toothache, Cavity, Catch:

                    toothache            ¬toothache
                 catch    ¬catch      catch    ¬catch
    cavity       0.108    0.012       0.072    0.008
    ¬cavity      0.016    0.064       0.144    0.576

Every question about a domain can be answered by the joint distribution, because every event is just a sum of sample points.

Probability for Continuous Variables

With continuous variables we express the distribution as a parameterised function of the value.

e.g. P(X=x) = U[18,26](x) = uniform density between 18 and 26.

[Figure: the uniform density U[18,26], with constant height 0.125 between 18 and 26.]

Here P is a density and hence integrates to 1.
P(X = 20.5) = 0.125 really means

(5)
\begin{align} \lim_{dx \rightarrow 0} \frac{P(20.5 \leq X \leq 20.5 + dx)}{dx} = 0.125 \end{align}
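The 0.125 itself just comes from the width of the interval:

\begin{align} U[18,26](x) = \frac{1}{26-18} = 0.125 \quad \text{for } 18 \leq x \leq 26 \end{align}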

Gaussian Density

Gaussian functions describe normal distributions (commonly known as a bell curve).

[Figure: the bell-shaped curve of the Gaussian (normal) density.]

(6)
\begin{align} P(x) = \frac{1}{\sigma \sqrt{2 \pi}}e^{\frac{-(x-\mu)^2}{2 \sigma^2}} \end{align}
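A one-function sketch of this density (my own illustration, not from the notes):

    import math

    def gaussian_pdf(x, mu=0.0, sigma=1.0):
        # Normal density with mean mu and standard deviation sigma.
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    print(gaussian_pdf(0.0))  # ~0.3989, the peak of the standard normal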

Conditional Probability

(7)
\begin{align} P(a|b) = \frac{P(a \wedge b)}{P(b)} \text{ if } P(b) \neq 0 \end{align}

Conditional or posterior probabilities are written P(x|y), read "x given y".
e.g. P(cavity|toothache) = 0.8

More general observations remain valid in the face of evidence, but are not always useful.

New evidence might be irrelevant and hence allow simplification. This kind of inference, sanctioned by domain knowledge, is crucial.
e.g. P(cavity|toothache, 49ersWin) = P(cavity|toothache) = 0.8
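Rearranging the definition of conditional probability gives the product rule, which the chain rule below applies repeatedly:

\begin{align} P(a \wedge b) = P(a|b)P(b) = P(b|a)P(a) \end{align}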

Notation

  • P(X, Y) denotes a table of values, an |X|-element vector of |Y|-element vectors, with one entry per pair of values.
  • When one of the variables is known to take a particular value a, it is written in lowercase: P(x, y, a) denotes the single number P(X = x ∧ Y = y ∧ A = a).

Chain Rule

Chain rule says that:

(8)
\begin{equation} P(X_1, ..., X_n) = P(X_1, ..., X_{n-1})P(X_n|X_1, ..., X_{n-1}) \end{equation}

And this can be successively applied to get:

(9)
\begin{align} P(X_1, ..., X_n) = \prod_{i=1}^n P(X_i|X_1, ..., X_{i-1}) \end{align}
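For instance, with three variables:

\begin{align} P(X_1, X_2, X_3) = P(X_1)P(X_2|X_1)P(X_3|X_1, X_2) \end{align}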

Inference by Enumeration

These queries use the full joint distribution for Toothache, Cavity and Catch given above: the probability of any proposition is the sum of the sample points where it holds, e.g. P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2.

We can also compute conditional probabilities, such as:

(10)
\begin{align} P(\neg cavity|toothache) = \frac{P(\neg cavity \wedge toothache)}{P(toothache)} = \frac{0.016 + 0.064}{0.108+0.012+0.016+0.064} = \frac{0.08}{0.2} = 0.4 \end{align}

Normalisation

General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables.
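A minimal sketch of this in Python, assuming the toothache joint table above (keys are (toothache, catch, cavity) truth values):

    # Full joint distribution P(Toothache, Catch, Cavity), from the table above.
    joint = {
        (True,  True,  True):  0.108, (True,  False, True):  0.012,
        (False, True,  True):  0.072, (False, False, True):  0.008,
        (True,  True,  False): 0.016, (True,  False, False): 0.064,
        (False, True,  False): 0.144, (False, False, False): 0.576,
    }

    def p_cavity_given_toothache():
        # Fix the evidence (Toothache = true), sum out the hidden variable
        # (Catch), then normalise over the query variable (Cavity).
        unnorm = {cavity: sum(joint[(True, catch, cavity)] for catch in (True, False))
                  for cavity in (True, False)}
        alpha = 1 / sum(unnorm.values())
        return {cavity: round(alpha * p, 3) for cavity, p in unnorm.items()}

    print(p_cavity_given_toothache())  # {True: 0.6, False: 0.4}, matching (10)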

Independence

A and B are independent iff

(11)
\begin{align} P(A|B) = P(A) \text{ or } P(B|A) = P(B) \text{ or } P(A,B) = P(A)P(B) \end{align}

e.g. Weather is independent of the dental variables, so:

\begin{align} P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity)P(Weather) \end{align}

This decomposes one 32-entry table into an 8-entry table and a 4-entry table.

Absolute independence is powerful but rare.
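A quick numerical way to test absolute independence from a joint table (a sketch of my own, for two Boolean variables):

    def independent(joint, tol=1e-9):
        # True iff P(A, B) = P(A) P(B) for every pair of values.
        vals = (True, False)
        pa = {a: sum(joint[(a, b)] for b in vals) for a in vals}
        pb = {b: sum(joint[(a, b)] for a in vals) for b in vals}
        return all(abs(joint[(a, b)] - pa[a] * pb[b]) <= tol
                   for a in vals for b in vals)

    # Two independent fair coin flips:
    print(independent({(a, b): 0.25 for a in (True, False) for b in (True, False)}))  # True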

Conditional Independence

If P(X|Y, z) = P(X|z) and P(X|Y, ¬z) = P(X|¬z), then X is conditionally independent of Y given Z
(i.e. P(X|Y, Z) = P(X|Z)).

If we write out P(Toothache, Cavity, Catch), it has $2^3 - 1 = 7$ independent entries. If we say that Catch (the dentist's probe catching in the tooth) is conditionally independent of Toothache given Cavity, then we can factor the full joint distribution as:

\begin{align} P(Toothache, Catch, Cavity) &= P(Toothache|Catch, Cavity)P(Catch|Cavity)P(Cavity) \\ &= P(Toothache|Cavity)P(Catch|Cavity)P(Cavity) \end{align}

The use of conditional independence commonly reduces the size of the representation of the joint distribution from exponential in the number of variables to linear: here the factored form needs 2 + 2 + 1 = 5 independent numbers instead of 7, and with n conditionally independent effects the count grows as O(n) rather than O(2^n).

Bayes' Rule is linked to conditional independence: the factored joint distribution above is an example of a naive Bayes model, whose general form is:

\begin{align} P(Cause, Effect_1, ..., Effect_n) = P(Cause)\prod_i P(Effect_i|Cause) \end{align}
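As a sanity check (my numbers, derived from the joint table above rather than taken from the notes), the factorisation reproduces a table entry exactly:

    p_cavity          = 0.2          # 0.108 + 0.012 + 0.072 + 0.008
    p_tooth_given_cav = 0.12 / 0.2   # P(toothache | cavity) = 0.6
    p_catch_given_cav = 0.18 / 0.2   # P(catch | cavity)     = 0.9

    # P(toothache, catch, cavity) = P(toothache|cavity) P(catch|cavity) P(cavity)
    print(p_tooth_given_cav * p_catch_given_cav * p_cavity)  # 0.108, matching the table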

Wumpus World

[Figure: a 4×4 Wumpus world. Squares [1,1], [1,2] and [2,1] have been visited; a breeze was observed in [1,2] and [2,1] but not in [1,1].]

Pits are placed randomly, with a probability of 0.2 per square.

We know the following:

  • $breeze = \neg breeze_{1,1} \wedge breeze_{1,2} \wedge breeze_{2,1}$
  • $\text{Known} = \neg pit_{1,1} \wedge \neg pit_{1,2} \wedge \neg pit_{2,1}$
  • Query is $P(Pit_{1,3}|Known, breeze)$
  • Unknown = all the $Pit_{i,j}$ variables other than $Pit_{1,3}$ and the Known squares.

Hence by enumeration we have:

(12)
\begin{align} P(Pit_{1,3}|Known, breeze) =\alpha \sum_{Unknown} P(Pit_{1,3}, Unknown, Known, breeze) \end{align}

This grows exponentially with the number of squares.

Using Conditional Independence

The idea here is that the breeze observations are conditionally independent of the other hidden squares, given the neighbouring hidden squares (the fringe).

[Figure: the unknown squares partitioned into Fringe (unknown squares adjacent to the visited squares) and Other (the rest).]

We can hence define Unknown = Fringe ∪ Other
$P(breeze|Pit_{1,3},Known,Unknown) = P(breeze|Pit_{1,3}, Known, Fringe)$

There's a bunch of maths in manipulating this, but it essentially comes down to:

(13)
\begin{align} P(Pit_{1,3}|Known, breeze) = \alpha' P(Pit_{1,3}) \sum_{Fringe} P(breeze|Known, Pit_{1,3}, Fringe)P(Fringe) \end{align}

Applying this we get the following:

[Figure: the fringe models consistent with the observations: three models containing $Pit_{1,3}$, with weights 0.04, 0.16 and 0.16, and two without it, with weights 0.04 and 0.16.]

$P(Pit_{1,3}|Known,breeze) = \alpha' \langle 0.2(0.04+0.16+0.16),\ 0.8(0.04+0.16) \rangle = \alpha' \langle 0.072,\ 0.16 \rangle$
This comes to a probability of about 0.31 that there's a pit there, and about 0.69 that there isn't.

With $P(Pit_{2,2}|Known,breeze)$ we get a probability of 0.86 that there's a pit there, and 0.14 that there isn't.
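A small Python sketch of the fringe calculation (my reconstruction, assuming the fringe squares are [2,2] and [3,1], pits are placed independently with prior 0.2, and a square is breezy iff some adjacent square has a pit):

    from itertools import product

    P_PIT = 0.2   # prior probability of a pit in any square

    def consistent(pit13, pit22, pit31):
        # Observed: breeze in [1,2] and [2,1] (no breeze in [1,1] holds
        # automatically, since its neighbours are known pit-free).
        breeze12 = pit13 or pit22      # unknown neighbours of [1,2]: [1,3], [2,2]
        breeze21 = pit22 or pit31      # unknown neighbours of [2,1]: [2,2], [3,1]
        return breeze12 and breeze21

    def p_pit13():
        unnorm = {True: 0.0, False: 0.0}
        for pit13, pit22, pit31 in product((True, False), repeat=3):
            if consistent(pit13, pit22, pit31):
                w = 1.0
                for pit in (pit13, pit22, pit31):
                    w *= P_PIT if pit else 1 - P_PIT
                unnorm[pit13] += w
        alpha = 1 / sum(unnorm.values())
        return {v: round(alpha * p, 2) for v, p in unnorm.items()}

    print(p_pit13())   # {True: 0.31, False: 0.69}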