A Markov process formally describes an environment for planning and learning
where the environment is fully observable,
i.e. the current state completely characterises the process (the Markov property)
“The future is independent of the past given the present”
Definition: Markov Property
A state \(S_t\) is Markov if and only if \[ \mathbb{P} [S_{t+1} | S_{t} ] = \mathbb{P} [S_{t+1}\ |\ S_0, \ldots, S_t ] \]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
For a Markov state \(s\) and successor state \(s'\), the state transition probability is defined by
\[ \mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'\ |\ S_t = s] \]
A state transition matrix \(\mathcal{P}\) defines transition probabilities from all states \(s\) to all successor states \(s'\),
\[ \mathcal{P} \;=\; \textit{\textcolor{red}{from}}\; \begin{array}{c} \textit{\textcolor{red}{to}}\\[-0.25em] \left[ \begin{array}{ccc} \mathcal{P}_{00} & \cdots & \mathcal{P}_{0n}\\ \vdots & \ddots & \vdots\\ \mathcal{P}_{n0} & \cdots & \mathcal{P}_{nn} \end{array} \right] \end{array} \]
where each row of the matrix sums to \(1\).
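The row-stochastic property is easy to check in code. A minimal sketch, using a hypothetical two-state weather chain (not from these notes):

```python
import numpy as np

# A made-up two-state chain (0 = Sunny, 1 = Rainy), purely illustrative.
P = np.array([
    [0.9, 0.1],   # row "from Sunny": distribution over successor states
    [0.5, 0.5],   # row "from Rainy"
])

# Every row is a probability distribution over successors, so it sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)
```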
A Markov process is a memoryless random process, i.e. a sequence of random states \(S_0, S_1, \ldots\) with the Markov property.
Definition: Markov Process (or Markov Chain)
A Markov Process (or Markov Chain) is a tuple \(\langle \mathcal{S}, \mathcal{P} \rangle\)
\(\mathcal{S}\) is a (finite) set of states
\(\mathcal{P}\) is a state transition probability matrix,
\(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'\ |\ S_t = s]\)
Sample episodes for the Student Markov Chain (starting from \(S_0 = C1\)): \(S_0, S_1, \ldots, S_T\)
C1 C2 C3 Pass Sleep
C1 In In C1 C2 C3 Pass Sleep
C1 C2 C3 Bar C2 C3 Pass Sleep
C1 In In C1 C2 C3 Bar C1 In In In C1 C2 C3 Bar C2 C3 Pass Sleep
\[ \tiny \mathcal{P} = \begin{bmatrix} & \textit{C1} & \textit{C2} & \textit{C3} & \textit{Pass} & \textit{Bar} & \textit{In} & \textit{Sleep} \\ \textit{C1} & & 0.5 & & & & 0.5 & \\ \textit{C2} & & & 0.8 & & & & 0.2 \\ \textit{C3} & & & & 0.6 & 0.4 & & \\ \textit{Pass} & & & & & & & 1.0 \\ \textit{Bar} & 0.2 & 0.4 & 0.4 & & & & \\ \textit{In} & 0.1 & & & & & 0.9 & \\ \textit{Sleep} & & & & & & & 1 \end{bmatrix} \normalsize \]
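Episodes like those above can be reproduced by sampling from \(\mathcal{P}\). A sketch, assuming the state ordering of the matrix with blank entries read as 0:

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Bar", "In", "Sleep"]
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Bar
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # In
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep
])

def sample_episode(rng, start="C1", terminal="Sleep", max_steps=100):
    """Draw one episode S_0, S_1, ..., S_T by following the transition matrix."""
    s = states.index(start)
    episode = [start]
    while states[s] != terminal and len(episode) <= max_steps:
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

rng = np.random.default_rng(0)
print(" ".join(sample_episode(rng)))   # e.g. an episode ending in Sleep
```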
You will see some subtle differences in how classical planning and reinforcement learning handle the notation for probability.
The following link in the subject’s Ed Discussion forum clarifies the notation used in this subject:
https://edstem.org/au/courses/32360/lessons/102661/slides/705762 (follow the link from Canvas LMS for permission to view lesson).
What problems (how many) can be formalised as Markov processes?
Almost all problems can be formalised as Markov processes
Any non-Markov process can be turned into a Markov process if you define the state richly enough—at the extreme the ‘state’ can be the entire history up to now
This is not practical in all scenarios, but it is surprisingly useful when an appropriate state representation (i.e. the language of states) is learned
MRPs can handle infinite state spaces (using function approximation, which we will learn about in later modules)
MRPs can handle memory, by encoding memory in Markov states
A Markov reward process is a Markov chain with values.
Definition: Markov Reward Process
A Markov Reward Process is a tuple \(\langle \mathcal{S}, \mathcal{P}, \textcolor{blue}{\mathcal{R}}, \textcolor{blue}{\gamma} \rangle\)
\(\mathcal{S}\) is a finite set of states
\(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'\ |\ S_t = s]\)
\(\color{blue}{\mathcal{R}}\) is a reward function, \(\color{blue}{\mathcal{R}(s) = \mathbb{E}[R_{t+1}\ |\ S_t = s]}\)
\(\textcolor{blue}{\gamma}\) is a discount factor, \(\color{blue}{\gamma \in [0, 1]}\)
Definition: Return
The return \(G_t\) is the total discounted reward from time-step \(t\). \[ G_t \;=\; R_{t+1} \;+\; \gamma R_{t+2} \;+\; \dots \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]
The discount \(\gamma \in [0, 1]\) is the present value of future rewards
The value of receiving reward \(R\) after \(k+1\) time-steps is \(\gamma^k R\).
This values immediate reward above delayed reward
\(\gamma\) close to \(0\) leads to “myopic” evaluation
\(\gamma\) close to \(1\) leads to “far-sighted” evaluation
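A minimal sketch of computing \(G_t\) for a finite reward sequence, showing how \(\gamma\) moves between myopic and far-sighted evaluation (the reward sequence is hypothetical):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):          # Horner's rule: G_t = R_{t+1} + gamma G_{t+1}
        g = r + gamma * g
    return g

rewards = [-2, -2, -2, 10]               # a hypothetical episode
print(discounted_return(rewards, 0.0))   # myopic: only the first reward counts, -2.0
print(discounted_return(rewards, 1.0))   # far-sighted: the plain sum, 4.0
```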
Why are most Markov reward/decision processes discounted?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
It is sometimes possible to use undiscounted Markov reward processes (i.e. \(\gamma = 1\)), e.g. if all episodes are guaranteed to terminate
The value function \(v(s)\) gives the long-term value of state \(s\)
Definition: State Value Function
The state value function \(v(s)\) of an MRP is the expected return starting from state \(s\)
\[ v(s) = \mathbb{E}[G_t\ |\ S_t = s] \]
Over how many episodes is expected return computed?
Expected return is over all possible episodes
Sample returns for Student MRP episodes:
Starting from \(S_0 = C1\) with \(\gamma = \frac{1}{2}\)
\[ G_0 = R_1 + \gamma R_2 + \ldots + \gamma^{T-1} R_T \]
C1 C2 C3 Pass Sleep
C1 In In C1 C2 Sleep
C1 C2 C3 Bar C2 C3 Pass Sleep
C1 In In C1 C2 C3 Bar C1 In In In C1 C2 C3 Bar C2 C3 Pass Sleep
Showing just first four samples…
\[\begin{align*} G_0 &= -2 - 2 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{4} + 10 \cdot \tfrac{1}{8} &&= -2.25 \\[6pt] G_0 &= -2 - 1 \cdot \tfrac{1}{2} - 1 \cdot \tfrac{1}{4} - 2 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} &&= -3.125 \\[6pt] G_0 &= -2 - 2 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} \dots &&= -3.41 \\[6pt] G_0 &= -2 - 1 \cdot \tfrac{1}{2} - 1 \cdot \tfrac{1}{4} - 2 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} \dots &&= -3.20 \end{align*}\]
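These sample returns can be checked in code. The rewards below are read off the calculations above (\(-2\) for each class, \(+10\) for Pass, \(+1\) for Bar, \(-1\) for In); treat them as assumptions inferred from those lines:

```python
def G(rewards, gamma):
    """Return G_0 = R_1 + gamma R_2 + ..., computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Episode C1 C2 C3 Pass Sleep: rewards -2, -2, -2, +10 (assumed values), gamma = 1/2
assert G([-2, -2, -2, 10], 0.5) == -2.25

# Episode C1 C2 C3 Bar C2 C3 Pass Sleep: rewards -2, -2, -2, +1, -2, -2, +10
assert G([-2, -2, -2, 1, -2, -2, 10], 0.5) == -3.40625   # shown rounded as -3.41
```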
The value function can be decomposed into two parts:
the immediate reward, \({\color{blue}R_{t+1}}\), and
the discounted value of the successor state, \({\color{blue}\gamma v(S_{t+1})}\), by the law of iterated expectations
\[\begin{align*} v(s) &= \mathbb{E}\!\left[\, G_t \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma G_{t+1} \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, {\color{blue}R_{t+1}} + {\color{blue}\gamma v(S_{t+1})} \;\middle|\; S_t = s \,\right] \end{align*}\]
\[ v(s)= \mathbb{E}[R_{t+1} + \gamma\ v(S_{t+1})\ |\ S_t = s] \]
\[ v(s) \;=\; \mathcal{R}_s \;+\; \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s') \]
In a backup diagram, we take the expectation over successor states: compute \(v(s')\) for each branch, weight it by \(\mathcal{P}_{ss'}\), add the reward, and back up the result to the root
What does the Bellman Equation look like for the Student MRP?
(with no discounting)
The Bellman equation can be expressed concisely using vectors and matrices,
\[ v = \mathcal{R} + \gamma \mathcal{P} v \]
where \(v\) is a column vector with one entry per state
\[ \begin{bmatrix} v(0) \\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_0 \\ \vdots \\ \mathcal{R}_n \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{00} & \cdots & \mathcal{P}_{0n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n0} & \cdots & \mathcal{P}_{nn} \end{bmatrix} \begin{bmatrix} v(0) \\ \vdots \\ v(n) \end{bmatrix} \]
Since the Bellman equation is a linear equation in matrices and vectors,
it can be solved directly to evaluate rewards, assuming the matrix is small enough to invert.
Note that this will no longer be true when we move to Markov decision processes, where we wish to optimise rewards.
\[\begin{align*} v &= \mathcal{R} + \gamma \mathcal{P} v \\[6pt] (I - \gamma \mathcal{P}) v &= \mathcal{R} \\[6pt] v &= (I - \gamma \mathcal{P})^{-1} \mathcal{R} \end{align*}\]
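For a small MRP the direct solve is a one-liner. The sketch below uses the Student transition matrix from earlier and rewards implied by the worked samples (an assumption), with \(\gamma = 0.9\) chosen so that \(I - \gamma\mathcal{P}\) is invertible (at \(\gamma = 1\) the absorbing Sleep state makes it singular):

```python
import numpy as np

# Student MRP (states in matrix order: C1, C2, C3, Pass, Bar, In, Sleep).
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Bar
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # In
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep
])
# Rewards are an assumption, read off the worked sample returns earlier.
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])

# gamma < 1: at gamma = 1 the absorbing Sleep state makes I - P singular.
gamma = 0.9

# Direct solution of v = R + gamma P v, i.e. v = (I - gamma P)^{-1} R.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# The solution satisfies the Bellman equation.
assert np.allclose(v, R + gamma * P @ v)
```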
Note that the computational complexity of inverting this matrix is \(O(n^3)\) for \(n\) states, hence:
We will explore iterative solution methods for doing this more efficiently for large MRPs in future modules, including
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
Imagine that agent A must reach G, moving one cell at a time in a known map
Different combinations of uncertainty and feedback lead to different problems, and hence to two models
Classical planning is typically specified symbolically via
a state-transition model (states, actions, initial state, goal),
STRIPS-style action semantics (how action descriptions change states via preconditions and effects), and
PDDL as the standard syntax for writing the domain and problem that, under the chosen semantics, induce the planning model.
Definition: State-Transition Model
A state-transition model is a tuple \(\langle S, s_0, S_G, A, f, c \rangle\): a finite set of states \(S\), an initial state \(s_0 \in S\), a set of goal states \(S_G \subseteq S\), actions \(A(s) \subseteq A\) applicable in each state \(s\), a deterministic transition function \(s' = f(a, s)\), and action costs \(c(a, s)\).
A solution is a sequence of applicable actions that maps \(s_0\) into \(S_G\); it is optimal if it minimises the sum of action costs (e.g., the number of steps).
Different models and controllers are obtained by relaxing these assumptions…
Definition: STRIPS Problem
A problem in STRIPS is a tuple \(P = \langle F,O,I,G\rangle\):
\(F\) is the set of facts (atoms; boolean variables)
\(O\) is the set of operators (actions)
\(I \subseteq F\) is the initial situation
\(G \subseteq F\) is the goal situation
Operators \(o \in O\) represented by:
the Add list \(Add(o) \subseteq F\)
the Delete list \(Del(o) \subseteq F\)
the Precondition list \(Pre(o) \subseteq F\)
Definition: STRIPS Semantics
A STRIPS problem \(P = \langle F,O,I,G\rangle\) determines a state model \(\mathcal{S}(P)\) where
the states \(s \in S\) are collections of facts (atoms) from \(F\), where \(S = 2^F\)
the initial state \(s_0\) is \(I\)
the goal states \(s\) are such that \(G \subseteq s\)
the actions \(a\) in \(A(s)\) are the operators in \(O\) such that \(Pre(a) \subseteq s\)
the next state is \(s' = s - Del(a) + Add(a)\)
action costs \(c(a,s)\) are all \(1\)
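The semantics translate almost line-for-line into code. A sketch, using a hypothetical operator and facts (not from the notes):

```python
def applicable(state, op):
    """An operator is applicable in state s iff Pre(o) is a subset of s."""
    return op["pre"] <= state

def apply_op(state, op):
    """Successor state: s' = (s - Del(o)) | Add(o)."""
    return (state - op["del"]) | op["add"]

# Hypothetical facts and operator for illustration only.
s0 = frozenset({"at-A", "door-closed"})
open_door = {"pre": frozenset({"door-closed"}),
             "add": frozenset({"door-open"}),
             "del": frozenset({"door-closed"})}

assert applicable(s0, open_door)
s1 = apply_op(s0, open_door)
assert s1 == {"at-A", "door-open"}
```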
Propositions: \(\mathit{on}(x,y)\), \(\mathit{onTable}(x)\), \(\mathit{clear}(x)\), \(\mathit{holding}(x)\), \(\mathit{armEmpty}()\).
Initial state: \(\{\mathit{onTable}(E), \mathit{clear}(E), \dots, \mathit{onTable}(C), \mathit{on}(D,C), \mathit{clear}(D),\)
\(\mathit{armEmpty}()\}\).
Goal: \(\{\mathit{on}(E,C), \mathit{on}(C,A), \mathit{on}(B,D)\}\).
Actions: \(\mathit{stack}(x,y)\), \(\mathit{unstack}(x,y)\), \(\mathit{putdown}(x)\), \(\mathit{pickup}(x)\).
\(\mathit{stack}(x,y)\):
\(pre: \{\mathit{holding}(x),\mathit{clear}(y)\}\)
\(add: \{\mathit{on}(x,y),\mathit{armEmpty}(),\mathit{clear}(x)\}\ \ \ del: \{\mathit{holding}(x),\mathit{clear}(y)\}.\)
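A sketch of \(\mathit{stack}(x,y)\) as code, applied under the STRIPS semantics above; the three-fact state is hypothetical:

```python
def stack(x, y):
    """Ground instance of the stack(x, y) operator listed above."""
    return {"pre": frozenset({f"holding({x})", f"clear({y})"}),
            "add": frozenset({f"on({x},{y})", "armEmpty()", f"clear({x})"}),
            "del": frozenset({f"holding({x})", f"clear({y})"})}

# A hypothetical state in which the arm holds D and C is clear.
s = frozenset({"holding(D)", "clear(C)", "onTable(C)"})
o = stack("D", "C")

assert o["pre"] <= s                       # applicable: Pre(o) is a subset of s
s2 = (s - o["del"]) | o["add"]             # s' = s - Del(a) + Add(a)
assert "on(D,C)" in s2 and "holding(D)" not in s2
```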
PDDL is not a propositional language: it has typed variables, which are grounded (instantiated) to obtain a STRIPS-like problem.
A PDDL planning task comes in two pieces: the domain and the problem.
(define (domain logistics)
(:requirements :strips :typing :equality)
(:types airport - location truck airplane - vehicle vehicle packet - thing ..)
(:predicates (loc-at ?x - location ?y - city) (at ?x - thing ?y - location) ...)
(:action load
:parameters (?x - packet ?y - vehicle ?z - location)
:precondition (and (at ?x ?z) (at ?y ?z))
:effect (and (not (at ?x ?z)) (in ?x ?y)))
(:action unload ..)
(:action drive
:parameters (?x - truck ?y - location ?z - location ?c - city)
:precondition (and (loc-at ?z ?c) (loc-at ?y ?c) (not (= ?z ?y)) (at ?x ?z))
:effect (and (not (at ?x ?z)) (at ?x ?y)))
...
(define (problem log3_2)
(:domain logistics)
(:objects packet1 packet2 - packet truck1 truck2 truck3 - truck airplane1 ...)
(:init (at packet1 office1) (at packet2 office3) ...)
(:goal (and (at packet1 office2) (at packet2 office2))))

It depends on whether we know the propositions, states and/or the probabilities…
If we know the states and probabilities: MRP
If we know the propositions (and details of STRIPS model): Classical Planning
If we don’t know the propositions and/or the probabilities, we can typically learn them (later modules)
If we know propositions and probabilities: We can use both MRPs and STRIPS/PDDL (we will see integrated learning and planning in later modules)
Propositions: \(\mathit{studied}(x)\), \(\mathit{bar}()\), \(\mathit{insta}()\), \(\mathit{sleeping}()\), \(\mathit{passed()}\)
Initial state: \(\mathit{insta()}\)
Goal: \(\mathit{passed()}\), \(\mathit{sleeping()}\)
Actions (operators): \(\mathit{sleep}\), \(\mathit{study}(x)\), \(\mathit{scroll}\), \(\mathit{gotoBar}\), \(\mathit{sitExam}\)
Propositions: ?
Initial state: ?
Goal: ?
Actions (operators): ?
| Markov reward process (MRP) | Classical Planning |
|---|---|
| transition probabilities | preconditions, add list & delete list |
| states | facts (atoms) / propositions |
| stochastic (non-deterministic) actions | deterministic actions |
| policies | initial state & plans |
| rewards | goals |
For a categorisation of agents, see the Appendix.
Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto, Second Edition, MIT Press (2020) (available for download at http://www.incompleteideas.net/book/RLbook2020.pdf)
Planning textbooks at https://comp90054.github.io/reinforcement-learning/
Everything You Always Wanted to Know About Planning (But Were Afraid to Ask), Joerg Hoffmann (2011).
Available at: http://fai.cs.uni-saarland.de/hoffmann/papers/ki11.pdf
Content: Joerg’s personal perspective on planning, for example excerpt from the abstract: “What is it good for? Does it work? Is it interesting to do research in?”
Introduction to STRIPS, from simple games to StarCraft:
http://www.primaryobjects.com/2015/11/06/artificial-intelligence-planning-with-strips-a-gentle-introduction/
Online/Offline editor to model in PDDL: http://editor.planning.domains
VS Code PDDL extension: https://marketplace.visualstudio.com/items?itemName=jan-dolejsi.pddl