01 Modelling: Markov Processes versus Classical Planning, STRIPS & PDDL
Markov Processes
Introduction to Markov Processes
A Markov process formally describes an environment for planning and learning
Where the environment is fully observable
I.e. The current state completely characterises the process (the Markov condition)
Markov Property (Revisited)
“The future is independent of the past given the present”
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
State Transition Matrix
For a Markov state \(s\) and successor state \(s'\), the state transition probability is defined by
\[ \mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'\ |\ S_t = s] \]
A state transition matrix \(\mathcal{P}\) defines transition probabilities from all states \(s\) to all successor states \(s'\),
\[ \mathcal{P} \;=\; \textit{\textcolor{red}{from}}\; \begin{array}{c} \textit{\textcolor{red}{to}}\\[-0.25em] \left[ \begin{array}{ccc} \mathcal{P}_{00} & \cdots & \mathcal{P}_{0n}\\ \vdots & \ddots & \vdots\\ \mathcal{P}_{n0} & \cdots & \mathcal{P}_{nn} \end{array} \right] \end{array} \]
where each row of the matrix sums to \(1\).
Markov Processes
A Markov process is a memoryless random process, i.e. a sequence of random states \(S_0, S_1, \ldots\) with the Markov property.
Example: Student Markov Chain
Example: Student Markov Chain Episodes
Sample episodes for Student Markov Chain (starting from \(S_0 = C1\)): \(S_0, S_1, \ldots, S_T\)
C1 C2 C3 Pass Sleep
C1 In In C1 C2 C3 Pass Sleep
C1 C2 C3 Bar C2 C3 Pass Sleep
C1 In In C1 C2 C3 Bar C1 In In In C1 C2 C3 Bar C2 C3 Pass Sleep
Example: Student Markov Chain Transition Matrix
\[ \small \mathcal{P} = \begin{bmatrix} & \textit{C1} & \textit{C2} & \textit{C3} & \textit{Pass} & \textit{Bar} & \textit{In} & \textit{Sleep} \\ \textit{C1} & & 0.5 & & & & 0.5 & \\ \textit{C2} & & & 0.8 & & & & 0.2 \\ \textit{C3} & & & & 0.6 & 0.4 & & \\ \textit{Pass} & & & & & & & 1.0 \\ \textit{Bar} & 0.2 & 0.4 & 0.4 & & & & \\ \textit{In} & 0.1 & & & & & 0.9 & \\ \textit{Sleep} & & & & & & & 1 \end{bmatrix} \normalsize \]
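As a sketch, the Student Markov chain above can be encoded as a NumPy transition matrix and sampled directly; the state names and ordering follow the matrix above, and the empty entries are zeros (this encoding is illustrative, not part of the slides):

```python
import numpy as np

# States in the order used in the transition matrix above.
states = ["C1", "C2", "C3", "Pass", "Bar", "In", "Sleep"]

# Transition probabilities P[s, s'] for the Student Markov chain;
# blank entries in the slide matrix are zeros.
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Bar
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # In
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (absorbing)
])

def sample_episode(rng, start="C1", max_steps=100):
    """Sample one episode until the absorbing Sleep state (or max_steps)."""
    s = states.index(start)
    episode = [states[s]]
    for _ in range(max_steps):
        if states[s] == "Sleep":
            break
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

rng = np.random.default_rng(0)
print(sample_episode(rng))
```

Each row of `P` sums to 1, matching the stochastic-matrix condition stated earlier.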
Notation for Probability
You will see some subtle differences in how classical planning and reinforcement learning handle the notation for probability.
The following link in the subject’s Ed Discussion forum clarifies the notation used in this subject:
https://edstem.org/au/courses/32360/lessons/102661/slides/705762 (follow the link from Canvas LMS for permission to view lesson).
Markov Processes?
What problems (and how many) can be formalised as Markov processes?
Almost all problems can be formalised as Markov processes
Any non-Markov process can be turned into a Markov process if you define the state richly enough—at the extreme the ‘state’ can be the entire history up to now
This is not practical in all scenarios, but it is surprisingly useful when an appropriate state representation (i.e. the language of states) is learned
MRPs can handle infinite state spaces (using function approximation, which we will learn about in later modules)
MRPs can handle memory, by encoding memory in Markov states
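The last point above can be sketched concretely: a toy process whose next observation depends on the last *two* observations is not Markov in the raw observation, but becomes Markov once the state is defined as the pair of recent observations (a hypothetical toy process, not from the slides):

```python
import itertools

# Toy second-order process over observations {0, 1}: the next observation is
# the XOR of the previous two, so the raw observation alone is not Markov.
def next_obs(prev2, prev1):
    return prev2 ^ prev1

# Encode memory into the state: the Markov state is the pair (prev2, prev1).
# The transition is then a function of the current state alone.
def step(state):
    prev2, prev1 = state
    obs = next_obs(prev2, prev1)
    return (prev1, obs)  # new state again holds the last two observations

# Every pair-state has a well-defined successor: the pair-process is Markov.
transitions = {s: step(s) for s in itertools.product([0, 1], repeat=2)}
print(transitions)
```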
Markov Reward Processes
A Markov reward process is a Markov chain with values.
Example: Student Markov Reward Process (MRP)
Return
Why discount?
Why are most Markov reward/decision processes discounted?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
Why discount (continued)?
It is sometimes possible to use undiscounted Markov reward processes (i.e. \(\gamma = 1\)),
- e.g. if we can guarantee that all sequences terminate.
Value Function
The value function \(v(s)\) gives the long-term value of state \(s\): \(v(s) = \mathbb{E}[G_t \mid S_t = s]\)
Over how many episodes is expected return computed?
Expected return is over all possible episodes
Example: Student MRP Returns
Sample returns for Student MRP episodes:
Starting from \(S_0 = C1\) with \(\gamma = \frac{1}{2}\)
\[ G_0 = R_1 + \gamma R_2 + \ldots + \gamma^{T-1} R_T \]
C1 C2 C3 Pass Sleep:
C1 In In C1 C2 Sleep:
C1 C2 C3 Bar C2 C3 Pass Sleep:
C1 In In C1 C2 C3 Bar C1 \(\ldots\) In In In C1 C2 C3 Bar C2 C3 Pass Sleep:
(showing the first four samples)
\[\begin{align*} G_0 &= -2 - 2 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{4} + 10 \cdot \tfrac{1}{8} &&= -2.25 \\[6pt] G_0 &= -2 - 1 \cdot \tfrac{1}{2} - 1 \cdot \tfrac{1}{4} - 2 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} &&= -3.125 \\[6pt] G_0 &= -2 - 2 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} - \dots &&= -3.41 \\[6pt] G_0 &= -2 - 1 \cdot \tfrac{1}{2} - 1 \cdot \tfrac{1}{4} - 2 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} - \dots &&= -3.20 \end{align*}\]
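The sample-return calculations above can be checked with a few lines of Python; the reward values are read off from those calculations:

```python
def discounted_return(rewards, gamma):
    """G_0 = R_1 + gamma*R_2 + ... + gamma^(T-1)*R_T."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Episode C1 C2 C3 Pass Sleep with gamma = 1/2:
# rewards received on leaving C1, C2, C3 and Pass.
G = discounted_return([-2, -2, -2, 10], gamma=0.5)
print(G)  # -2 - 2*(1/2) - 2*(1/4) + 10*(1/8) = -2.25
```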
Example: State-Value Function for Student MRP (1)
Example: State-Value Function for Student MRP (2)
Example: State-Value Function for Student MRP (3)
Bellman Equation
Bellman Equation for MRPs
The value function can be decomposed into two parts:
the immediate reward, \({\color{blue}R_{t+1}}\), and
the discounted value of the successor state, \({\color{blue}\gamma v(S_{t+1})}\), by the law of iterated expectations
\[\begin{align*} v(s) &= \mathbb{E}\!\left[\, G_t \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma G_{t+1} \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, {\color{blue}R_{t+1}} + {\color{blue}\gamma v(S_{t+1})} \;\middle|\; S_t = s \,\right] \end{align*}\]
Bellman Equation for MRPs (look ahead)
\[ v(s)= \mathbb{E}[R_{t+1} + \gamma\ v(S_{t+1})\ |\ S_t = s] \]
\[ v(s) \ =\ \mathcal{R}_s \ +\ \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s') \]
In a backup diagram, we take the expectation by weighting each successor value \(v(s')\) by its transition probability: add up the contribution of each branch and back the result up to the root
Example: Bellman Equation for Student MRP
What does the Bellman Equation look like for the Student MRP?
Example: Computing Bellman Equation via Backup Diagrams
(with no discounting)
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using vectors and matrices,
\[ v = \mathcal{R} + \gamma \mathcal{P} v \]
where \(v\) is a column vector with one entry per state
\[ \begin{bmatrix} v(0) \\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_0 \\ \vdots \\ \mathcal{R}_n \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{00} & \cdots & \mathcal{P}_{0n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n0} & \cdots & \mathcal{P}_{nn} \end{bmatrix} \begin{bmatrix} v(0) \\ \vdots \\ v(n) \end{bmatrix} \]
Solving the Bellman Equation
Since the Bellman equation is a linear equation in \(v\),
it can be solved directly for evaluating rewards, assuming the matrix is small enough to invert
Note that this won't be true when we move to Markov decision processes, where we wish to optimise rewards
\[\begin{align*} v &= \mathcal{R} + \gamma \mathcal{P} v \\[6pt] (I - \gamma \mathcal{P}) v &= \mathcal{R} \\[6pt] v &= (I - \gamma \mathcal{P})^{-1} \mathcal{R} \end{align*}\]
Note that the computational complexity of inverting this matrix is \(O(n^3)\) for \(n\) states, hence:
- Direct solution only possible for small MRPs
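Under the reward values implied by the sample-return calculations earlier (C1, C2, C3 = −2; Pass = +10; Bar = +1; In = −1; Sleep = 0 — assumed here, since the slides give them only implicitly), the direct solution \(v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}\) is a short NumPy sketch, with \(\gamma = \tfrac{1}{2}\) as in the return examples:

```python
import numpy as np

# Student MRP transition matrix (rows/cols: C1, C2, C3, Pass, Bar, In, Sleep).
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Bar
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # In
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep
])
# Rewards inferred from the sample-return calculations (assumed values).
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
gamma = 0.5

# Solve (I - gamma * P) v = R directly -- O(n^3), so only for small MRPs.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(dict(zip(["C1", "C2", "C3", "Pass", "Bar", "In", "Sleep"], v.round(3))))
```

The result satisfies the Bellman equation \(v = \mathcal{R} + \gamma \mathcal{P} v\) by construction, and the absorbing Sleep state has value 0.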
We will explore iterative solution methods for doing this more efficiently for large MRPs in future modules, including
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
Modelling Languages: STRIPS & Planning Domain Definition Language (PDDL)
Consider Example Problem:
Imagine that agent A must reach G, moving one cell at a time in a known map
- If the actions are stochastic and the locations (states) are observable, the problem is a Markov Decision Process (MDP)
- If the actions are deterministic and only the initial location is known, the problem is a Classical Planning Problem
Different combinations of uncertainty and feedback lead to two different problems, and hence two different models
STRIPS & Planning Domain Definition Language (PDDL)
Classical planning is typically specified symbolically via
a state-transition model (states, actions, initial state, goal),
STRIPS-style action semantics (how action descriptions change states via preconditions and effects), and
PDDL as the standard syntax for writing the domain + problem that, under the chosen semantics, induces the planning model.
State-Transition Model
A solution is a sequence of applicable actions that maps \(s_0\) into \(S_G\); it is optimal if it minimises the sum of action costs (e.g., the number of steps)
Different models and controllers obtained by relaxing assumptions in blue …
A Language for Classical Planning: STRIPS
From Language to Models (STRIPS Semantics)
- An optimal solution of \(P\) is an optimal solution of the state model \(\mathcal{S}(P)\)
- Language extensions often convenient: negation, conditional effects, non-boolean variables, etc.
Example: The Blocksworld
Propositions: \(\mathit{on}(x,y)\), \(\mathit{onTable}(x)\), \(\mathit{clear}(x)\), \(\mathit{holding}(x)\), \(\mathit{armEmpty}()\).
Initial state: \(\{\mathit{onTable}(E), \mathit{clear}(E), \dots, \mathit{onTable}(C), \mathit{on}(D,C), \mathit{clear}(D), \mathit{armEmpty}()\}\).
Goal: \(\{\mathit{on}(E,C), \mathit{on}(C,A), \mathit{on}(B,D)\}\).
Actions: \(\mathit{stack}(x,y)\), \(\mathit{unstack}(x,y)\), \(\mathit{putdown}(x)\), \(\mathit{pickup}(x)\).
\(\mathit{stack}(x,y)\)?
\(pre: \{\mathit{holding}(x),\mathit{clear}(y)\}\)
\(add: \{\mathit{on}(x,y),\mathit{armEmpty}(),\mathit{clear}(x)\}\ \ \ del: \{\mathit{holding}(x),\mathit{clear}(y)\}.\)
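STRIPS semantics can be sketched directly in code: an action is applicable when its preconditions hold in the current state, and the successor state is obtained by removing the delete list and adding the add list. This is a hypothetical mini-implementation, with the \(\mathit{stack}(x,y)\) schema above ground on blocks E and C:

```python
# Minimal STRIPS progression: states are sets of ground propositions.
def applicable(state, pre):
    """An action is applicable iff all its preconditions hold in the state."""
    return pre <= state

def apply_action(state, pre, add, delete):
    """Successor state: remove the delete list, then add the add list."""
    assert applicable(state, pre), "preconditions not satisfied"
    return (state - delete) | add

# Ground instance stack(E, C) of the schema above.
pre = {"holding(E)", "clear(C)"}
add = {"on(E,C)", "armEmpty()", "clear(E)"}
delete = {"holding(E)", "clear(C)"}

state = {"holding(E)", "clear(C)", "onTable(C)"}
successor = apply_action(state, pre, add, delete)
print(successor)
```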
PDDL Quick Facts
PDDL is not a propositional language:
- Representation is lifted, using object variables which are instantiated from a finite set of objects (similar to predicate logic which quantifies over variables).
- Action schemas are parameterized by objects.
- Predicates are to be instantiated with objects.
A PDDL planning task comes in two pieces:
- The domain file and the problem file.
- The problem file gives the objects, the initial state, and the goal state.
- The domain file gives the predicates and the operators; a single domain file is typically shared by many problem files.
The Blocksworld in PDDL: Domain File
(define (domain blocksworld)
(:predicates (clear ?x) (holding ?x) (on ?x ?y)
(on-table ?x) (arm-empty))
(:action stack
:parameters (?x ?y)
:precondition (and (clear ?y) (holding ?x))
:effect (and (arm-empty) (on ?x ?y)
(not (clear ?y)) (not (holding ?x)))
))

The Blocksworld in PDDL: Problem File
(define (problem bw-abcde)
(:domain blocksworld)
(:objects a b c d e)
(:init (on-table a) (clear a)
(on-table b) (clear b)
(on-table e) (clear e)
(on-table c) (on d c) (clear d)
(arm-empty))
(:goal (and (on e c) (on c a) (on b d))))

Example: Logistics in STRIPS PDDL
(define (domain logistics)
(:requirements :strips :typing :equality)
(:types airport - location truck airplane - vehicle vehicle packet - thing ..)
(:predicates (loc-at ?x - location ?y - city) (at ?x - thing ?y - location) ...)
(:action load
:parameters (?x - packet ?y - vehicle ?z - location)
:precondition (and (at ?x ?z) (at ?y ?z))
:effect (and (not (at ?x ?z)) (in ?x ?y)))
(:action unload ..)
(:action drive
:parameters (?x - truck ?y - location ?z - location ?c - city)
:precondition (and (loc-at ?z ?c) (loc-at ?y ?c) (not (= ?z ?y)) (at ?x ?z))
:effect (and (not (at ?x ?z)) (at ?x ?y)))
...
(define (problem log3_2)
(:domain logistics)
(:objects packet1 packet2 - packet truck1 truck2 truck3 - truck airplane1 ...)
(:init (at packet1 office1) (at packet2 office3) ...)
(:goal (and (at packet1 office2) (at packet2 office2))))

When is a problem MRP or STRIPS/PDDL?
It depends on whether we know the propositions, the states, and/or the probabilities…
If we know the states and probabilities: MRP
If we know the propositions (and details of STRIPS model): Classical Planning
If we don’t know the propositions and/or the probabilities, we can typically learn them (later modules)
If we know propositions and probabilities: We can use both MRPs and STRIPS/PDDL (we will see integrated learning and planning in later modules)
Example: STRIPS version of Student?
Propositions: \(\mathit{studied}(x)\), \(\mathit{bar}()\), \(\mathit{insta}()\), \(\mathit{sleeping}()\), \(\mathit{passed()}\)
Initial state: \(\mathit{insta()}\)
Goal: \(\mathit{passed()}\), \(\mathit{sleeping()}\)
Actions (operators): \(\mathit{sleep}\), \(\mathit{study(x)}\), \(\mathit{scroll}\), \(\mathit{gotoBar}\), \(\mathit{sitExam}\)
Summary
Difference between modelling as MRPs versus Classical Planning?
- Classical search problems require finding a path of actions leading from an initial state to a goal state.
- They assume a single-agent, fully-observable, deterministic, static environment.
- STRIPS is a simple but reasonably expressive language: Boolean facts + actions with precondition, add list, delete list.
- PDDL is the de facto standard language for describing planning problems.
Difference between modelling as MRPs versus Classical Planning?
- Markov processes require transition probabilities from state to state.
- They do not assume a single agent - they include both agent and environment transitions, or even multi-agent transitions.
- They capture non-deterministic, dynamic environments, and partial observability is easily included.
- Dynamics can either be learned (model-free) or provided (model-based).
Difference between MRPs & Classical Planning: Table
| Markov reward process (MRP) | Classical Planning |
|---|---|
| probabilities | preconditions, add list & delete list |
| states | facts (atoms) / propositions |
| non-deterministic actions | deterministic actions |
| policies | initial state & plans |
| actions are stochastic (stochastic policies) | actions are deterministic |
| rewards | goals |
Categorising Agents
For a categorisation of agents, see the Appendix (Reference Topics).
Reading - Markov Processes & Markov Reward Processes
Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto, Second Edition, MIT Press (2020) (available for download at http://www.incompleteideas.net/book/RLbook2020.pdf)
- See Chapter 3 Finite Markov Decision Processes
Reading - Classical Planning
Planning textbooks at https://comp90054.github.io/reinforcement-learning/
Everything You Always Wanted to Know About Planning (But Were Afraid to Ask) Joerg Hoffmann (2011).
Available at: http://fai.cs.uni-saarland.de/hoffmann/papers/ki11.pdf
Content: Joerg’s personal perspective on planning, for example excerpt from the abstract: “What is it good for? Does it work? Is it interesting to do research in?”
Additional Reading - Classical Planning
Introduction to STRIPS, from simple games to StarCraft:
http://www.primaryobjects.com/2015/11/06/artificial-intelligence-planning-with-strips-a-gentle-introduction/

Online/Offline editor to model in PDDL: http://editor.planning.domains
VS Code PDDL extension: https://marketplace.visualstudio.com/items?itemName=jan-dolejsi.pddl