7.2 Introduction to MDPs


Introduction to MDPs

Markov decision processes formally describe an environment for reinforcement learning in which the environment is fully observable, i.e. the current state completely characterises the process.

Almost all RL problems can be formalised as MDPs, e.g.

  • Optimal control primarily deals with continuous MDPs
  • Partially observable problems can be converted into MDPs
  • Bandits (from machine learning) are MDPs with just one state

Markov Property

“The future is independent of the past given the present”

Definition:

A state \(S_t\) is Markov if and only if \[ \mathbb{P} [S_{t+1}\ |\ S_t ] = \mathbb{P} [S_{t+1}\ |\ S_1, \ldots, S_t ] \]

The state captures all relevant information from the history

  • Once the state is known, the history may be thrown away

  • i.e. The state is a sufficient statistic of the future

State Transition Matrix

For a Markov state \(s\) and successor state \(s'\), the state transition probability is defined by

\[ \mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'\ |\ S_t = s] \]


A state transition matrix \(\mathcal{P}\) defines transition probabilities from all states \(s\) to all successor states \(s'\),

\[ \mathcal{P} \;=\; \textit{\textcolor{red}{from}}\; \begin{array}{c} \textit{\textcolor{red}{to}}\\[-0.25em] \left[ \begin{array}{ccc} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n}\\ \vdots & \ddots & \vdots\\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{array} \right] \end{array} \]

where each row of the matrix sums to \(1\).
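As a small illustration (my own sketch, not from the slides), a transition matrix for a hypothetical 3-state chain can be written down and checked in numpy:

```python
import numpy as np

# Hypothetical 3-state Markov chain; rows index the "from" state s,
# columns the "to" state s', so P[s, s'] = P(S_{t+1} = s' | S_t = s).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing (terminal) state
])

# Each row is a probability distribution over successor states.
assert np.allclose(P.sum(axis=1), 1.0)
```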

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states \(S_1, S_2, \ldots\) with the Markov property.

Definition

A Markov Process (or Markov Chain) is a tuple \(\langle \mathcal{S}, \mathcal{P} \rangle\)

  • \(\mathcal{S}\) is a (finite) set of states

  • \(\mathcal{P}\) is a state transition probability matrix,

\(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'\ |\ S_t = s]\)

Example: Student Markov Chain

Example: Student Markov Chain Episodes

Sample episodes for the Student Markov Chain, starting from \(S_1 = C1\): \(S_1, S_2, \ldots, S_T\)

  • C1 C2 C3 Pass Sleep

  • C1 In In C1 C2 Sleep

  • C1 C2 C3 Bar C2 C3 Pass Sleep

  • C1 In In C1 C2 C3 Bar C1 In In In C1 C2 C3 Bar C2 Sleep

Example: Student Markov Chain Transition Matrix

\[ \mathcal{P} \;=\; \begin{array}{c|ccccccc} & \textit{C1} & \textit{C2} & \textit{C3} & \textit{Pass} & \textit{Bar} & \textit{In} & \textit{Sleep} \\ \hline \textit{C1} & & 0.5 & & & & 0.5 & \\ \textit{C2} & & & 0.8 & & & & 0.2 \\ \textit{C3} & & & & 0.6 & 0.4 & & \\ \textit{Pass} & & & & & & & 1.0 \\ \textit{Bar} & 0.2 & 0.4 & 0.4 & & & & \\ \textit{In} & 0.1 & & & & & 0.9 & \\ \textit{Sleep} & & & & & & & 1.0 \end{array} \]
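To make sampling from this chain concrete, here is a minimal sketch (mine, not part of the original slides) that encodes the matrix above and draws episodes starting from C1, in the spirit of the sample episodes listed earlier:

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Bar", "In", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Bar   In  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Bar
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # In
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep (terminal)
])

rng = np.random.default_rng(0)

def sample_episode(start="C1", terminal="Sleep"):
    """Sample one episode of the Student Markov Chain."""
    s = states.index(start)
    episode = [states[s]]
    while states[s] != terminal:
        s = rng.choice(len(states), p=P[s])   # draw the successor state
        episode.append(states[s])
    return episode

print(" ".join(sample_episode()))  # prints one sampled episode ending in Sleep
```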

Markov Reward Processes

A Markov reward process is a Markov chain with values.

Definition

A Markov Reward Process is a tuple \(\langle \mathcal{S}, \mathcal{P}, \textcolor{red}{\mathcal{R}}, \textcolor{red}{\gamma} \rangle\)

  • \(\mathcal{S}\) is a finite set of states

  • \(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'\ |\ S_t = s]\)

  • \(\color{red}{\mathcal{R}}\) is a reward function, \(\color{red}{\mathcal{R}_s = \mathbb{E}[R_{t+1}\ |\ S_t = s]}\)

  • \(\textcolor{red}{\gamma}\) is a discount factor, \(\color{red}{\gamma \in [0, 1]}\)

Example: Student MRP

Return

Definition

The return \(G_t\) is the total discounted reward from time-step \(t\),

\[ G_t \;=\; R_{t+1} \;+\; \gamma R_{t+2} \;+\; \dots \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]



  • The discount \(\gamma \in [0, 1]\) is the present value of future rewards

  • The value of receiving reward \(R\) after \(k+1\) time-steps is \(\gamma^k R\).

  • This values immediate reward above delayed reward

    • \(\gamma\) close to \(0\) leads to “myopic” evaluation

    • \(\gamma\) close to \(1\) leads to “far-sighted” evaluation
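A tiny worked sketch (mine, not from the slides) of the definition, showing how the discount controls how much a delayed reward is worth:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# A reward of 5 received after 3 time-steps (k = 2) is worth gamma^2 * 5:
print(discounted_return([0, 0, 5], gamma=0.9))   # far-sighted: about 4.05
print(discounted_return([0, 0, 5], gamma=0.1))   # myopic: about 0.05
```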

Why discount?

Most Markov reward and decision processes are discounted. Why?

  • Mathematically convenient to discount rewards

  • Avoids infinite returns in cyclic Markov processes

  • Uncertainty about the future may not be fully represented

  • If the reward is financial, immediate rewards may earn more interest than delayed rewards

  • Animal/human behaviour shows preference for immediate reward


  • It is sometimes possible to use undiscounted Markov reward processes (i.e. \(\gamma = 1\)), e.g. if we can guarantee that all sequences terminate.

Value Function

The value function \(v(s)\) gives the long-term value of state \(s\)

Definition

The state value function \(v(s)\) of an MRP is the expected return starting from state \(s\)

\[ v(s) = \mathbb{E}[G_t\ |\ S_t = s] \]

Note that the expectation is taken over all possible episodes starting from \(s\)

Example: Student MRP Returns

Sample returns for Student MRP episodes:

Starting from \(S_1 = C1\) with \(\gamma = \frac{1}{2}\)

\[ G_1 = R_2 + \gamma R_3 + \ldots + \gamma^{T-2} R_T \]

C1 C2 C3 Pass Sleep

C1 In In C1 C2 Sleep

C1 C2 C3 Bar C2 C3 Pass Sleep

C1 In In C1 C2 C3 Bar C1 In In In C1 C2 C3 Bar C2 Sleep

The corresponding returns for these four samples are:

\[\begin{align*} G_1 &= -2 - 2 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{4} + 10 \cdot \tfrac{1}{8} &&= -2.25 \\[6pt] G_1 &= -2 - 1 \cdot \tfrac{1}{2} - 1 \cdot \tfrac{1}{4} - 2 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} &&= -3.125 \\[6pt] G_1 &= -2 - 2 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{4} + 1 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} \dots &&\approx -3.41 \\[6pt] G_1 &= -2 - 1 \cdot \tfrac{1}{2} - 1 \cdot \tfrac{1}{4} - 2 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} \dots &&\approx -3.20 \end{align*}\]
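These numbers can be reproduced with a short sketch (mine); the per-state rewards used here (C1, C2, C3: \(-2\); In: \(-1\); Bar: \(+1\); Pass: \(+10\); Sleep: \(0\)) are read off from the calculation above rather than stated explicitly in the text:

```python
# Per-state rewards as implied by the worked returns above.
R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Bar": 1, "In": -1, "Sleep": 0}
gamma = 0.5

def episode_return(episode, gamma):
    """G_1 = R_2 + gamma * R_3 + ... ; reward R_{t+1} is received on leaving S_t."""
    return sum(gamma**k * R[s] for k, s in enumerate(episode[:-1]))

print(episode_return("C1 C2 C3 Pass Sleep".split(), gamma))    # -2.25
print(episode_return("C1 In In C1 C2 Sleep".split(), gamma))   # -3.125
```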

Example: State-Value Function for Student MRP (1)

Example: State-Value Function for Student MRP (2)

Example: State-Value Function for Student MRP (3)

Bellman Equation for MRPs

The value function can be decomposed into two parts:

  • the immediate reward, \(R_{t+1}\), and

  • the discounted value of the successor state, \(\gamma\ v(S_{t+1})\) (the final step of the derivation below uses the law of iterated expectations)

\[\begin{align*} v(s) &= \mathbb{E}\!\left[\, G_t \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma G_{t+1} \;\middle|\; S_t = s \,\right] \\[2pt] &= \mathbb{E}\!\left[\, R_{t+1} + \gamma v(S_{t+1}) \;\middle|\; S_t = s \,\right] \end{align*}\]

Bellman Equation for MRPs (look ahead)

\[ v(s)= \mathbb{E}[R_{t+1} + \gamma\ v(S_{t+1})\ |\ S_t = s] \]

\[ v(s) \ =\ \mathcal{R}_s \ +\ \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s') \]

In the backup diagram we average over the successor states, weighting each \(v(s')\) by its transition probability \(\mathcal{P}_{ss'}\)

Example: Bellman Equation for Student MRP

(with no discounting, i.e. \(\gamma = 1\))

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using vectors and matrices,

\[ v = \mathcal{R} + \gamma \mathcal{P} v \]

where \(v\) is a column vector with one entry per state

\[ \begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix} \begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} \]

Solving the Bellman Equation

The Bellman equation is a linear equation

  • It can be solved directly when we are only evaluating rewards, assuming the matrix is small enough to invert (note that this won’t be the case when we move to Markov Decision Processes, where we wish to optimise rewards)

\[\begin{align*} v &= \mathcal{R} + \gamma \mathcal{P} v \\[6pt] (I - \gamma \mathcal{P}) v &= \mathcal{R} \\[6pt] v &= (I - \gamma \mathcal{P})^{-1} \mathcal{R} \end{align*}\]


The computational complexity of inverting this matrix is \(O(n^3)\) for \(n\) states, so the direct solution is only possible for small MRPs
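A minimal numpy sketch (mine, not from the slides) of this direct solution for the Student MRP, using \(\gamma = 0.5\) and the per-state rewards implied by the returns example earlier:

```python
import numpy as np

# Student MRP: transition matrix from the earlier slide and the rewards
# implied by the worked returns (C1, C2, C3: -2; Pass: +10; Bar: +1; In: -1).
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Bar
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # In
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep
])
R = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)
gamma = 0.5

# v = (I - gamma * P)^{-1} R  -- solve the linear system instead of inverting.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(dict(zip(["C1", "C2", "C3", "Pass", "Bar", "In", "Sleep"], v.round(2))))
```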


There are many iterative methods for doing this more efficiently for large MRPs, e.g.

  • Dynamic programming

  • Monte-Carlo evaluation

  • Temporal-Difference learning
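These methods are covered later; as a rough preview (a sketch of my own, not the algorithms themselves), even naively iterating the backup \(v \leftarrow \mathcal{R} + \gamma \mathcal{P} v\) converges to the same solution without any matrix inversion:

```python
import numpy as np

def evaluate_mrp(P, R, gamma, n_sweeps=100):
    """Iteratively apply the Bellman backup v <- R + gamma * P v."""
    v = np.zeros_like(R, dtype=float)
    for _ in range(n_sweeps):
        v = R + gamma * P @ v
    return v

# With the P, R and gamma from the previous sketch, evaluate_mrp(P, R, gamma)
# approaches the direct solution (I - gamma * P)^{-1} R as n_sweeps grows.
```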

Markov Decision Processes

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions.

  • It is an environment in which all states are Markov.

  • We now introduce agency, in the form of actions.


Definition

A Markov Decision Process is a tuple \(\langle \mathcal{S}, \textcolor{red}{\mathcal{A}}, \mathcal{P}, \mathcal{R}, \gamma \rangle\)

  • \(\mathcal{S}\) is a finite set of states

  • \(\mathcal{\color{red}{A}}\) is a finite set of actions

  • \(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}^{\textcolor{red}{a}}_{ss'} = \mathbb{P} [S_{t+1} = s'\ |\ S_t = s, A_t = \textcolor{red}{a}]\)

  • \(\mathcal{R}\) is a reward function, \(\mathcal{R}^{\textcolor{red}{a}}_s = \mathbb{E}[R_{t+1}\ |\ S_t = s, A_t = {\textcolor{red}{a}}]\)

  • \(\gamma\) is a discount factor \(\gamma \in [0,1]\).

Example: Student MDP

The agent exerts control over the MDP via its actions; the goal is to find the best path through the decision-making process so as to maximise reward

Policies (1)

Definition

A (stochastic) policy \(\pi\) is a distribution over actions given states,

\[ \pi(a \mid s) = \mathbb{P}\!\left[\, A_t = a \;\middle|\; S_t = s \,\right] \]

  • A policy fully defines the behaviour of an agent

  • MDP policies depend on the current state (not the history)

  • i.e. Policies are stationary (time-independent), \(A_t \sim \pi(\,\cdot \mid S_t), \quad \forall t > 0\)

Policies (2) - Recovering Markov reward process from MDP

Given an MDP \(\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\) and a policy \(\pi\)

  • The state sequence \(S_1, S_2, \ldots\) is a Markov process \(\langle \mathcal{S}, \mathcal{P}^\pi \rangle\)

  • The state and reward sequence \(S_1, R_2, S_2, \ldots\) is a Markov reward process \(\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle\), where

\[ \mathcal{P}^{\pi}_{s,s'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^{a}_{s s'} \]

\[ \mathcal{R}^{\pi}_{s} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^{a}_{s} \]
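A minimal sketch (mine, with a hypothetical 2-state, 2-action MDP whose numbers are purely illustrative) of how a policy collapses an MDP into an MRP:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[a, s, s2] = P(S_{t+1}=s2 | S_t=s, A_t=a),  R[a, s] = E[R_{t+1} | S_t=s, A_t=a]
P = np.array([
    [[0.7, 0.3], [0.0, 1.0]],   # action 0
    [[0.2, 0.8], [0.5, 0.5]],   # action 1
])
R = np.array([
    [1.0,  0.0],                # action 0
    [2.0, -1.0],                # action 1
])
pi = np.array([                 # pi[s, a] = pi(a | s), a stochastic policy
    [0.5, 0.5],
    [0.9, 0.1],
])

# Induced MRP:  P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)
assert np.allclose(P_pi.sum(axis=1), 1.0)   # still a valid transition matrix
```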

Value Function

Definition

The state-value function \(v_\pi(s)\) of an MDP is the expected return starting from state \(s\), and then following policy \(\pi\)

\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s \,\right] \]

Definition

The action-value function \(q_\pi(s, a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\)

\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s,\, A_t = a \,\right] \]

Example: State-Value Function for Student MDP

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of successor state,

\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma v_\pi(S_{t+1}) \;\middle|\; S_t = s \,\right] \]


We can do the same for the \(q\) values: the action-value function can similarly be decomposed,

\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\, A_t = a \,\right] \]

Bellman Expectation Equation for \(v_\pi\) (look ahead)


\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \]

Bellman Expectation Equation for \(q_\pi\) (look ahead)


\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'}\, v_\pi(s') \]

Bellman Expectation Equation for \(v_\pi\) (2)

Bringing it together: agent actions (open circles), environment actions (closed circles)

\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \, v_\pi(s') \right) \]

Bellman Expectation Equation for \(q_\pi\) (2)

The other way around: the same two-step look-ahead can be done for action values

\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a') \]

In both forms the value function is (recursively) equal to the immediate reward in state \(s\) plus the discounted value of the successor state \(s'\) (where you end up)
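Continuing the same hypothetical 2-state MDP, the two expectation equations can be checked numerically (a sketch of mine, not the lecture's method): solve the induced MRP for \(v_\pi\), recover \(q_\pi\) by the one-step look-ahead, and confirm \(v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)\):

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the earlier policy sketch.
P = np.array([[[0.7, 0.3], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])      # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])       # R[a, s]
pi = np.array([[0.5, 0.5], [0.9, 0.1]])       # pi[s, a]
gamma = 0.9                                   # illustrative discount factor

# v_pi from the induced MRP (matrix form of the Bellman expectation equation).
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# One-step look-ahead:  q_pi(s,a) = R^a_s + gamma * sum_s' P^a_{ss'} v_pi(s')
q_pi = R.T + gamma * np.einsum("ast,t->sa", P, v_pi)

# Consistency check:  v_pi(s) = sum_a pi(a|s) q_pi(s,a)
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
```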

Example: Bellman Expectation Equation in Student MDP

Verify the Bellman expectation equation by computing \(v_{\pi}(s)\) for \(s=C3\)

Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely using the induced MRP (as before),

\[ v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \]

with direct solution

\[ v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi \]

  • The Bellman expectation equation gives us a description of the system that we can solve

  • Essentially we average over the policy and then compute a matrix inverse, although this is inefficient!

Optimal Value Function

Optimal Value Function (Finding the best behaviour)

Definition

The optimal state-value function \(v_\ast(s)\) is the maximum value function over all policies

\[ v_\ast(s) = \max_{\pi} v_\pi(s) \]

The optimal action-value function \(q_\ast(s, a)\) is the maximum action-value function over all policies

\[ q_\ast(s, a) = \max_{\pi} q_\pi(s, a) \]


  • The optimal value function specifies the best possible performance in the MDP.

  • An MDP is “solved” when we know the optimal value function.

  • If you know \(q_\ast\), you have the optimal value function

  • So solving means finding \(q_\ast\)

Example: Optimal Value Function for Student MDP

Gives us value function for each state \(s\) (not how to behave)

Example: Optimal Action-Value Function for Student MDP

Gives us the best action \(a\) for each state \(s\) (so we can choose how to behave)

Optimal Policy

Define a partial ordering over policies

\[ \pi \;\geq\; \pi' \quad \text{if } v_\pi(s) \;\geq\; v_{\pi'}(s), \;\forall s \]

Theorem

For any Markov Decision Process

  • There exists an optimal policy \(\pi_\ast\) that is better than or equal to all other policies, \(\pi_\ast \geq \pi, \;\forall \pi\)

  • All optimal policies achieve the optimal value function, \(v_{\pi_\ast}(s) = v_\ast(s)\)

  • All optimal policies achieve the optimal action-value function, \(q_{\pi_\ast}(s,a) = q_\ast(s,a)\)

Finding an Optimal Policy

An optimal policy can be found by maximising over \(q_\ast(s,a)\),

\[ \pi_\ast(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\arg\max}\; q_\ast(s,a) \\[6pt] 0 & \text{otherwise} \end{cases} \]

  • There is always a deterministic optimal policy for any MDP
  • If we know \(q_\ast(s,a)\), we immediately have the optimal policy
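A minimal sketch (mine, with made-up \(q_\ast\) values) of reading off this deterministic policy:

```python
import numpy as np

# Hypothetical optimal action values q_*(s, a) for 3 states and 2 actions.
q_star = np.array([
    [1.0, 4.0],
    [6.0, 2.0],
    [0.0, 0.0],
])

# Deterministic optimal policy: in each state, pick an action maximising q_*.
pi_star = np.argmax(q_star, axis=1)
print(pi_star)   # -> array([1, 0, 0])
```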

Example: Optimal Policy for Student MDP

Red arcs (actions) represent the optimal policy: it picks the action with the highest \(q_\ast\)

Bellman Optimality Equation for \(v_\ast\) (look ahead)

The optimal value functions are recursively related by the Bellman optimality equations:

\[ v_\ast(s) = \max\limits_{a} q_\ast(s,a) \]

Working backwards through the backup diagram we get \(v_\ast(s)\): the best action over all policies, i.e. taking a max over actions instead of an average

Bellman Optimality Equation for \(q_\ast\) (look ahead)


\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \]

Considers where the environment might take us, by averaging over successor states (looking ahead) and backing up their values (inductively)

Bellman Optimality Equation for \(v_\ast\) (2)

Bringing it together (two-step look ahead): agent actions (open circles), environment actions (closed circles)

\[ v_\ast(s) = \max\limits_{a} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \right) \]

Bellman Optimality Equation for \(q_\ast\) (2)


\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, \max\limits_{a'} q_\ast(s',a') \]

Determines \(q_\ast\) by doing the same look-ahead in the other order, from the environment's perspective: average over successor states first, then max over the next action \(a'\)
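To make the fixed point concrete, here is a sketch (mine, again using the hypothetical 2-state MDP from earlier) that repeatedly applies this \(q_\ast\) backup until it converges; this is essentially the idea behind the iterative methods listed at the end of the module:

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the earlier sketches.
P = np.array([[[0.7, 0.3], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])      # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])       # R[a, s]
gamma = 0.9

# Repeatedly apply  q(s,a) <- R^a_s + gamma * sum_s' P^a_{ss'} max_a' q(s',a')
q = np.zeros((2, 2))                          # q[s, a]
for _ in range(1000):
    q = R.T + gamma * np.einsum("ast,t->sa", P, q.max(axis=1))

v_star = q.max(axis=1)                        # v_*(s) = max_a q_*(s, a)
pi_star = q.argmax(axis=1)                    # a deterministic optimal policy
print(q.round(3), v_star.round(3), pi_star)
```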

Example: Bellman Optimality Equation in Student MDP

Compute \(v_{\ast}(s)\) for \(s=C1\) looking one step ahead (no environment actions in \(C1\))

Solving the Bellman Optimality Equation

Bellman Optimality Equation is non-linear

  • No closed form solution (in general)

Many iterative solution methods

  • Value Iteration (not covered in this subject)
  • Policy Iteration (not covered in this subject)
  • Q-learning (covered in Module 9)
  • SARSA (covered in Module 9)