05 Markov Decision Processes (MDPs)


Introduction to Markov Decision Processes (MDPs)

Markov decision processes formally describe an environment for reinforcement learning

where the environment is fully observable, i.e. the current state completely characterises the process

Almost all RL problems can be formalised as MDPs, e.g.

  • Optimal control primarily deals with continuous MDPs
  • Partially observable problems can be converted into MDPs
  • Bandits (from machine learning) are MDPs with just one state

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions.

  • It is an environment in which all states are Markov.

  • Agency is introduced in the form of actions.


Definition

A Markov Decision Process is a tuple \(\langle \mathcal{S}, \textcolor{red}{\mathcal{A}}, \mathcal{P}, \mathcal{R}, \gamma \rangle\)

  • \(\mathcal{S}\) is a finite set of states

  • \(\textcolor{red}{\mathcal{A}}\) is a finite set of actions

  • \(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}^{\textcolor{red}{a}}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = \textcolor{red}{a}]\)

  • \(\mathcal{R}\) is a reward function, \(\mathcal{R}^{\textcolor{red}{a}}_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = \textcolor{red}{a}]\)

  • \(\gamma\) is a discount factor \(\gamma \in [0,1]\).
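As a concrete instance of this tuple, a small finite MDP can be written down directly as nested dictionaries. This is only a sketch: the states, actions, and numbers below are hypothetical, chosen purely to make the definition tangible.

```python
# A minimal finite MDP <S, A, P, R, gamma> as plain dictionaries.
# States s0/s1, actions "stay"/"go", and all numbers are hypothetical.

S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9  # discount factor in [0, 1]

# P[a][s][s'] = probability of landing in s' after taking a in s
P = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s] = expected immediate reward for taking a in s
R = {
    "stay": {"s0": 0.0, "s1": 1.0},
    "go":   {"s0": 0.5, "s1": 0.5},
}

# Each row of P must be a probability distribution over successor states
for a in A:
    for s in S:
        assert abs(sum(P[a][s].values()) - 1.0) < 1e-9
```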

Example: Student MDP

The agent exerts control over the MDP via its actions; the goal is to find the best path through the decision-making process so as to maximise reward

Policies (1)

Definition

A (stochastic) policy \(\pi\) is a distribution over actions given states,

\[ \pi(a \mid s) = \mathbb{P}\!\left[\, A_t = a \;\middle|\; S_t = s \,\right] \]

  • A policy fully defines the behaviour of an agent

  • MDP policies depend on the current state (not the history)

  • i.e. policies are stationary (time-independent), \(A_t \sim \pi(\,\cdot \mid S_t), \quad \forall t > 0\)
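A stochastic policy is just a conditional distribution table. The sketch below (hypothetical states, actions, and probabilities) samples \(A_t \sim \pi(\cdot \mid S_t)\) using only the current state, exactly as stationarity requires.

```python
import random

# pi[s][a] = pi(a | s); states, actions, and probabilities are hypothetical
pi = {
    "s0": {"stay": 0.3, "go": 0.7},
    "s1": {"stay": 1.0, "go": 0.0},
}

def sample_action(pi, s):
    """Draw A_t ~ pi(. | S_t = s); depends only on the current state."""
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# pi(. | s1) is deterministic, so sampling there always returns "stay"
assert sample_action(pi, "s1") == "stay"
```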

Policies (2) - Recovering Markov reward process from MDP

Given an MDP \(\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\) and a policy \(\pi\)

  • The state sequence \(S_1, S_2, \ldots\) is a Markov process \(\langle \mathcal{S}, \mathcal{P}^\pi \rangle\)

  • The state and reward sequence \(S_1, R_2, S_2, \ldots\) is a Markov reward process \(\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle\), where

\[ \mathcal{P}^{\pi}_{s,s'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^{a}_{s s'} \]

\[ \mathcal{R}^{\pi}_{s} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^{a}_{s} \]
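These two averaging formulas translate directly into array operations. The sketch below collapses a hypothetical two-state MDP into its induced MRP, with states indexed by row.

```python
import numpy as np

# Hypothetical two-state MDP; P[a][s, s'] rows are transition distributions
P = {"stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
     "go":   np.array([[0.2, 0.8], [0.8, 0.2]])}
R = {"stay": np.array([0.0, 1.0]),
     "go":   np.array([0.5, 0.5])}
pi = {"stay": np.array([0.3, 1.0]),   # pi[a][s] = pi(a | s)
      "go":  np.array([0.7, 0.0])}

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}  (scale each row s by pi(a|s))
P_pi = sum(pi[a][:, None] * P[a] for a in P)

# R^pi_s = sum_a pi(a|s) R^a_s
R_pi = sum(pi[a] * R[a] for a in R)

# Rows of the induced transition matrix still sum to 1
assert np.allclose(P_pi.sum(axis=1), 1.0)
```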

Value Function

Definition

The state-value function \(v_\pi(s)\) of an MDP is the expected return starting from state \(s\), and then following policy \(\pi\)

\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s \,\right] \]

Definition

The action-value function \(q_\pi(s, a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\)

\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s,\, A_t = a \,\right] \]

Example: State-Value Function for Student MDP

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of successor state,

\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma v_\pi(S_{t+1}) \;\middle|\; S_t = s \,\right] \]


Can do the same thing for the \(q\) values: the action-value function can similarly be decomposed,

\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\, A_t = a \,\right] \]

Bellman Expectation Equation for \(V^{\pi}\) (look ahead)

 

\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \]

Bellman Expectation Equation for \(Q^{\pi}\) (look ahead)

 

\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'}\, v_\pi(s') \]

Bellman Expectation Equation for \(v_{\pi}\) (2)

Bringing it together: agent actions (open circles), environment actions (closed circles)

\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \, v_\pi(s') \right) \]
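Turning this equation into an update rule and iterating it to its fixed point is one way to evaluate a policy (iterative policy evaluation). A minimal sketch, using a hypothetical two-state MDP and policy:

```python
import numpy as np

# Hypothetical two-state MDP and policy
P = {"stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
     "go":   np.array([[0.2, 0.8], [0.8, 0.2]])}
R = {"stay": np.array([0.0, 1.0]),
     "go":   np.array([0.5, 0.5])}
pi = {"stay": np.array([0.3, 1.0]),   # pi[a][s] = pi(a | s)
      "go":  np.array([0.7, 0.0])}
gamma = 0.9

v = np.zeros(2)
for _ in range(2000):
    # v(s) <- sum_a pi(a|s) ( R^a_s + gamma sum_s' P^a_{ss'} v(s') )
    v_new = sum(pi[a] * (R[a] + gamma * P[a] @ v) for a in P)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new
```

Because \(\gamma < 1\) the update is a contraction, so the iteration converges to \(v_\pi\) from any starting point.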

Bellman Expectation Equation for \(q_{\pi}\) (2)

The other way around: can do same thing for action values

\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a') \]

In both forms the value function is (recursively) equal to the immediate reward in state \(s\) plus the discounted value of the successor state \(s'\) (wherever you end up)

Example: Bellman Expectation Equation in Student MDP

Verify Bellman Equation to compute \(v_{\pi}(s)\) for \(s=C3\)

Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely using the induced MRP (as before),

\[ v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \]

with direct solution

\[ v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi \]

  • The Bellman equation gives us a description of the system that can be solved directly

  • Essentially averaging over the policy and then computing a matrix inverse, which is inefficient for large state spaces!
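The direct solution is a single linear solve. A minimal sketch, using a hypothetical induced MRP \(\mathcal{P}^\pi, \mathcal{R}^\pi\) for a two-state problem:

```python
import numpy as np

# Hypothetical induced MRP for a two-state problem
P_pi = np.array([[0.44, 0.56],
                 [0.0,  1.0]])
R_pi = np.array([0.35, 1.0])
gamma = 0.9

# v_pi = (I - gamma P^pi)^{-1} R^pi; solve() avoids forming the inverse
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# The solution satisfies the Bellman expectation equation exactly
assert np.allclose(v_pi, R_pi + gamma * P_pi @ v_pi)
```

The solve costs \(O(n^3)\) in the number of states \(n\), which is why this route only works for small MDPs.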

Optimal Value Function

Optimal Value Function (Finding the best behaviour)

Definition

The optimal state-value function \(v_\ast(s)\) is the maximum value function over all policies

\[ v_\ast(s) = \max_{\pi} v_\pi(s) \]

The optimal action-value function \(q_\ast(s, a)\) is the maximum action-value function over all policies

\[ q_\ast(s, a) = \max_{\pi} q_\pi(s, a) \]


  • The optimal value function specifies the best possible performance in the MDP.

  • An MDP is “solved” when we know the optimal value function.

  • If you know \(q_\ast\), you have the optimal value function

  • So solving means finding \(q_\ast\)

Example: Optimal Value Function for Student MDP

Gives us value function for each state \(s\) (not how to behave)

Example: Optimal Action-Value Function for Student MDP

Gives us best action, \(a\), for each state \(s\) (can choose)

Optimal Policy

Define a partial ordering over policies

\[ \pi \;\geq\; \pi' \quad \text{if } v_\pi(s) \;\geq\; v_{\pi'}(s), \;\forall s \]

Theorem For any Markov Decision Process

  • There exists an optimal policy \(\pi_\ast\) that is better than or equal to all other policies, \(\pi_\ast \geq \pi, \;\forall \pi\)

  • All optimal policies achieve the optimal value function, \(v_{\pi_\ast}(s) = v_\ast(s)\)

  • All optimal policies achieve the optimal action-value function, \(q_{\pi_\ast}(s,a) = q_\ast(s,a)\)

Finding an Optimal Policy

An optimal policy can be found by maximising over \(q_\ast(s,a)\),

\[ \pi_\ast(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\arg\max}\; q_\ast(s,a) \\[6pt] 0 & \text{otherwise} \end{cases} \]

  • There is always a deterministic optimal policy for any MDP
  • If we know \(q_\ast(s,a)\), we immediately have the optimal policy
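Reading the deterministic optimal policy off a \(q_\ast\) table is a one-line argmax. The q-values below are hypothetical:

```python
import numpy as np

# Hypothetical q* table: rows are states, columns are actions
actions = ["stay", "go"]
q_star = np.array([[1.0, 4.0],    # q*(s0, .)
                   [2.5, 2.0]])   # q*(s1, .)

# pi*(s) = argmax_a q*(s, a), deterministic in every state
pi_star = {s: actions[int(np.argmax(q_star[s]))] for s in range(len(q_star))}
```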

Example: Optimal Policy for Student MDP

Red arcs (actions) represent optimal policy: picks highest \(q_\ast\)

Bellman Optimality Equation for \(v_\ast\) (look ahead)

The optimal value functions are recursively related by the Bellman optimality equations:

\[ v_\ast(s) = \max\limits_{a} q_\ast(s,a) \]

Working backwards using backup diagrams we get \(v_\ast(s)\): the value of the best action, taking a max over actions instead of an average

Bellman Optimality Equation for \(Q^\ast\) (look ahead)

 

\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \]

Considers where the environment might take us by averaging (looking ahead) and backing up (inductively)

Bellman Optimality Equation for \(V^*\) (2)

Bringing it together (two-step look ahead): agent actions (open circles), environment actions (closed circles)

\[ v_\ast(s) = \max\limits_{a} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \right) \]
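Like the expectation equation, the optimality equation can be iterated as an update until it reaches its fixed point \(v_\ast\) (this is the idea behind value iteration, one of the iterative methods named later). A sketch on a hypothetical two-state MDP:

```python
import numpy as np

# Hypothetical two-state MDP
P = {"stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
     "go":   np.array([[0.2, 0.8], [0.8, 0.2]])}
R = {"stay": np.array([0.0, 1.0]),
     "go":   np.array([0.5, 0.5])}
gamma = 0.9

v = np.zeros(2)
for _ in range(2000):
    # v(s) <- max_a ( R^a_s + gamma sum_s' P^a_{ss'} v(s') )
    v_new = np.max([R[a] + gamma * P[a] @ v for a in P], axis=0)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new
```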

Bellman Optimality Equation for \(Q^*\) (2)

 

\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, \max\limits_{a'} q_\ast(s',a') \]

Determines \(q_\ast\) the other way around, from the environment's perspective: average over where the environment takes us, then maximise over the next action \(a'\)

Example: Bellman Optimality Equation in Student MDP

Compute \(v_{\ast}(s)\) for \(s=C1\) looking one step ahead (no environment actions in \(C1\))

Solving the Bellman Optimality Equation

Bellman Optimality Equation is non-linear

  • No closed form solution (in general)

Many iterative solution methods

  • Value Iteration (not covered in this subject)
  • Policy Iteration (not covered in this subject)
  • Q-learning (covered in Module 9)
  • SARSA (covered in Module 9)

Model-Based Reinforcement Learning


Last Module: Learning & Relaxation

This Module: learn a model directly from experience

  • \(\ldots\) and use planning to construct a value function or policy

Integrates learning and planning into a single architecture

Model-Based Reinforcement Learning

Model-Based and Model-Free RL

Model-Free RL

  • No model

  • Learn value function (and/or policy) from experience

Model-Based RL

  • Learn a model from experience

  • Plan value function (and/or policy) from model

Lookahead by planning (or thinking) about what the value function will be

Model-Free RL

Model-Based RL

Replace the real world with the agent’s (simulated) model of the environment

  • Supports rollouts (lookaheads) under imagined actions to reason about what value function will be, without further environment interaction

Model-Based RL (2)

Advantages of Model-Based RL

Advantages:

  • Can efficiently learn model by supervised learning methods

  • Can reason about model uncertainty, and even take actions to reduce uncertainty

Disadvantages:

  • First learn a model, then construct a value function

\(\;\;\;\Rightarrow\) two sources of approximation error

Learning a Model

What is a Model?

A model \(\mathcal{M}\) is a representation of an MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle\), parameterised by \(\eta\)

  • We will assume state space \(\mathcal{S}\) and action space \(\mathcal{A}\) are known

So a model \(\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\) represents state transitions
\(\mathcal{P}_\eta \approx \mathcal{P}\) and rewards \(\mathcal{R}_\eta \approx \mathcal{R}\)

\[ S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t) \]

\[ R_{t+1} \sim \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t) \]


Typically assume conditional independence between state transitions and rewards

\[ \mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] \;=\; \mathbb{P}[S_{t+1} \mid S_t, A_t] \; \mathbb{P}[R_{t+1} \mid S_t, A_t] \]

Note you can learn from each (one-step) transition, treating the following step as the supervisor for the prior step.

Model Learning

Goal: estimate model \(\mathcal{M}_\eta\) from experience \(\{S_1, A_1, R_2, \ldots, S_T\}\)

This is a supervised learning problem

\[ \begin{aligned} S_1, A_1 &\;\to\; R_2, S_2 \\ S_2, A_2 &\;\to\; R_3, S_3 \\ &\;\vdots \\ S_{T-1}, A_{T-1} &\;\to\; R_T, S_T \end{aligned} \]

  • Learning \(s,a \to r\) is a regression problem

  • Learning \(s,a \to s'\) is a density estimation problem

  • Pick loss function, e.g. mean-squared error, KL divergence, …

  • Find parameters \(\eta\) that minimise empirical loss
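Chopping a trajectory into one-step training pairs is mechanical. A sketch with a hypothetical flat \(S_1, A_1, R_2, S_2, \ldots\) sequence:

```python
# Each one-step transition becomes a supervised example (S_t, A_t) -> (R_{t+1}, S_{t+1}).
# The trajectory below is hypothetical, stored flat as S, A, R, S, A, R, S, ...
trajectory = ["s0", "go", 0.5, "s1", "stay", 1.0, "s1"]

def transitions(traj):
    """Collect ((s, a), (r, s')) training pairs from a flat trajectory."""
    pairs = []
    for i in range(0, len(traj) - 3, 3):
        s, a, r, s_next = traj[i], traj[i + 1], traj[i + 2], traj[i + 3]
        pairs.append(((s, a), (r, s_next)))
    return pairs

data = transitions(trajectory)
# The r targets would feed a regressor; the s' targets a density estimator
```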

Examples of Models

  • Table Lookup Model

  • Linear Expectation Model

  • Linear Gaussian Model

  • Gaussian Process Model

  • Deep Belief Network Model

  • \(\cdots\) almost any supervised learning model

Table Lookup Model

Model is an explicit MDP, \(\hat{\mathcal{P}}, \hat{\mathcal{R}}\)

  • Count visits \(N(s,a)\) to each state–action pair (parametric approach)

\[ \begin{align*} \hat{\mathcal{P}}^{a}_{s,s'} & = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(S_t = s, A_t = a, S_{t+1} = s')\\[0pt] \hat{\mathcal{R}}^{a}_{s} & = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(S_t = s, A_t = a)\, R_{t+1} \end{align*} \]

  • Alternatively (a simple non-parametric approach)
    • At each time-step \(t\), record experience tuple \(\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle\)
    • To sample model, randomly pick tuple matching \(\langle s,a,\cdot,\cdot \rangle\)
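Both formulas reduce to bookkeeping over counts. A sketch with hypothetical \((s, a, r, s')\) transitions:

```python
from collections import defaultdict

# Hypothetical experience: (s, a, r, s') transitions
experience = [("A", "x", 0.0, "B"), ("A", "x", 0.0, "B"),
              ("B", "x", 1.0, "B"), ("B", "x", 0.0, "B")]

N = defaultdict(int)        # visit counts N(s, a)
trans = defaultdict(int)    # counts of (s, a, s') triples
r_sum = defaultdict(float)  # summed rewards per (s, a)

for s, a, r, s_next in experience:
    N[(s, a)] += 1
    trans[(s, a, s_next)] += 1
    r_sum[(s, a)] += r

# \hat P^a_{ss'} = count(s,a,s') / N(s,a);  \hat R^a_s = mean reward from (s,a)
P_hat = {k: trans[k] / N[k[:2]] for k in trans}
R_hat = {k: r_sum[k] / N[k] for k in r_sum}
```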

AB Example (Revisited) - Building a Model

Two states \(A,B\); no discounting; \(8\) episodes of experience

  • \(A, 0, B, 0\)
  • \(B, 1\)
  • \(B, 1\)
  • \(B, 1\)
  • \(B, 1\)
  • \(B, 1\)
  • \(B, 1\)
  • \(B, 0\)

We have constructed a table lookup model from the experience
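The episodes above suffice to reproduce that model in code. In the sketch below there are no actions, so the tables are indexed by state alone; the learned model comes out as \(\hat{\mathcal{P}}_{AB} = 1\), \(\hat{\mathcal{R}}_A = 0\), and \(\hat{\mathcal{R}}_B = 0.75\) (six of the eight visits to \(B\) earned reward 1).

```python
# Table lookup model for the AB example: 8 episodes, format S, R, S, R, ...
episodes = [["A", 0, "B", 0],
            ["B", 1], ["B", 1], ["B", 1], ["B", 1], ["B", 1], ["B", 1],
            ["B", 0]]

counts, r_sum, trans = {}, {}, {}
for ep in episodes:
    for i in range(0, len(ep), 2):
        s, r = ep[i], ep[i + 1]
        counts[s] = counts.get(s, 0) + 1
        r_sum[s] = r_sum.get(s, 0) + r
        if i + 2 < len(ep):  # a successor state was observed
            trans[(s, ep[i + 2])] = trans.get((s, ep[i + 2]), 0) + 1

R_hat = {s: r_sum[s] / counts[s] for s in counts}     # mean reward per state
P_hat = {k: trans[k] / counts[k[0]] for k in trans}   # empirical transitions
# R_hat == {"A": 0.0, "B": 0.75}; P_hat == {("A", "B"): 1.0}
```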