Markov decision processes formally describe an environment for reinforcement learning
Where the environment is fully observable, i.e. the current state completely characterises the process
Almost all RL problems can be formalised as MDPs, e.g.
A Markov decision process (MDP) is a Markov reward process with decisions.
It is an environment in which all states are Markov.
We introduce agency in terms of actions.
Definition
A Markov Decision Process is a tuple \(\langle \mathcal{S}, \textcolor{red}{\mathcal{A}}, \mathcal{P}, \mathcal{R}, \gamma \rangle\)
\(\mathcal{S}\) is a finite set of states
\(\textcolor{red}{\mathcal{A}}\) is a finite set of actions
\(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}^{\textcolor{red}{a}}_{ss'} = \mathbb{P} [S_{t+1} = s' \mid S_t = s, A_t = \textcolor{red}{a}]\)
\(\mathcal{R}\) is a reward function, \(\mathcal{R}^{\textcolor{red}{a}}_s = \mathbb{E}[R_{t+1}\ |\ S_t = s, A_t = {\textcolor{red}{a}}]\)
\(\gamma\) is a discount factor \(\gamma \in [0,1]\).
The agent exerts control over the MDP via actions; the goal is to find the best path through the decision-making process to maximise rewards
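As a minimal sketch, the tuple can be stored as plain arrays. The 2-state, 2-action MDP below is hypothetical (all numbers are made up for illustration), and the array layout `P[a, s, s2]`, `R[a, s]` is an assumption of this example rather than a fixed convention.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative only.
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor in [0, 1]

# P[a, s, s2] = probability of moving to s2 after taking action a in state s
P = np.array([
    [[0.8, 0.2],   # action 0, from state 0
     [0.1, 0.9]],  # action 0, from state 1
    [[0.5, 0.5],   # action 1, from state 0
     [0.0, 1.0]],  # action 1, from state 1
])

# R[a, s] = expected immediate reward for taking action a in state s
R = np.array([
    [1.0, 0.0],    # action 0
    [2.0, -1.0],   # action 1
])

# Every row P[a, s, :] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
assert np.all(P >= 0)
```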
Definition
A (stochastic) policy \(\pi\) is a distribution over actions given states,
\[ \pi(a \mid s) = \mathbb{P}\!\left[\, A_t = a \;\middle|\; S_t = s \,\right] \]
A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
i.e. policies are stationary (time-independent), \(A_t \sim \pi(\,\cdot \mid S_t), \quad \forall t > 0\)
Given an MDP \(\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\) and a policy \(\pi\)
The state sequence \(S_1, S_2, \ldots\) is a Markov process \(\langle \mathcal{S}, \mathcal{P}^\pi \rangle\)
The state and reward sequence \(S_1, R_2, S_2, \ldots\) is a Markov reward process \(\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle\), where
\[ \mathcal{P}^{\pi}_{s,s'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^{a}_{s s'} \]
\[ \mathcal{R}^{\pi}_{s} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^{a}_{s} \]
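The two averaging formulas above can be sketched in a few lines of NumPy. The function name `induced_mrp`, the array layouts, and the example numbers are assumptions made for illustration.

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Average an MDP's dynamics and rewards over a policy.

    P:  (A, S, S) array, P[a, s, s2] = transition probability
    R:  (A, S) array,    R[a, s]     = expected immediate reward
    pi: (S, A) array,    pi[s, a]    = probability of action a in state s
    """
    P_pi = np.einsum("sa,ast->st", pi, P)  # sum_a pi(a|s) P^a_{ss'}
    R_pi = np.einsum("sa,as->s", pi, R)    # sum_a pi(a|s) R^a_s
    return P_pi, R_pi

# Hypothetical 2-state, 2-action MDP and a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
pi = np.full((2, 2), 0.5)

P_pi, R_pi = induced_mrp(P, R, pi)
print(P_pi)  # rows still sum to 1, so this is a valid Markov chain
print(R_pi)
```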
Definition
The state-value function \(v_\pi(s)\) of an MDP is the expected return starting from state \(s\), and then following policy \(\pi\)
\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s \,\right] \]
Definition
The action-value function \(q_\pi(s, a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\)
\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s,\, A_t = a \,\right] \]
The state-value function can again be decomposed into immediate reward plus discounted value of successor state,
\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma v_\pi(S_{t+1}) \;\middle|\; S_t = s \,\right] \]
The action-value function can similarly be decomposed,
\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\, A_t = a \,\right] \]
\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \]
\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'}\, v_\pi(s') \]
Bringing it together: agent actions (open circles), environment actions (closed circles)
\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \, v_\pi(s') \right) \]
The other way around: the same two-step look-ahead can be written for action values
\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a') \]
In both forms, the value function is (recursively) equal to the immediate reward plus the discounted value of the successor state \(s'\) (where you end up)
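One way to see the recursion at work is to apply the Bellman expectation equation repeatedly as an update until \(v\) stops changing (often called iterative policy evaluation). The sketch below assumes a hypothetical 2-state, 2-action MDP with \(\gamma < 1\); the function names and numbers are illustrative, not part of the notes.

```python
import numpy as np

def q_from_v(P, R, v, gamma):
    """q(s,a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v(s'); returns shape (A, S)."""
    return R + gamma * P @ v

def v_from_q(pi, q):
    """v(s) = sum_a pi(a|s) q(s,a); pi has shape (S, A), q has shape (A, S)."""
    return np.sum(pi * q.T, axis=1)

def evaluate_policy(P, R, pi, gamma, tol=1e-10, max_iters=10_000):
    """Apply the Bellman expectation backup until v changes by less than tol."""
    v = np.zeros(P.shape[1])
    for _ in range(max_iters):
        v_new = v_from_q(pi, q_from_v(P, R, v, gamma))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical MDP (numbers made up) and a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
pi = np.full((2, 2), 0.5)
gamma = 0.9

v_pi = evaluate_policy(P, R, pi, gamma)
q_pi = q_from_v(P, R, v_pi, gamma)
print(v_pi, q_pi)
```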
Verify Bellman Equation to compute \(v_{\pi}(s)\) for \(s=C3\)
The Bellman expectation equation can be expressed concisely using the induced MRP (as before),
\[ v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \]
with direct solution
\[ v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi \]
The Bellman expectation equation gives us a linear system that we can solve directly
Essentially we average over the policy and then compute an inverse, although this is inefficient for large state spaces!
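A minimal sketch of the direct solution, assuming a small hypothetical induced MRP (the numbers are made up). It uses `np.linalg.solve` on \((I - \gamma \mathcal{P}^\pi) v = \mathcal{R}^\pi\) rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

# Hypothetical induced MRP (values chosen for illustration only).
P_pi = np.array([[0.65, 0.35],
                 [0.05, 0.95]])
R_pi = np.array([1.5, -0.5])
gamma = 0.9

# Solve (I - gamma * P_pi) v = R_pi instead of computing the inverse.
I = np.eye(len(R_pi))
v_pi = np.linalg.solve(I - gamma * P_pi, R_pi)

# The solution should satisfy the Bellman expectation equation.
assert np.allclose(v_pi, R_pi + gamma * P_pi @ v_pi)
print(v_pi)
```

A dense direct solve costs on the order of \(n^3\) operations for \(n\) states, which is why this route is only practical for small problems.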
Definition
The optimal state-value function \(v_\ast(s)\) is the maximum value function over all policies
\[ v_\ast(s) = \max_{\pi} v_\pi(s) \]
The optimal action-value function \(q_\ast(s, a)\) is the maximum action-value function over all policies
\[ q_\ast(s, a) = \max_{\pi} q_\pi(s, a) \]
The optimal value function specifies the best possible performance in the MDP.
An MDP is “solved” when we know the optimal value function.
If you know \(q_\ast\), you have the optimal value function
So solving means finding \(q_\ast\)
\(v_\ast\) gives us the value of each state \(s\) (but not how to behave)
\(q_\ast\) gives us the best action \(a\) for each state \(s\) (so we can choose how to act)
Define a partial ordering over policies
\[ \pi \;\geq\; \pi' \quad \text{if } v_\pi(s) \;\geq\; v_{\pi'}(s), \;\forall s \]
Theorem For any Markov Decision Process
There exists an optimal policy \(\pi_\ast\) that is better than or equal to all other policies, \(\pi_\ast \geq \pi, \;\forall \pi\)
All optimal policies achieve the optimal value function, \(v_{\pi_\ast}(s) = v_\ast(s)\)
All optimal policies achieve the optimal action-value function, \(q_{\pi_\ast}(s,a) = q_\ast(s,a)\)
An optimal policy can be found by maximising over \(q_\ast(s,a)\),
\[ \pi_\ast(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\arg\max}\; q_\ast(s,a) \\[6pt] 0 & \text{otherwise} \end{cases} \]
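The case definition above amounts to a row-wise argmax. A small sketch, assuming \(q_\ast\) is already available as a table; the table values and the name `greedy_policy` are hypothetical.

```python
import numpy as np

def greedy_policy(q):
    """Return pi[s, a] that puts probability 1 on argmax_a q[s, a]."""
    n_states, n_actions = q.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), np.argmax(q, axis=1)] = 1.0
    return pi

# Hypothetical q* table: rows are states, columns are actions.
q_star = np.array([[1.0, 3.0],
                   [2.0, 0.5],
                   [4.0, 4.0]])  # tie: np.argmax picks the first maximiser
pi_star = greedy_policy(q_star)
print(pi_star)
```

When several actions tie, any of them is optimal; this sketch simply keeps the first.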
Red arcs (actions) represent optimal policy: picks highest \(q_\ast\)
The optimal value functions are recursively related by the Bellman optimality equations:
\[ v_\ast(s) = \max\limits_{a} q_\ast(s,a) \]
Working backwards using backup diagrams, we get \(v_\ast(s)\) by taking the max over actions instead of the average under a policy
\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \]
Considers where the environment might take us by averaging (looking ahead) and backing up (inductively)
Bringing it together (two-step look ahead): agent actions (open circles), environment actions (closed circles)
\[ v_\ast(s) = \max\limits_{a} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \right) \]
\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, \max\limits_{a'} q_\ast(s',a') \]
Determines \(q_\ast\) by reordering the look-ahead, starting from the environment's perspective
Compute \(v_{\ast}(s)\) for \(s=C1\) looking one step ahead (no environment actions in \(C1\))
Bellman Optimality Equation is non-linear
Many iterative solution methods
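Value iteration is one such iterative method: it applies the Bellman optimality equation as an update until the values stop changing. The sketch below assumes a hypothetical 2-state, 2-action MDP with \(\gamma < 1\); the function name, tolerance, and numbers are illustrative choices.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate v(s) <- max_a [R^a_s + gamma * sum_{s'} P^a_{ss'} v(s')].

    P: (A, S, S) transition probabilities, R: (A, S) expected rewards.
    Returns the final v (shape S) and the matching q (shape A, S).
    """
    v = np.zeros(P.shape[1])
    for _ in range(max_iters):
        q = R + gamma * P @ v     # one-step look-ahead for every action
        v_new = q.max(axis=0)     # max over actions replaces the policy average
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < tol:
            break
    return v, R + gamma * P @ v

# Hypothetical MDP; numbers are made up for illustration.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])

v_star, q_star = value_iteration(P, R, gamma=0.9)
print(v_star)
print(q_star.argmax(axis=0))  # greedy action in each state
```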
Last Module: Learning & Relaxation
This Module: learn model directly from experience
Integrates learning and planning into a single architecture
Model-Free RL
No model
Learn value function (and/or policy) from experience
Model-Based RL
Learn a model from experience
Plan value function (and/or policy) from model
Lookahead by planning (or thinking) about what the value function will be
Replace real world with the agent’s (simulated) model of the environment
Advantages:
Can efficiently learn model by supervised learning methods
Can reason about model uncertainty, and even take actions to reduce uncertainty
Disadvantages:
Learn a model first, then construct a value function from it \(\Rightarrow\) two sources of approximation error
A model \(\mathcal{M}\) is a representation of an MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle\), parameterised by \(\eta\)
So a model \(\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\) represents state transitions \(\mathcal{P}_\eta \approx \mathcal{P}\) and rewards \(\mathcal{R}_\eta \approx \mathcal{R}\)
\[ S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t) \]
\[ R_{t+1} \sim \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t) \]
Typically assume conditional independence between state transitions and rewards
\[ \mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] \;=\; \mathbb{P}[S_{t+1} \mid S_t, A_t] \; \mathbb{P}[R_{t+1} \mid S_t, A_t] \]
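A rough sketch of drawing one simulated transition from a tabular model, with the next state and reward generated independently given \((S_t, A_t)\). The arrays `P_eta` and `R_eta` and their numbers are hypothetical, and returning the expected reward (rather than sampling a reward) is a simplifying assumption of this example.

```python
import numpy as np

def sample_step(P_eta, R_eta, s, a, rng):
    """Draw one simulated transition (s, a) -> (r, s_next) from a tabular model."""
    # Next state is sampled from the model's transition distribution.
    s_next = rng.choice(P_eta.shape[2], p=P_eta[a, s])
    # Assumption: the model returns the expected reward for (s, a).
    r = R_eta[a, s]
    return r, s_next

# Hypothetical model parameters, indexed as P_eta[a, s, s2] and R_eta[a, s].
P_eta = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.0, 1.0]]])
R_eta = np.array([[1.0, 0.0],
                  [2.0, -1.0]])

rng = np.random.default_rng(0)
r, s_next = sample_step(P_eta, R_eta, s=0, a=1, rng=rng)
print(r, s_next)
```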
Note you can learn from each (one-step) transition, treating the following step as the supervisor for the prior step.
Goal: estimate model \(\mathcal{M}_\eta\) from experience \(\{S_1, A_1, R_2, \ldots, S_T\}\)
This is a supervised learning problem
\[ \begin{aligned} S_1, A_1 &\;\to\; R_2, S_2 \\ S_2, A_2 &\;\to\; R_3, S_3 \\ &\;\vdots \\ S_{T-1}, A_{T-1} &\;\to\; R_T, S_T \end{aligned} \]
Learning \(s,a \to r\) is a regression problem
Learning \(s,a \to s'\) is a density estimation problem
Pick loss function, e.g. mean-squared error, KL divergence, …
Find parameters \(\eta\) that minimise empirical loss
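As an illustration of the regression half, the sketch below fits a linear reward model by least squares, i.e. it minimises the empirical mean-squared error. The feature vectors, the "true" weights, and the noise level are all synthetic and made up for this example; in practice the features would come from the observed \((S_t, A_t)\) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic experience: a feature vector x_t for each (S_t, A_t) and the
# observed reward R_{t+1}. All of this data is fabricated for the sketch.
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])            # unknown in a real problem
r = X @ true_w + 0.01 * rng.normal(size=200)   # noisy rewards

# Find eta minimising sum_t (r_t - x_t . eta)^2.
eta, *_ = np.linalg.lstsq(X, r, rcond=None)

def predict_reward(x):
    """Reward predicted by the fitted linear model for feature vector x."""
    return x @ eta

print(eta)
```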
Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
\(\cdots\) almost any supervised learning model
Model is an explicit MDP, \(\hat{\mathcal{P}}, \hat{\mathcal{R}}\)
Count visits \(N(s,a)\) to each state-action pair
\[ \begin{aligned} \hat{\mathcal{P}}^{a}_{s,s'} & = \frac{1}{N(s,a)} \sum_{t=1}^{T-1} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')\\[0pt] \hat{\mathcal{R}}^{a}_{s} & = \frac{1}{N(s,a)} \sum_{t=1}^{T-1} \mathbf{1}(S_t, A_t = s, a)\, R_{t+1} \end{aligned} \]
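The count-and-average estimates can be sketched as follows, assuming experience is given as a list of `(s, a, r, s_next)` tuples with integer state and action indices. The function name and the tiny data set are hypothetical, and leaving unvisited pairs at zero is an arbitrary choice of this example.

```python
import numpy as np

def table_lookup_model(transitions, n_states, n_actions):
    """Estimate P_hat[a, s, s2] and R_hat[a, s] from (s, a, r, s_next) tuples."""
    N = np.zeros((n_actions, n_states))                  # visit counts N(s, a)
    counts = np.zeros((n_actions, n_states, n_states))   # transition counts
    reward_sum = np.zeros((n_actions, n_states))         # summed rewards
    for s, a, r, s_next in transitions:
        N[a, s] += 1
        counts[a, s, s_next] += 1
        reward_sum[a, s] += r

    visited = N > 0
    P_hat = np.zeros_like(counts)
    R_hat = np.zeros_like(reward_sum)
    # Assumption: pairs that were never visited keep all-zero estimates.
    P_hat[visited] = counts[visited] / N[visited][:, None]
    R_hat[visited] = reward_sum[visited] / N[visited]
    return P_hat, R_hat

# Hypothetical experience with two states and a single action.
transitions = [(0, 0, 1.0, 1), (0, 0, 0.0, 0), (1, 0, 2.0, 1)]
P_hat, R_hat = table_lookup_model(transitions, n_states=2, n_actions=1)
print(P_hat)
print(R_hat)
```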
Two states \(A,B\); no discounting; \(8\) episodes of experience
\(A, 0, B, 0\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 0\)
We have constructed a table lookup model from the experience
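Worked through by hand, the counts give the following estimates. This assumes each episode terminates after its last listed reward (written "end" below) and that there is only one implicit action, so the action index is dropped. State \(A\) is visited once and state \(B\) eight times, with reward 1 on six of those visits:

\[ \begin{aligned} \hat{\mathcal{P}}_{A,B} &= \frac{1}{1} = 1, & \hat{\mathcal{R}}_{A} &= \frac{0}{1} = 0 \\ \hat{\mathcal{P}}_{B,\text{end}} &= \frac{8}{8} = 1, & \hat{\mathcal{R}}_{B} &= \frac{6 \cdot 1 + 2 \cdot 0}{8} = 0.75 \end{aligned} \]

Under this model, with no discounting, \(v(B) = 0.75\) and \(v(A) = 0 + v(B) = 0.75\).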