Markov decision processes formally describe an environment for reinforcement learning
where the environment is fully observable, i.e. the current state completely characterises the process.
Almost all RL problems can be formalised as MDPs.
A Markov decision process (MDP) is a Markov reward process with decisions.
It is an environment in which all states are Markov.
We introduced agency in terms of actions.
Definition
A Markov Decision Process is a tuple \(<\mathcal{S}, \mathcal{\textcolor{red}{A}}, \mathcal{P}, \mathcal{R}, \gamma>\)
\(\mathcal{S}\) is a finite set of states
\(\mathcal{\color{red}{A}}\) is a finite set of actions
\(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}^{\textcolor{red}{a}}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = \textcolor{red}{a}\right]\)
\(\mathcal{R}\) is a reward function, \(\mathcal{R}^{\textcolor{red}{a}}_s = \mathbb{E}[R_{t+1}\ |\ S_t = s, A_t = {\textcolor{red}{a}}]\)
\(\gamma\) is a discount factor \(\gamma \in [0,1]\).
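As a concrete (hypothetical) illustration, a small finite MDP can be stored as numpy arrays; every number below is made up purely for illustration and the array layout (`P[a, s, s']`, `R[a, s]`) is an assumption of this sketch, not part of the slides.

```python
import numpy as np

# Illustrative encoding of a small finite MDP:
# P[a, s, s'] = transition probability, R[a, s] = expected immediate reward.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0],
        [0.0, 0.8, 0.2],
        [0.0, 0.0, 1.0]]
P[1] = [[0.2, 0.8, 0.0],
        [0.0, 0.1, 0.9],
        [0.0, 0.0, 1.0]]
R = np.array([[1.0, 0.5, 0.0],    # R[a=0, s]
              [0.0, 2.0, 0.0]])   # R[a=1, s]

assert np.allclose(P.sum(axis=2), 1.0)  # every row of each P^a is a distribution
```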
The agent exerts control over the MDP via its actions; the goal is to find the best path through the decision-making process so as to maximise reward.
Definition
A (stochastic) policy \(\pi\) is a distribution over actions given states,
\[ \pi(a \mid s) = \mathbb{P}\!\left[\, A_t = a \;\middle|\; S_t = s \,\right] \]
A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
i.e. Policies are stationary (time-independent), \(A_t \sim \pi(\,\cdot \mid S_t), \quad \forall t > 0\)
Given an MDP \(\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\) and a policy \(\pi\)
The state sequence \(S_1, S_2, \ldots\) is a Markov process \(\langle \mathcal{S}, \mathcal{P}^\pi \rangle\)
The state and reward sequence \(S_1, R_2, S_2, \ldots\) is a Markov reward process \(\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle\), where
\[ \mathcal{P}^{\pi}_{s,s'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^{a}_{s s'} \]
\[ \mathcal{R}^{\pi}_{s} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^{a}_{s} \]
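A minimal sketch of this averaging step in numpy, assuming the transition tensor and reward matrix layout used in the earlier sketch (`P[a, s, s']`, `R[a, s]`) and a stochastic policy matrix `pi[s, a]`:

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Average the MDP dynamics over a stochastic policy pi (shape S x A) to
    obtain the induced MRP <P^pi, R^pi>.  P has shape (A, S, S), R has shape (A, S)."""
    P_pi = np.einsum('sa,ast->st', pi, P)  # P^pi_{s s'} = sum_a pi(a|s) P^a_{s s'}
    R_pi = np.einsum('sa,as->s', pi, R)    # R^pi_s      = sum_a pi(a|s) R^a_s
    return P_pi, R_pi
```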
Definition
The state-value function \(v_\pi(s)\) of an MDP is the expected return starting from state \(s\), and then following policy \(\pi\)
\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s \,\right] \]
Definition
The action-value function \(q_\pi(s, a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\)
\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, G_t \;\middle|\; S_t = s,\, A_t = a \,\right] \]
The state-value function can again be decomposed into immediate reward plus discounted value of successor state,
\[ v_\pi(s) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma v_\pi(S_{t+1}) \;\middle|\; S_t = s \,\right] \]
The action-value function can similarly be decomposed,
\[ q_\pi(s, a) = \mathbb{E}_\pi \!\left[\, R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \;\middle|\; S_t = s,\, A_t = a \,\right] \]
\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \]
\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'}\, v_\pi(s') \]
Bringing it together: agent actions (open circles), environment actions (closed circles)
\[ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \, v_\pi(s') \right) \]
The other way around: the same two-step expansion applies to action values,
\[ q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{s s'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a') \]
In both forms the value function is (recursively) equal to the immediate reward in state \(s\) plus the discounted value of the successor state \(s'\) (where you end up).
Example: verify the Bellman expectation equation by computing \(v_{\pi}(s)\) for \(s = C3\).
The Bellman expectation equation can be expressed concisely using the induced MRP (as before),
\[ v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \]
with direct solution
\[ v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi \]
The Bellman equation gives us a description of the system that we can solve directly.
Essentially we average over the policy and then compute a matrix inverse, which is exact but inefficient for large state spaces.
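For example, a minimal sketch of this direct solve in numpy (the function name is illustrative; combined with `induced_mrp` above it evaluates any stochastic policy exactly):

```python
import numpy as np

def evaluate_policy_direct(P_pi, R_pi, gamma):
    """Solve v = R^pi + gamma P^pi v, i.e. (I - gamma P^pi) v = R^pi, with a single
    linear solve.  Exact, but O(|S|^3), so only feasible for small state spaces."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```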
Definition
The optimal state-value function \(v_\ast(s)\) is the maximum value function over all policies
\[ v_\ast(s) = \max_{\pi} v_\pi(s) \]
The optimal action-value function \(q_\ast(s, a)\) is the maximum action-value function over all policies
\[ q_\ast(s, a) = \max_{\pi} q_\pi(s, a) \]
The optimal value function specifies the best possible performance in the MDP.
An MDP is “solved” when we know the optimal value function.
If you know \(q_\ast\), you have the optimal value function, so solving the MDP means finding \(q_\ast\).
\(v_\ast\) gives us the value of each state \(s\) (but not how to behave).
\(q_\ast\) gives us the best action \(a\) for each state \(s\) (so we can choose how to act).
Define a partial ordering over policies
\[ \pi \;\geq\; \pi' \quad \text{if } v_\pi(s) \;\geq\; v_{\pi'}(s), \;\forall s \]
Theorem For any Markov Decision Process
There exists an optimal policy \(\pi_\ast\) that is better than or equal to all other policies, \(\pi_\ast \geq \pi, \;\forall \pi\)
All optimal policies achieve the optimal value function, \(v_{\pi_\ast}(s) = v_\ast(s)\)
All optimal policies achieve the optimal action-value function, \(q_{\pi_\ast}(s,a) = q_\ast(s,a)\)
An optimal policy can be found by maximising over \(q_\ast(s,a)\),
\[ \pi_\ast(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\arg\max}\; q_\ast(s,a) \\[6pt] 0 & \text{otherwise} \end{cases} \]
Red arcs (actions) represent the optimal policy: in each state it picks the action with the highest \(q_\ast\).
The optimal value functions are recursively related by the Bellman optimality equations:
\[ v_\ast(s) = \max\limits_{a} q_\ast(s,a) \]
Working backwards using backup diagrams we get \(v_\ast(s)\): take the best action, i.e. a max over actions instead of an average over the policy.
\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \]
This considers where the environment might take us, averaging over transitions (looking ahead) and backing up the values (inductively).
Bringing it together (two-step look ahead): agent actions (open circles), environment actions (closed circles)
\[ v_\ast(s) = \max\limits_{a} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\ast(s') \right) \]
\[ q_\ast(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, \max\limits_{a'} q_\ast(s',a') \]
This determines \(q_\ast\), reordering the two-step lookahead from the environment's perspective.
Example: compute \(v_{\ast}(s)\) for \(s = C1\) by looking one step ahead (no environment actions in \(C1\)).
The Bellman optimality equation is non-linear and has no closed-form solution in general.
There are many iterative solution methods, e.g. value iteration, policy iteration, Q-learning, Sarsa.
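As a concrete illustration of one such method, here is a minimal value-iteration sketch in numpy (the array layout follows the earlier sketches and is an assumption, not part of the slides). The returned greedy policy is obtained exactly as in the definition of \(\pi_\ast\) above.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup v <- max_a (R^a + gamma P^a v).
    Assumes P has shape (A, S, S) and R has shape (A, S)."""
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * (P @ v)      # q[a, s] = R^a_s + gamma sum_s' P^a_{s s'} v(s')
        v_new = q.max(axis=0)        # max over actions (the non-linear step)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)   # v_* and a greedy (optimal) policy
        v = v_new
```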
Last Module: Learning & Relaxation
This Module: learn a model directly from experience,
and integrate learning and planning into a single architecture.
Model-Free RL
No model
Learn value function (and/or policy) from experience
Model-Based RL
Learn a model from experience
Plan value function (and/or policy) from model
Lookahead by planning (or thinking) about what the value function will be
Replace real world with the agent’s (simulated) model of the environment
Advantages:
Can efficiently learn model by supervised learning methods
Can reason about model uncertainty, and even take actions to reduce uncertainty
Disadvantages:
First learn a model, then construct a value function \(\;\Rightarrow\;\) two sources of approximation error
A model \(\mathcal{M}\) is a representation of an MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle\), parameterised by \(\eta\)
So a model \(\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\) represents state transitions \(\mathcal{P}_\eta \approx \mathcal{P}\) and rewards \(\mathcal{R}_\eta \approx \mathcal{R}\):
\[ S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t) \]
\[ R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t) \]
Typically assume conditional independence between state transitions and rewards
\[ \mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] \;=\; \mathbb{P}[S_{t+1} \mid S_t, A_t] \; \mathbb{P}[R_{t+1} \mid S_t, A_t] \]
Note you can learn from each (one-step) transition, treating the following step as the supervisor for the prior step.
Goal: estimate model \(\mathcal{M}_\eta\) from experience \(\{S_1, A_1, R_2, \ldots, S_T\}\)
This is a supervised learning problem
\[ \begin{aligned} S_1, A_1 &\;\to\; R_2, S_2 \\ S_2, A_2 &\;\to\; R_3, S_3 \\ &\;\vdots \\ S_{T-1}, A_{T-1} &\;\to\; R_T, S_T \end{aligned} \]
Learning \(s,a \to r\) is a regression problem
Learning \(s,a \to s'\) is a density estimation problem
Pick loss function, e.g. mean-squared error, KL divergence, …
Find parameters \(\eta\) that minimise empirical loss
Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
\(\cdots\) almost any supervised learning model
The model is an explicit MDP, \(\hat{\mathcal{P}}, \hat{\mathcal{R}}\), where \(N(s,a)\) counts the visits to each state–action pair:
\[ \begin{align*} \hat{\mathcal{P}}^{a}_{s,s'} & = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')\\[0pt] \hat{\mathcal{R}}^{a}_{s} & = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(S_t, A_t = s, a)\, R_{t+1} \end{align*} \]
Two states \(A,B\); no discounting; \(8\) episodes of experience
\(A, 0, B, 0\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 0\)
We have constructed a table lookup model from the experience
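As a sketch (not part of the slides), the same table-lookup model can be reconstructed in a few lines of Python; a single implicit action is assumed, matching the example above.

```python
from collections import defaultdict

# Episodes from the A/B example: lists of (state, reward) pairs.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

counts = defaultdict(int)        # N(s)
next_counts = defaultdict(int)   # N(s -> s')
reward_sum = defaultdict(float)  # sum of rewards observed after leaving s

for ep in episodes:
    for i, (s, r) in enumerate(ep):
        counts[s] += 1
        reward_sum[s] += r
        s_next = ep[i + 1][0] if i + 1 < len(ep) else 'terminal'
        next_counts[(s, s_next)] += 1

P_hat = {k: n / counts[k[0]] for k, n in next_counts.items()}
R_hat = {s: reward_sum[s] / counts[s] for s in counts}
print(P_hat)   # P(B | A) = 1.0, P(terminal | B) = 1.0
print(R_hat)   # R(A) = 0.0, R(B) = 0.75
```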
Satisficing Planning
Input: A planning task \(P\).
Output: A plan for \(P\), or unsolvable if no plan for \(P\) exists.
Optimal Planning
Input: A planning task \(P\).
Output: An optimal plan for \(P\), or unsolvable if no plan for \(P\) exists.
\(\rightarrow\) The techniques successful for either one of these are almost disjoint!
\(\rightarrow\) Satisficing planning is much more effective in practice
\(\rightarrow\) Programs solving these problems are called planners, planning systems, or planning tools.
Definition (PlanEx). By PlanEx, we denote the problem of deciding, given a planning task \(P\), whether or not there exists a plan for \(P\).
\(\rightarrow\) Corresponds to satisficing planning.
Definition (PlanLen). By PlanLen, we denote the problem of deciding, given a planning task \(P\) and an integer \(B\), whether or not there exists a plan for \(P\) of length at most \(B\).
\(\rightarrow\) Corresponds to optimal planning.
NP: Decision problems for which there exists a non-deterministic Turing machine that runs in time polynomial in the size of its input.
PSPACE: Decision problems for which there exists a deterministic Turing machine that runs in space polynomial in the size of its input.
Relationship between classes: Non-deterministic polynomial space can be simulated in deterministic polynomial space.
For details see a computational complexity textbook, such as Garey and Johnson (1979).
Theorem. PlanEx and PlanLen are PSPACE-complete.


| blocks | states |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 3 | 13 |
| 4 | 73 |
| 5 | 501 |
| 6 | 4051 |
| 7 | 37633 |
| 8 | 394353 |
| 9 | 4596553 |
| 10 | 58941091 |
| 11 | 824073141 |
| 12 | 12470162233 |
| 13 | 202976401213 |
| 14 | 3535017524403 |
| 15 | 65573803186921 |
| 16 | 1290434218669921 |
\(\rightarrow\) State spaces may be huge. In particular, the state space is typically exponentially large in the size of its specification via the problem \(\Pi\) (up next).
\(\rightarrow\) In other words: search problems typically are computationally hard (e.g., optimal Blocksworld solving is NP-complete).
How do we know that value iteration converges to \(v_*\)?
Or that iterative policy evaluation converges to \(v_{\pi}\)?
And therefore that policy iteration converges to \(v_*\)?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem
Consider the vector space \(\mathcal{V}\) over value functions
There are \(|\mathcal{S}|\) dimensions
Each point in this space fully specifies a value function \(v(s)\)
What does a Bellman backup do to points in this space?
We will show that it brings value functions closer,
\(\ldots\) and therefore the backups must converge on a unique solution
We will measure distance between state-value functions \(u\) and \(v\) by the \(\infty\)-norm
i.e. the largest difference between state values,
\[ \|u - v\|_{\infty} = \max_{s \in \mathcal{S}} |u(s) - v(s)| \]
Define the Bellman expectation backup operator \(T^{\pi}\)
\[ T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v \]
This operator is a \(\gamma\)-contraction, i.e. it brings value functions closer by at least a factor of \(\gamma\):
\[ \begin{align*} \|T^{\pi}(u) - T^{\pi}(v)\|_{\infty} &= \|(\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} u) - (\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v)\|_{\infty} \\[4pt] &= \|\gamma \mathcal{P}^{\pi} (u - v)\|_{\infty} \\[4pt] &\le \|\gamma \mathcal{P}^{\pi}\|_{\infty} \, \|u - v\|_{\infty} \\[4pt] &\le \gamma \, \|u - v\|_{\infty} \end{align*} \]
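A quick numerical sanity check of this contraction property (a sketch with random, made-up dynamics; it illustrates the inequality, it is not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)    # make P^pi row-stochastic
R_pi = rng.random(n)
T = lambda v: R_pi + gamma * P_pi @ v      # Bellman expectation backup T^pi

u, v = rng.random(n), rng.random(n)
lhs = np.max(np.abs(T(u) - T(v)))          # ||T^pi u - T^pi v||_inf
rhs = gamma * np.max(np.abs(u - v))        # gamma ||u - v||_inf
assert lhs <= rhs + 1e-12                  # the contraction property holds
```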
Theorem (Contraction Mapping Theorem)
For any metric space \(\mathcal{V}\) that is complete (i.e. closed) under an operator \(T(v)\), where \(T\) is a \(\gamma\)-contraction,
\(T\) converges to a unique fixed point
At a linear convergence rate of \(\gamma\)
The Bellman expectation operator \(T^{\pi}\) has a unique fixed point
\(v_{\pi}\) is a fixed point of \(T^{\pi}\) (by the Bellman expectation equation)
By the contraction mapping theorem
Iterative policy evaluation converges on \(v_{\pi}\)
Policy iteration converges on \(v_{*}\)
Define the Bellman optimality backup operator \(T^{*}\),
\[ T^{*}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right) \]
This operator is a \(\gamma\)-contraction,
i.e. it brings value functions closer by at least a factor of \(\gamma\)
(similar to previous proof)
\[ \|T^{*}(u) - T^{*}(v)\|_{\infty} \le \gamma \|u - v\|_{\infty} \]
The Bellman optimality operator \(T^*\) has a unique fixed point
\(v_*\) is a fixed point of \(T^*\) (by Bellman optimality equation)
By contraction mapping theorem
Value iteration converges on \(v_*\)
When \(\lambda = 0\), only the current state is updated
\[ \begin{align*} E_t(s) & = \mathbf{1}(S_t = s)\\[0pt] V(s) & \leftarrow V(s) + \alpha \delta_t E_t(s) \end{align*} \]
This is exactly equivalent to the TD(\(0\)) update
\[ V(S_t) \leftarrow V(S_t) + \alpha \delta_t \]
When \(\lambda = 1\), credit is deferred until the end of the episode
Consider episodic environments with offline updates
Over the course of an episode, total update for TD(\(1\)) is the same as total update for MC
Theorem
The sum of offline updates is identical for forward-view and backward-view TD(\(\lambda\))
\[ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \sum_{t=1}^{T} \alpha \left( G_t^{\lambda} - V(S_t) \right) \mathbf{1}(S_t = s) \]
Consider an episode where \(s\) is visited once at time-step \(k\)
\[ \begin{align*} E_t(s) & = \gamma E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ \gamma^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]
By the end of the episode, TD(\(1\)) has accumulated the total error
\[ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha \left(G_k - V(S_k)\right), \]
since the discounted sum of TD errors
\[ \delta_k + \gamma \delta_{k+1} + \gamma^2 \delta_{k+2} + \dots + \gamma^{T-1-k} \delta_{T-1} \]
telescopes into the MC error \(G_k - V(S_k)\), as shown next.
When \(\lambda = 1\), the sum of TD errors telescopes into the Monte Carlo (MC) error:
\[ \begin{align*} & \delta_t + \gamma \delta_{t+1} + \gamma^{2}\delta_{t+2} + \dots + \gamma^{T-1-t}\delta_{T-1}\\[0pt] & =\; R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\\[0pt] & +\; \gamma R_{t+2} + \gamma^{2} V(S_{t+2}) - \gamma V(S_{t+1})\\[0pt] & +\; \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3}) - \gamma^{2} V(S_{t+2})\\[0pt] & \quad \quad \quad \vdots\\[0pt] & +\; \gamma^{T-1-t} R_T + \gamma^{T-t} V(S_T) - \gamma^{T-1-t} V(S_{T-1})\\[0pt] & =\; R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots + \gamma^{T-1-t} R_T - V(S_t)\\[0pt] & =\; G_t - V(S_t) \end{align*} \]
TD(\(1\)) is roughly equivalent to every-visit Monte-Carlo
Error is accumulated online, step-by-step
If the value function is only updated offline, at the end of the episode,
Then total update is exactly the same as MC
For general \(\lambda\), TD errors also telescope to the \(\lambda\)-error, \(G_t^{\lambda} - V(S_t)\)
\[ \begin{align*} G_t^{\lambda} - V(S_t) &= -V(S_t) + (1-\lambda)\lambda^{0}\left(R_{t+1} + \gamma V(S_{t+1})\right) \\[0pt]
&\quad + (1-\lambda)\lambda^{1}\left(R_{t+1} + \gamma R_{t+2} + \gamma^{2} V(S_{t+2})\right) \\[0pt]
&\quad + (1-\lambda)\lambda^{2}\left(R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3})\right) + \dots \\[0pt]
&= -V(S_t) + (\gamma\lambda)^{0}\left(R_{t+1} + \gamma V(S_{t+1}) - \gamma\lambda V(S_{t+1})\right) \\[0pt]
&\quad + (\gamma\lambda)^{1}\left(R_{t+2} + \gamma V(S_{t+2}) - \gamma\lambda V(S_{t+2})\right) \\[0pt]
&\quad + (\gamma\lambda)^{2}\left(R_{t+3} + \gamma V(S_{t+3}) - \gamma\lambda V(S_{t+3})\right) + \dots \\[0pt]
&= (\gamma\lambda)^{0}\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right) \\[0pt]
&\quad + (\gamma\lambda)^{1}\left(R_{t+2} + \gamma V(S_{t+2}) - V(S_{t+1})\right) \\[0pt]
&\quad + (\gamma\lambda)^{2}\left(R_{t+3} + \gamma V(S_{t+3}) - V(S_{t+2})\right) + \dots \\[0pt]
&= \delta_t + \gamma\lambda\, \delta_{t+1} + (\gamma\lambda)^{2} \delta_{t+2} + \dots \end{align*} \]
Consider an episode where \(s\) is visited once at time-step \(k\)
TD(\(\lambda\)) eligibility trace discounts time since visit:
\[ \begin{align*} E_t(s) & = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ (\gamma \lambda)^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]
Backward TD(\(\lambda\)) updates accumulate error online:
\[ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T} (\gamma \lambda)^{t-k} \delta_t = \alpha \left(G_k^{\lambda} - V(S_k)\right) \]
By the end of the episode, it accumulates total error for the \(\lambda\)-return
For multiple visits to \(s\), \(E_t(s)\) accumulates many errors
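A minimal tabular sketch of the backward view described above (the function name and the transition-list format are assumptions of this sketch). Setting `lam = 0` recovers the TD(0) update; `lam = 1` with the updates applied offline accumulates the MC error, matching the theorem above.

```python
def td_lambda_backward(episode, V, alpha, gamma, lam):
    """Backward-view TD(lambda) with accumulating eligibility traces (tabular case).
    `episode` is a list of (s, r, s_next) transitions, with s_next = None at the
    terminal step; V is a dict of state values, updated online."""
    E = {s: 0.0 for s in V}                    # eligibility traces, one per state
    for s, r, s_next in episode:
        v_next = V[s_next] if s_next is not None else 0.0
        delta = r + gamma * v_next - V[s]      # TD error delta_t
        E[s] += 1.0                            # bump the trace of the visited state
        for x in V:                            # update every state in proportion
            V[x] += alpha * delta * E[x]       # ... to its trace
            E[x] *= gamma * lam                # decay all traces
    return V
```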
Offline updates
Updates are accumulated within episode
but applied in batch at the end of episode
Online updates
TD(\(\lambda\)) updates are applied online at each step within episode
Forward and backward-view TD(\(\lambda\)) are slightly different
Exact online TD(\(\lambda\)) achieves perfect equivalence
By using a slightly different form of eligibility trace
Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton; True Online Temporal-Difference Learning. Journal of Machine Learning Research (JMLR), 17(145):1−40, 2016.
\[ \begin{array}{|c|c|c|c|} \hline {\color{red}{\text{Offline updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & || & || & || \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] \hline {\color{red}{\text{Online updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & || & \neq & \neq \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] & || & || & || \\[0pt] \text{Exact Online} & \text{TD(0)} & \text{Exact Online TD}(\lambda) & \text{Exact Online TD(1)} \\ \hline \end{array} \]
Here \(=\) indicates equivalence in the total update at the end of the episode.
Experience replay finds the least squares solution
Using the special case of linear value function approximation
\[ \hat{v}(s,\mathbf{w}) = x(s)^\top \mathbf{w} \]
We can find the least squares solution directly in closed form.
At the minimum of \(LS(\mathbf{w})\), the expected update must be zero:
\[\begin{align*} \mathbb{E}_{\mathcal{D}} \left[ \Delta \mathbf{w} \right] &= 0\;\;\; \text{(want update zero across data set)} \\[0pt] \alpha \sum_{t=1}^{T} \mathbf{x}(s_t) \bigl( v_t^{\pi} - \mathbf{x}(s_t)^{\top} \mathbf{w} \bigr) &= 0\;\;\; \text{(unwrapping expected updates)}\\[0pt] \sum_{t=1}^{T} \mathbf{x}(s_t) v_t^{\pi} &= \sum_{t=1}^{T} \mathbf{x}(s_t) \mathbf{x}(s_t)^{\top} \mathbf{w} \\[0pt] \mathbf{w} &= \left( \sum_{t=1}^{T} \mathbf{x}(s_t) \mathbf{x}(s_t)^{\top} \right)^{-1} \sum_{t=1}^{T} \mathbf{x}(s_t) v_t^{\pi} \end{align*}\]
Compute time: \(O(N^3)\) for \(N\) features (direct),
or incremental \(O(N^2)\) using Sherman–Morrison.
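A sketch of that incremental \(O(N^2)\) update, maintaining the inverse with the Sherman–Morrison identity (the class name and `eps` regulariser are illustrative assumptions of this sketch):

```python
import numpy as np

class RecursiveLeastSquares:
    """Maintain w = A^{-1} b with A = sum_t x_t x_t^T and b = sum_t x_t v_t,
    updating A^{-1} in O(N^2) per sample via the Sherman-Morrison identity.
    `eps` regularises the initial matrix so that it is invertible."""
    def __init__(self, n_features, eps=1e-3):
        self.A_inv = np.eye(n_features) / eps
        self.b = np.zeros(n_features)

    def update(self, x, target):
        Ax = self.A_inv @ x                                  # A^{-1} x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)      # now (A + x x^T)^{-1}
        self.b += x * target
        return self.A_inv @ self.b                           # current weights w
```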
We do not know true values \(v_t^\pi\)
\[\begin{align*} {\color{blue}{\text{LSMC}}} \;\;\;\;&\text{Least Squares Monte-Carlo uses return}\\[0pt] & v_t^\pi \approx {\color{red}{G_t}} \\[0pt] {\color{blue}{\text {LSTD}}} \;\;\;\; & \text{Least Squares Temporal-Difference uses TD target}\\[0pt] & v_t^\pi \approx {\color{red}{R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})}}\\[0pt] {\color{blue}{\text{LSTD}}}{\color{blue}{(\lambda)}}\;\;\;\; & \text{Least Squares TD}(\lambda)\text{ uses}\;\lambda\text{-return} \\[0pt] & v_t^\pi \approx {\color{red}{G_t^\lambda}} \end{align*}\]
In each case we can solve directly for the fixed point of MC / TD / TD(\(\lambda\))
\[\begin{align*} {\color{red}{\text{LSMC}}} \;\;\;\; 0 & = \sum_{t=1}^T \alpha \big(G_t - \hat{v}(S_t,\mathbf{w})\big)\, \textbf{x}(S_t),\qquad\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T \textbf{x}(S_t) \textbf{x}(S_t)^\top \right)^{-1} \left(\sum_{t=1}^T \textbf{x}(S_t) G_t \right)\\[0pt] {\color{red}{\text{LSTD}}} \;\;\;\; 0 & = \sum_{t=1}^T \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\, \textbf{x}(S_t)\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T \textbf{x}(S_t)\big(x(S_t) - \gamma \textbf{x}(S_{t+1})\big)^\top \right)^{-1} \left(\sum_{t=1}^T \textbf{x}(S_t) R_{t+1}\right)\\[0pt] {\color{red}{\text{LSTD}}}{\color{red}{(\lambda)}}\;\;\;\; 0 & = \sum_{t=1}^T \alpha\, \delta_t\, E_t,\qquad\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T E_t \big(\textbf{x}(S_t) - \gamma \textbf{x}(S_{t+1})\big)^\top \right)^{-1} \left(\sum_{t=1}^T E_t R_{t+1}\right) \end{align*}\]
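For instance, a minimal batch LSTD sketch that builds and solves the fixed-point system above (the transition-list format and `phi` feature function are assumptions of this sketch; in practice the matrix is usually regularised before solving):

```python
import numpy as np

def lstd(transitions, phi, n_features, gamma):
    """Batch LSTD: solve directly for the TD fixed point from stored transitions.
    `transitions` is a list of (s, r, s_next) with s_next = None at episode end;
    `phi(s)` returns the feature vector x(s)."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, r, s_next in transitions:
        x = phi(s)
        x_next = phi(s_next) if s_next is not None else np.zeros(n_features)
        A += np.outer(x, x - gamma * x_next)  # sum_t x(S_t)(x(S_t) - gamma x(S_{t+1}))^T
        b += x * r                            # sum_t x(S_t) R_{t+1}
    return np.linalg.solve(A, b)
```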
\[ \begin{array}{l l c c c} \hline \text{On/Off-Policy} & \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \text{On-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & {\color{red}{\textit{LSMC}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ & \textit{TD} & \checkmark & \checkmark & \text{✗} \\ & {\color{red}{\textit{LSTD}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ \hline \text{Off-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & {\color{red}{\textit{LSMC}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ & \textit{TD} & \checkmark & \text{✗} & \text{✗} \\ & {\color{red}{\textit{LSTD}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ \hline \end{array} \]
Policy evaluation: policy evaluation by least-squares Q-learning
Policy improvement: greedy policy improvement
Approximate action-value function \(q_\pi(s, a)\)
\[ \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w} \;\approx\; q_\pi(s, a) \]
Minimise least squares error between \(\hat{q}(s, a, \mathbf{w})\) and \(q_\pi(s, a)\)
from experience generated using policy \(\pi\)
consisting of \(\langle (state, action), value \rangle\) pairs
\[ \mathcal{D} = \Bigl\{ \langle (s_1, a_1), v_1^\pi \rangle,\; \langle (s_2, a_2), v_2^\pi \rangle,\; \ldots,\; \langle (s_T, a_T), v_T^\pi \rangle \Bigr\} \]
For policy evaluation, we want to efficiently use all experience
For control, we also want to improve the policy
The experience is generated from many policies
So to evaluate \(q_\pi(S,A)\) we must learn off-policy
We use the same idea as Q-learning:
Consider the following linear Q-learning update
\[\begin{align*} \delta &= R_{t+1} + \gamma \hat{q}(S_{t+1}, {\color{red}{\pi(S_{t+1})}}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \\[0.5em] \Delta \mathbf{w} &= \alpha \, \delta \, \mathbf{x}(S_t, A_t) \end{align*}\]
LSTDQ algorithm: solve for total update \(=\) zero
\[\begin{align*} 0 &= \sum_{t=1}^T \alpha \Big( R_{t+1} + \gamma \hat{q}(S_{t+1}, {\color{red}{\pi(S_{t+1})}}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \Big)\mathbf{x}(S_t, A_t) \\[1em] \mathbf{w} &= \Bigg( \sum_{t=1}^T \mathbf{x}(S_t, A_t)\big(\mathbf{x}(S_t, A_t) - \gamma \mathbf{x}(S_{t+1}, \pi(S_{t+1}))\big)^\top \Bigg)^{-1} \sum_{t=1}^T \mathbf{x}(S_t, A_t) R_{t+1} \end{align*}\]
The following pseudo-code uses LSTDQ for policy evaluation
It repeatedly re-evaluates experience \(\mathcal{D}\) with different policies
Least Squares Policy Iteration Algorithm (LSTDQ variant)
\[ \begin{aligned} &\text{function LSPI-TD}(\mathcal{D}, \pi_0) \\ &\quad \pi^{\prime} \leftarrow \pi_0 \\ &\quad \text{repeat} \\ &\qquad \pi \leftarrow \pi^{\prime} \\ &\qquad Q \leftarrow \text{LSTDQ}(\pi,\mathcal{D}) \\ &\qquad \text{for all } s \in S \text{ do} \\ &\qquad\quad \pi^{\prime}(s) \leftarrow \arg\max_{a \in A} Q(s,a) \\ &\qquad \text{end for} \\ &\quad \text{until } (\pi \approx \pi^{\prime}) \\ &\quad \text{return } \pi \\ &\text{end function} \end{aligned} \]
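A Python sketch of the same loop, with LSTDQ folded in as the policy-evaluation step (the data format, `phi` features and stopping rule are assumptions of this sketch, not the canonical implementation):

```python
import numpy as np

def lspi(D, phi, n_features, actions, gamma, n_iters=20, tol=1e-6):
    """Least-Squares Policy Iteration: repeatedly re-evaluate the greedy policy
    with LSTDQ on the same experience D = [(s, a, r, s_next), ...].
    `phi(s, a)` returns the joint feature vector x(s, a)."""
    w = np.zeros(n_features)
    for _ in range(n_iters):
        greedy = lambda s: max(actions, key=lambda a: phi(s, a) @ w)
        A = np.zeros((n_features, n_features))
        b = np.zeros(n_features)
        for s, a, r, s_next in D:                      # LSTDQ for the current greedy policy
            x = phi(s, a)
            x_next = (phi(s_next, greedy(s_next))
                      if s_next is not None else np.zeros(n_features))
            A += np.outer(x, x - gamma * x_next)
            b += x * r
        w_new = np.linalg.solve(A, b)
        if np.max(np.abs(w_new - w)) < tol:            # weights (policy) have converged
            return w_new
        w = w_new
    return w
```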
\[ \begin{array}{l c c c} \hline \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \textit{Monte-Carlo Control} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Sarsa} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Q-Learning} & \checkmark & \text{✗} & \text{✗} \\ \textit{\color{red}{LSPI}} & \checkmark & (\checkmark) & - \\ \hline \end{array} \]
(\(\checkmark\)) = chatters around near-optimal value function.
Consider the 50-state version of this problem (a bigger replica of this diagram)
Reward \(+1\) in states \(10\) and \(41\), \(0\) elsewhere
Optimal policy: R (\(1\)–\(9\)), L (\(10\)–\(25\)), R (\(26\)–\(41\)), L (\(42\)–\(50\))
Features: \(10\) evenly spaced Gaussians (\(\sigma = 4\)) for each action
Experience: \(10,000\) steps from random walk policy
Plots show LSPI iterations on a 50-state chain with a radial basis function approximator.
The colours represent two different actions of going left (blue) or right (red)
The plots show the true value function (dashed) and the approximate value function
You can see it converges to the optimal policy after only \(7\) iterations.
Plots of policy improvement over LSPI iterations (using same 50-state chain example).
Gradient ascent algorithms can follow any ascent direction
Also, a policy can often be reparametrised without changing action probabilities
For example, increasing the scores of all actions in a softmax policy by the same amount leaves the action probabilities unchanged
The vanilla gradient is sensitive to these reparametrisations
Policy Gradient with a Learned Critic
We want to estimate the true gradient \[ \nabla_\theta J(\theta) = \mathbb{E}_\pi \!\left[ \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a) \right] \]
But \(Q^\pi(s,a)\) is unknown — so we approximate it with a parametric critic \({\color{red}{Q_w(s,a)}}\)
Linear in the policy’s score function \[ Q_w(s,a) = w^\top \nabla_\theta \log \pi_\theta(a|s) \]
Weights chosen to minimise on-policy error \[ w^* = \arg\min_w \mathbb{E}_\pi \left[ \big(Q^\pi(s,a) - Q_w(s,a)\big)^2 \right] \]
If both compatibility conditions hold, then
\[ \nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \nabla_\theta \log \pi_\theta(a|s)\, Q_w(s,a) \right] \]
That is — the approximate critic yields the exact policy gradient.
Intuition
The critic lies in the same space as the policy’s gradient features
The actor and critic share a common geometry
The critic projects \(Q^\pi\) onto the policy’s tangent space
Variance reduction without bias — a perfect harmony between learning and estimation
The natural policy gradient is parametrisation independent
\[ \nabla^{\text{nat}}_{\theta} \pi_{\theta}(s,a) = G_{\theta}^{-1} \nabla_{\theta} \pi_{\theta}(s,a) \]
where \(G_{\theta} = \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(s,a)\, \nabla_{\theta} \log \pi_{\theta}(s,a)^\top \right]\) is the Fisher information matrix.
Uses compatible function approximation,
\[ \nabla_{\mathbf{w}} A_{\mathbf{w}}(s,a) = \nabla_{\theta} \log \pi_{\theta}(s,a) \]
So the natural policy gradient simplifies,
\[ \begin{align*} \nabla_{\theta} J(\theta) &= \mathbb{E}_{\pi_{\theta}} \big[ \nabla_{\theta} \log \pi_{\theta}(s,a) \; A^{\pi_{\theta}}(s,a) \big] \\[2pt] &= \mathbb{E}_{\pi_{\theta}} \Big[ \nabla_{\theta} \log \pi_{\theta}(s,a) \, \nabla_{\theta} \log \pi_{\theta}(s,a)^\top \Big] \mathbf{w} \\[2pt] &= G_{\theta} \, \mathbf{w}\\[2pt] {\color{red}{\nabla_{\theta}^{\text{nat}} J(\theta)}} & {\color{red}{= \mathbf{w}}} \end{align*} \]
i.e. update actor parameters in direction of critic parameters \(\mathbf{w}\)
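A small sketch of what compatible features look like for a softmax policy, and how the natural-gradient actor update then reduces to a step along \(\mathbf{w}\) (all names and the feature layout here are illustrative assumptions, not from the slides):

```python
import numpy as np

def softmax_probs(theta, features):
    """Action probabilities of a softmax policy with scores theta^T x(s, a);
    `features` is an (A, N) array of per-action features for the current state."""
    prefs = features @ theta
    prefs -= prefs.max()                 # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def score_function(theta, features, a):
    """grad_theta log pi_theta(a|s) for the softmax policy: x(s, a) minus the
    policy-averaged feature vector.  A compatible critic uses exactly these
    vectors as its features, e.g. A_w(s, a) = w^T score."""
    probs = softmax_probs(theta, features)
    return features[a] - probs @ features

# With a compatible critic, the natural-gradient actor update collapses to moving
# the policy parameters in the direction of the critic parameters:
#   theta = theta + alpha * w        (alpha is a hypothetical step size)
```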