Appendix (Reference Topics)
Appendix (Reference Topics)
Slides
This module is also available in the following versions
Categorizing agents
Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
- Policy
- Value Function
Categorizing agents (continued)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
Example Taxonomy

Computational Complexity of Planning
Algorithmic Problems in Planning
\(\rightarrow\) The techniques successful for either one of these are almost disjoint!
\(\rightarrow\) Satisficing planning is much more effective in practice
\(\rightarrow\) Programs solving these problems are called planners, planning systems, or planning tools.
Decision Problems in Planning
Definition (PlanEx). By PlanEx, we denote the problem of deciding, given a planning task \(P\), whether or not there exists a plan for \(P\).
\(\rightarrow\) Corresponds to satisficing planning.
Definition (PlanLen). By PlanLen, we denote the problem of deciding, given a planning task \(P\) and an integer \(B\), whether or not there exists a plan for \(P\) of length at most \(B\).
\(\rightarrow\) Corresponds to optimal planning.
NP & PSPACE Classes (Refresher)
NP: Decision problems for which there exists a non-deterministic Turing machine that runs in time polynomial in the size of its input.
- Accepts if at least one of the possible runs accepts.
PSPACE: Decision problems for which there exists a deterministic Turing machine that runs in space polynomial in the size of its input.
Relationship between classes: Non-deterministic polynomial space can be simulated in deterministic polynomial space.
- Thus PSPACE = NPSPACE, and hence (trivially) NP \(\subset\) PSPACE.
For details see a computational complexity textbook, such as Garey and Johnson (1979).
Computational Complexity of PlanEx and PlanLen
Theorem. PlanEx and PlanLen are PSPACE-complete.
- “At least as hard as any other problem contained in PSPACE.”
- Details: Bylander (1994).
Domain-Specific PlanEx vs. PlanLen
- In general, both have the same complexity.
- Within particular applications, bounded length plan existence is often harder than plan existence.
- This happens in many IPC benchmark domains: PlanLen is NP-complete while PlanEx is in P.
- For example: Blocksworld and Logistics.
- In practice, optimal planning is (almost) never “easy”.
Is Blocksworld Hard?

Blocksworld is Hard!

Why All the Fuss? Example Blocksworld
- \(n\) blocks, \(1\) hand.
- A single action either takes a block with the hand or puts a block we’re holding onto some other block/the table.
State Space Growth in Blocksworld
| blocks | states |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 3 | 13 |
| 4 | 73 |
| 5 | 501 |
| 6 | 4051 |
| 7 | 37633 |
| 8 | 394353 … |
| blocks | states |
|---|---|
| 9 | 4596553 |
| 10 | 58941091 |
| 11 | 824073141 |
| 12 | 12470162233 |
| 13 | 202976401213 |
| 14 | 3535017524403 |
| 15 | 65573803186921 |
| 16 | 1290434218669921 |
\(\rightarrow\) State spaces may be huge. In particular, the state space is typically exponentially large in the size of its specification via the problem \(\Pi\) (up next).
\(\rightarrow\) In other words: search problems typically are computationally hard (e.g., optimal Blocksworld solving is NP-complete).
Convergence & Contraction Mapping Theorem
Convergence?
How do we know that value iteration converges to \(v_*\)?
Or that iterative policy evaluation converges to \(v_{\pi}\)?
And therefore that policy iteration converges to \(v_*\)?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by contraction mapping theorem
Value Function Space
Consider the vector space \(\mathcal{V}\) over value functions
There are \(|\mathcal{S}|\) dimensions
Each point in this space fully specifies a value function \(v(s)\)
What does a Bellman backup do to points in this space?
We will show that it brings value functions closer,
\(\ldots\) and therefore the backups must converge on a unique solution
Value Function \(\infty\)-Norm
We will measure distance between state-value functions \(u\) and \(v\) by the \(\infty\)-norm
i.e. the largest difference between state values,
\[ \|u - v\|_{\infty} = \max_{s \in \mathcal{S}} |u(s) - v(s)| \]
Bellman Expectation Backup is a Contraction
Define the Bellman expectation backup operator \(T^{\pi}\)
\[ T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v \]
This operator is a \(\gamma\)-contraction, i.e. it makes value functions closer by at least \(\gamma\):
\[ \begin{align*} \|T^{\pi}(u) - T^{\pi}(v)\|_{\infty} &= \|(\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} u) - (\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v)\|_{\infty} \\[4pt] &= \|\gamma \mathcal{P}^{\pi} (u - v)\|_{\infty} \\[4pt] &\le \|\gamma \mathcal{P}^{\pi}\|_{\infty} \, \|u - v\|_{\infty} \\[4pt] &\le \gamma \, \|u - v\|_{\infty} \end{align*} \]
Contraction Mapping Theorem
Convergence of Policy Evaluation and Policy Iteration
The Bellman expectation operator \(T^{\pi}\) has a unique fixed point
\(v_{\pi}\) is a fixed point of \(T^{\pi}\) (by the Bellman expectation equation)
By the contraction mapping theorem
Iterative policy evaluation converges on \(v_{\pi}\)
Policy iteration converges on \(v_{*}\)
Bellman Optimality Backup is a Contraction
Define the Bellman optimality backup operator \(T^{*}\),
\[ T^{*}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right) \]
This operator is a \(\gamma\)-contraction,
i.e. it makes value functions closer by at least \(\gamma\)
(similar to previous proof)
\[ \|T^{*}(u) - T^{*}(v)\|_{\infty} \le \gamma \|u - v\|_{\infty} \]
Convergence of Value Iteration
The Bellman optimality operator \(T^*\) has a unique fixed point
\(v_*\) is a fixed point of \(T^*\) (by Bellman optimality equation)
By contraction mapping theorem
Value iteration converges on \(v_*\)
Relationship Between Forward and Backward TD
TD(\(\lambda\)) and TD(\(0\))
When \(\lambda = 0\), only the current state is updated
\[ \begin{align*} E_t(s) & = \mathbf{1}(S_t = s)\\[0pt] V(s) & \leftarrow V(s) + \alpha \delta_t E_t(s) \end{align*} \]
This is exactly equivalent to the TD(\(0\)) update
\[ V(S_t) \leftarrow V(S_t) + \alpha \delta_t \]
TD(\(\lambda\)) and MC(\(0\))
When \(\lambda = 1\), credit is deferred until the end of the episode
Consider episodic environments with offline updates
Over the course of an episode, total update for TD(\(1\)) is the same as total update for MC
MC and TD(\(1\))
Consider an episode where \(s\) is visited once at time-step \(k\)
- TD(1) eligibility trace discounts time since visit:
\[ \begin{align*} E_t(s) & = \gamma E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ \gamma^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]
- TD(1) updates accumulate error online:
\[ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha (G_k - V(S_k)) \]
- By the end of the episode, it accumulates total error:
\[ \delta_k + \gamma \delta_{k+1} + \gamma^2 \delta_{k+2} + \dots + \gamma^{T-1-k} \delta_{T-1} \]
Telescoping in TD(\(1\))
When \(\lambda = 1\), the sum of TD errors telescopes into the Monte Carlo (MC) error:
\[ \begin{align*} & \delta_t + \gamma \delta_{t+1} + \gamma^{2}\delta_{t+2} + \dots + \gamma^{T-1-t}\delta_{T-1}\\[0pt] & =\; R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\\[0pt] & +\; \gamma R_{t+2} + \gamma^{2} V(S_{t+2}) - \gamma V(S_{t+1})\\[0pt] & +\; \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3}) - \gamma^{2} V(S_{t+2})\\[0pt] & \quad \quad \quad \vdots\\[0pt] & +\; \gamma^{T-1-t} R_T + \gamma^{T-t} V(S_T) - \gamma^{T-1-t} V(S_{T-1})\\[0pt] & =\; R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots + \gamma^{T-1-t} R_T - V(S_t)\\[0pt] & =\; G_t - V(S_t) \end{align*} \]
TD(\(\lambda\)) and TD(\(1\))
TD(\(1\)) is roughly equivalent to every-visit Monte-Carlo
Error is accumulated online, step-by-step
If value function is only updated online at end of episode
Then total update is exactly the same as MC
Telescoping in TD(\(\lambda\))
For general \(\lambda\), TD errors also telescope to the \(\lambda\)-error, \(G_t^{\lambda} - V(S_t)\)
\[ \begin{align*} & G_t^{\lambda}-V(S_t)\hspace{-15mm} &=-V(S_t) &+(1-\lambda)\lambda^{0}(R_{t+1} + \gamma V(S_{t+1}))\\[0pt] & & &+(1-\lambda)\lambda^{1}(R_{t+1}+\gamma R_{t+2}+\gamma^{2} V(S_{t+2})) \\[0pt] & & & + (1 - \lambda)\lambda^{2}(R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3}))\\[0pt] & & &+ \dots \\[0pt] & &= -V(S_t) &+(\gamma\lambda)^{0}(R_{t+1}+\gamma V(S_{t+1})-\gamma\lambda V(S_{t+1})) \\[0pt] & & & + (\gamma\lambda)^{1}(R_{t+2} + \gamma V(S_{t+2}) - \gamma\lambda V(S_{t+2})) \\[0pt] & & & + (\gamma\lambda)^{2}(R_{t+3} + \gamma V(S_{t+3}) - \gamma\lambda V(S_{t+3})) \\[0pt] & & & + \dots \\[0pt] & &= \quad \quad \quad \quad & \quad (\gamma\lambda)^{0}(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) \\[0pt] & & & + (\gamma\lambda)^{1}(R_{t+2} + \gamma V(S_{t+2}) - V(S_{t+1})) \\[0pt] & & & + (\gamma\lambda)^{2}(R_{t+3} + \gamma V(S_{t+3}) - V(S_{t+2})) \\[0pt] & & & + \dots \\[0pt] & & & \hspace{-37mm} = \delta_t + \gamma\lambda \delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \dots \end{align*} \]
Forwards and Backwards TD(\(\lambda\))
Consider an episode where \(s\) is visited once at time-step \(k\)
TD(\(\lambda\)) eligibility trace discounts time since visit:
\[ \begin{align*} E_t(s) & = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ (\gamma \lambda)^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]
Backward TD(\(\lambda\)) updates accumulate error online:
\[ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T} (\gamma \lambda)^{t-k} \delta_t = \alpha \left(G_k^{\lambda} - V(S_k)\right) \]
By the end of the episode, it accumulates total error for the \(\lambda\)-return
For multiple visits to \(s\), \(E_t(s)\) accumulates many errors
Offline Equivalence of Forward and Backward TD
Offline updates
Updates are accumulated within episode
but applied in batch at the end of episode
Online Equivalence of Forward and Backward TD
Online updates
TD(\(\lambda\)) updates are applied online at each step within episode
Forward and backward-view TD(\(\lambda\)) are slightly different
Exact online TD(\(\lambda\)) achieves perfect equivalence
By using a slightly different form of eligibility trace
Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton; True Online Temporal-Difference Learning. Journal of Machine Learning Research (JMLR), 17(145):1−40, 2016.
Summary of Forward and Backward TD(\(\lambda\))
\[ \begin{array}{|c|c|c|c|} \hline {\color{red}{\text{Offline updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & || & || & || \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] \hline {\color{red}{\text{Online updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & || & \neq & \neq \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] & || & || & || \\[0pt] \text{Exact Online} & \text{TD(0)} & \text{Exact Online TD}(\lambda) & \text{Exact Online TD(1)} \\ \hline \end{array} \]
= here indicates equivalence in total update at end of episode.
Linear Least Squares Methods
Linear Least Squares Prediction
Experience replay finds the least squares solution
- But may take many iterations.
Using the special case of linear value function approximation
\[ \hat{v}(s,\mathbf{w}) = x(s)^\top \mathbf{w} \]
We can solve the least squares solution directly using a closed form
Linear Least Squares Prediction (2)
At the minimum of \(LS(\mathbf{w})\), the expected update must be zero:
\[\begin{align*} \mathbb{E}_{\mathcal{D}} \left[ \Delta \mathbf{w} \right] &= 0\;\;\; \text{(want update zero across data set)} \\[0pt] \alpha \sum_{t=1}^{T} \mathbf{x}(s_t) \bigl( v_t^{\pi} - \mathbf{x}(s_t)^{\top} \mathbf{w} \bigr) &= 0\;\;\; \text{(unwrapping expected updates)}\\[0pt] \sum_{t=1}^{T} \mathbf{x}(s_t) v_t^{\pi} &= \sum_{t=1}^{T} \mathbf{x}(s_t) \mathbf{x}(s_t)^{\top} \mathbf{w} \\[0pt] \mathbf{w} &= \left( \sum_{t=1}^{T} \mathbf{x}(s_t) \mathbf{x}(s_t)^{\top} \right)^{-1} \sum_{t=1}^{T} \mathbf{x}(s_t) v_t^{\pi} \end{align*}\]
Compute time: \(O(N^3)\) for \(N\) features (direct),
or incremental \(O(N^2)\) using Sherman–Morrison.
- Sometimes better or cheaper than experience replay, e.g. if have only a few features for the linear cases.
Linear Least Squares Prediction Algorithms
We do not know true values \(v_t^\pi\)
- In practice, our “training data” must use noisy or biased samples of \(v^{\pi}_t\)
\[\begin{align*} {\color{blue}{\text{LSMC}}} \;\;\;\;&\text{Least Squares Monte-Carlo uses return}\\[0pt] & v_t^\pi \approx {\color{red}{G_t}} \\[0pt] {\color{blue}{\text {LSTD}}} \;\;\;\; & \text{Least Squares Temporal-Difference uses TD target}\\[0pt] & v_t^\pi \approx {\color{red}{R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})}}\\[0pt] {\color{blue}{\text{LSTD}}}{\color{blue}{(\lambda)}}\;\;\;\; & \text{Least Squares TD}(\lambda)\text{ uses}\;\lambda\text{-return} \\[0pt] & v_t^\pi \approx {\color{red}{G_t^\lambda}} \end{align*}\]
In each case we can solve directly for the fixed point of MC / TD / TD(\(\lambda\))
Linear Least Squares Prediction Algorithms (2)
\[\begin{align*} {\color{red}{\text{LSMC}}} \;\;\;\; 0 & = \sum_{t=1}^T \alpha \big(G_t - \hat{v}(S_t,\mathbf{w})\big)\, \textbf{x}(S_t),\qquad\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T \textbf{x}(S_t) \textbf{x}(S_t)^\top \right)^{-1} \left(\sum_{t=1}^T \textbf{x}(S_t) G_t \right)\\[0pt] {\color{red}{\text{LSTD}}} \;\;\;\; 0 & = \sum_{t=1}^T \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\, \textbf{x}(S_t)\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T \textbf{x}(S_t)\big(x(S_t) - \gamma \textbf{x}(S_{t+1})\big)^\top \right)^{-1} \left(\sum_{t=1}^T \textbf{x}(S_t) R_{t+1}\right)\\[0pt] {\color{red}{\text{LSTD}}}{\color{red}{(\lambda)}}\;\;\;\; 0 & = \sum_{t=1}^T \alpha\, \delta_t\, E_t,\qquad\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T E_t \big(\textbf{x}(S_t) - \gamma \textbf{x}(S_{t+1})\big)^\top \right)^{-1} \left(\sum_{t=1}^T E_t R_{t+1}\right) \end{align*}\]
Convergence of Linear Least Squares Prediction Algorithms
\[ \begin{array}{l l c c c} \hline \text{On/Off-Policy} & \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \text{On-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & {\color{red}{\textit{LSMC}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ & \textit{TD} & \checkmark & \checkmark & \text{✗} \\ & {\color{red}{\textit{LSTD}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ \hline \text{Off-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & {\color{red}{\textit{LSMC}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ & \textit{TD} & \checkmark & \text{✗} & \text{✗} \\ & {\color{red}{\textit{LSTD}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ \hline \end{array} \]
Least Squares Policy Iteration
Policy evaluation Policy evaluation by least-squares Q-learning
Policy improvement Greedy policy improvement
Least Squares Action-Value Function Approximation
Approximate action-value function \(q_\pi(s, a)\)
- using linear combination of features \(\mathbf{x}(s, a)\)
\[ \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w} \;\approx\; q_\pi(s, a) \]
Minimise least squares error between \(\hat{q}(s, a, \mathbf{w})\) and \(q_\pi(s, a)\)
from experience generated using policy \(\pi\)
consisting of \(\langle (state, action), value \rangle\) pairs
\[ \mathcal{D} = \Bigl\{ \langle (s_1, a_1), v_1^\pi \rangle,\; \langle (s_2, a_2), v_2^\pi \rangle,\; \ldots,\; \langle (s_T, a_T), v_T^\pi \rangle \Bigr\} \]
Least Squares Control
For policy evaluation, we want to efficiently use all experience
For control, we also want to improve the policy
The experience is generated from many policies
So to evaluate \(q_\pi(S,A)\) we must learn off-policy
We use the same idea as Q-learning:
- Experience from old policy: \((S_t, A_t, R_{t+1}, S_{t+1}) \sim \pi_{\text{old}}\)
- Consider alternative successor action \(A^{\prime} = \pi_{\text{new}}(S_{t+1})\)
- Update \(\hat{q}(S_t, A_t, \mathbf{w})\) toward
\[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A^{\prime}, \mathbf{w}) \]
Least Squares Q-Learning
Consider the following linear Q-learning update
\[\begin{align*} \delta &= R_{t+1} + \gamma \hat{q}(S_{t+1}, {\color{red}{\pi(S_{t+1})}}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \\[0.5em] \Delta \mathbf{w} &= \alpha \, \delta \, \mathbf{x}(S_t, A_t) \end{align*}\]
LSTDQ algorithm: solve for total update \(=\) zero
\[\begin{align*} 0 &= \sum_{t=1}^T \alpha \Big( R_{t+1} + \gamma \hat{q}(S_{t+1}, {\color{red}{\pi(S_{t+1})}}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \Big)\mathbf{x}(S_t, A_t) \\[1em] \mathbf{w} &= \Bigg( \sum_{t=1}^T \mathbf{x}(S_t, A_t)\big(\mathbf{x}(S_t, A_t) - \gamma \mathbf{x}(S_{t+1}, \pi(S_{t+1}))\big)^\top \Bigg)^{-1} \sum_{t=1}^T \mathbf{x}(S_t, A_t) R_{t+1} \end{align*}\]
Least Squares Policy Iteration Algorithm
The following pseudo-code uses LSTDQ for policy evaluation
It repeatedly re-evaluates experience \(\mathcal{D}\) with different policies
Convergence of Control Algorithms
\[ \begin{array}{l l c c c} \hline \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \textit{Monte-Carlo Control} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Sarsa} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Q-Learning} & \checkmark & \text{✗} & \text{✗} \\ \textit{\color{red}{LSPI}} & \checkmark & (\checkmark) & - \\ \hline \end{array} \]
(\(\checkmark\)) = chatters around near-optimal value function.
Chain Walk Example (More complicated random walk)
Consider the 50 state version of this problem (bigger replica of this diagram)
Reward \(+1\) in states \(10\) and \(41\), \(0\) elsewhere
Optimal policy: R (\(1-9\)), L (\(10-25\)), R (\(26-41\)), L (\(42, 50\))
Features: \(10\) evenly spaced Gaussians (\(\sigma = 4\)) for each action
Experience: \(10,000\) steps from random walk policy
LSPI in Chain Walk: Action-Value Function
Plots show LSPI iterations on a 50-state chain with a radial basis function approximator.
The colours represent two different actions of going left (blue) or right (red)
The plots show the true value function (dashed) or approximate value function
You can see it converges to the optimal policy after only \(7\) iterations.
LSPI in Chain Walk: Policy
Plots of policy improvement over LSPI iterations (using same 50-state chain example).
- Shows convergence of LSPI policy to optimal within a few iterations.
Natural Actor-Critic (Advanced Topic)
Alternative Policy Gradient Directions
Gradient ascent algorithms can follow any ascent direction
- A good ascent direction can significantly speed convergence
Also, a policy can often be reparametrised without changing action probabilities
For example, increasing score of all actions in a softmax policy
The vanilla gradient is sensitive to these reparametrisations
Compatible Function Approximation (CFA)
Policy Gradient with a Learned Critic
We want to estimate the true gradient \[ \nabla_\theta J(\theta) = \mathbb{E}_\pi \!\left[ \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a) \right] \]
But \(Q^\pi(s,a)\) is unknown — so we approximate it with a parametric critic \({\color{red}{Q_w(s,a)}}\)
Compatibility Conditions for Critic
Linear in the policy’s score function \[ Q_w(s,a) = w^\top \nabla_\theta \log \pi_\theta(a|s) \]
Weights chosen to minimise on-policy error \[ w^* = \arg\min_w \mathbb{E}_\pi \left[ \big(Q^\pi(s,a) - Q_w(s,a)\big)^2 \right] \]
Compatibility Conditions—A Beautiful Result
If both compatibility conditions hold, then
\[ \nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \nabla_\theta \log \pi_\theta(a|s)\, Q_w(s,a) \right] \]
That is — the approximate critic yields the exact policy gradient.
Intuition
The critic lies in the same space as the policy’s gradient features
The actor and critic share a common geometry
The critic projects \(Q^\pi\) onto the policy’s tangent space
Variance reduction without bias — a perfect harmony between learning and estimation
Natural Policy Gradient—Fisher Information Matrix
The natural policy gradient is parametrisation independent
- It finds the ascent direction that is closest to the vanilla gradient, when changing policy by a small, fixed amount
\[ \nabla^{\text{nat}}_{\theta} \pi_{\theta}(s,a) = G_{\theta}^{-1} \nabla_{\theta} \pi_{\theta}(s,a) \]
- where \(G_{\theta}\) is the Fisher information matrix (Measures how sensitive a probability distribution is to changes in its parameters) \[ G_{\theta} = \mathbb{E}_{\pi_{\theta}} \Big[ {\color{blue}{\nabla_{\theta} \log \pi_{\theta}(s,a) \; \nabla_{\theta} \log \pi_{\theta}(s,a)^\top }}\Big] \]
Natural Actor-Critic—Compatible Function Approximation
Uses compatible function approximation,
\[ \nabla_{\mathbf{w}} A_{\mathbf{w}}(s,a) = \nabla_{\theta} \log \pi_{\theta}(s,a) \]
So the natural policy gradient simplifies,
\[ \begin{align*} \nabla_{\theta} J(\theta) &= \mathbb{E}_{\pi_{\theta}} \big[ \nabla_{\theta} \log \pi_{\theta}(s,a) \; A^{\pi_{\theta}}(s,a) \big] \\[2pt] &= \mathbb{E}_{\pi_{\theta}} \Big[ \nabla_{\theta} \log \pi_{theta}(s,a) \, \nabla_{\theta} \log \pi_{\theta}(s,a)^\top \Big] \mathbf{w} \\[2pt] &= G_{\theta} \, \mathbf{w}\\[2pt] {\color{red}{\nabla_{\theta}^{\text{nat}} J(\theta)}} & {\color{red}{= \mathbf{w}}} \end{align*} \]
i.e. update actor parameters in direction of critic parameters \(\mathbf{w}\)
Natural Policy Gradient—Parameterisation Invariant
- The black contours show level sets of equal performance \(J(\theta)\)
- The blue arrows show the ordinary gradient ascent direction for each point in parameter space (not orthogonal to the contours, behave irregularly across space)
- The red arrows are the natural gradient directions (ascent directions are orthogonal to contours and smoothly distributed)
- Natural gradient respects geometry of probability distributions—it finds steepest ascent direction measured in policy (distributional) space, not raw parameter space
Natural Actor Critic in Snake Domain
- The dynamics are nonlinear and continuous—a good test of policy gradient methods.
Natural Actor Critic in Snake Domain (2)
- The geometry of the corridor causes simple gradient updates to oscillate or diverge, but natural gradients (which adapt to curvature via the Fisher matrix independent of parameters) make stable, efficient progress.
Proximal Policy Optimisation (PPO) (Advanced Topic)
Proximal Policy Optimisation (PPO)
Goal: Stable, sample-efficient policy improvement
Idea: Constrain how far the new policy moves from the old one at each update during actor-critic cycle
Policy Objective Functions (Revisited)
From a policy gradient perspective, the true objective function \(J^{{\pi}_{\theta}}(\theta)\) over policy \(\pi_{\theta}\) is defined as follows
\[ J^{{\pi}_{\theta}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [G_0]=V^{\pi \theta}(d_0) \]
where \(d_0\) is the distribution over initial states at the start of each episode \(\tau\)
However, there are two problems in the actor-critic setting:
1. Computing \(J^{{\pi}_{\theta}}(\theta)\) exactly would require integrating over all trajectories, \(\tau\), of the current policy, which is impractical
2. If we update the parameters \(\theta\), it will effect objective value during the optimisation process, leading to (circular) feedback
We therefore need a surrogate objective independent of the trajectory distribution under the new policy \(\color{blue}{\pi_{\theta}}\) we are building
Surrogate Objective
From the policy-gradient theorem, we can define the importance ratio
\[ r_t(\theta) \;=\; \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \qquad \]
We now define the surrogate objective \(L_{PG}\) for the true objective
\[ L_{PG}(\theta) \;=\; \mathbb{E}_t\!\big[\, r_t(\theta)\,\hat A_t \,\big] \]
Where \(\hat{A}_t\) captures how much better action \(a_t\) was than the state’s average
\(\hat{A}_t\) is an estimator of the true advantage function \(A^{\pi}\)
\(A^{\pi}(s_t,a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(S_t)\)
Clipped Surrogate (Core Idea)
Kullback-Leibler (KL) divergence theory tells us we want improvement without overly large steps in policy space, so we define
\[ L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\! \left[ \min\!\Big( r_t(\theta)\,\hat A_t,\; \mathrm{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat A_t \Big) \right] \]
If \(r_t(\theta)\) leaves the interval \([1-\epsilon,\,1+\epsilon]\), the objective is clipped.
Typical range for \(\epsilon \in [0.1,\,0.2]\).
Prevents destructive updates while preserving ascent direction
The clipped surrogate objective in PPO plays a similar stabilising role to compatible function approximation — both constrain policy updates so that gradient estimates remain accurate and unbiased with respect to the true policy improvement direction.
Complete PPO loss
\[ L^{\text{PPO}}(\theta) = \mathbb{E}_t\!\Big[{\color{blue}{L^{\text{CLIP}}(\theta)}} - c_1\,{\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}} + c_2\,\mathcal{H}\!\left[\pi_\theta(\cdot\mid s_t)\right] \Big] \] \(\;\;\;\;\;\;\;\;\;\;\;\;\;\;\)where \(c_1, c_2\) are coefficients
The actors policy gradient (surrogate objective) is \(\color{blue}{L^{CLIP}(\theta)}\)
- This encourages the policy to increase the probability of actions with positive advantage and decrease it for negative ones
The critics value function is \(\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}\)
- This trains the network to predict correct returns (mean-squared error).
The entropy bonus \(\mathcal{H}\) encourages exploration
\[ \mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big] = - \sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t) \]
The entropy term encourages exploration by rewarding stochastic (uncertain) policies.
It’s high when the policy is uncertain or “spread out” (exploratory).
It’s low when the policy is confident or deterministic.
The dot “\(\cdot\)” in \(\pi_{\theta}(\cdot | s_t)\) means over all possible actions, i.e. the vector of probabilities \(\pi_{\theta}(a_1,s_t), \pi_{\theta}(a_2,s_t), \ldots\)
In practice this maintains stochasticity until policy becomes more confident or deterministic
Generalised Advantage Estimation (GAE)
In practice, PPO uses a low-variance, low-bias estimate of the advantage \(A^\pi(s_t,a_t)\).
TD error: \[ \delta_t \;=\; r_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t) \]
GAE-\(\lambda\): \[ \hat A_t^{(\lambda)} \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l} \;=\; \delta_t + \gamma\lambda\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots \]
Return/target used for critic \[ \hat V_t^{\text{target}} \;=\; \hat A_t^{(\lambda)} + V_\phi(s_t) \]
- \(\lambda\in[0,1]\) trades bias \(\leftrightarrow\) variance, typical PPO: \(\gamma \approx 0.99\), \(\lambda \approx 0.95\).
PPO Algorithm
- Multiple epochs over the same batch are okay because clipping limits drift
Why PPO works (intuition)
First-order solution method
- No additional constraint solving required which can introduce second-order effects
Trust-region-like behaviour via clipping
- Optimisation within trust-region involves only taking steps that stay within a region where local approximation is reliable (founded on KL information theory)
Robust across discrete/continuous control
- Strong baseline performance in practice
PPO Fine tuning ChatGPT & human feedback (Revisited)
- Proximal policy optimisation (PPO) is used by ChatGPT 3.5
PPO in Agentic AI, Robotics & World Models
PPO is becoming popular again for Agentic AI modes, and is used in
Chat GPT’s Operator, and
Claude’s Computer Use modes
PPO is also becoming popular for World Models in robotics, and is used in
- Vision, Language & Action (VSA) attention-based transformers
Recent Policy Optimisation Techniques
ChatGPT4 uses Direct Preference Optimisation (DPO)
- Rafailov et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , arXiv:2305.18290v3, pp1-27, 2023
DeepSeek’s R1 uses Group Relative Policy Optimization (GRPO)
- Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv:2402.03300v3, pp1-30, 2024