12 Planning & RL for Transformers (Advanced Topic)
Slides
This module is also available in the following versions
This Module
Last Module: Deep learning & Tree Search
This Module: Planning & RL for Transformers
What is covered? How to implement planning and RL on GPU/TPUs.
Introduction to Attention-based Transformers
Attention-based Transformers & Implicit Planning
Attention-based transformers can simulate multiple futures inside a hidden state.
Self-attention makes lookahead possible:
- Attention learns which futures matter
- Transformers can simulate “multiple futures” inside the hidden state
Implicit planning is an amortised form of planning, i.e. pay the cost of planning upfront during training, so that at test time the system can act without running a full planner or search each time.
- Attention-based transformers have the capacity to use neural network model’s implicit planning capacity
The Elements of Attention-based Transformers
- Autoregression & joint distribution factorisation
- Masked latent transformers
- Attention-based transformers
- Causal masking
1. Autoregression & joint distribution factorisation
Autoregression models a sequence by predicting each element from all previous elements, i.e. \(x_t = f(x_{1:t-1}) + \epsilon\) where \(\epsilon\) is the noise or uncertainty. It factorises a joint distribution as follows:
\[ p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}) \]
where \(p(x_{1:T})\) is the probability of seeing the entire sequence.
Used in time series, sequence modelling, and autoregressive transformers.
At inference time, predictions are fed back step-by-step to generate sequences.
To enforce causal structure, transformers use masking (next step).
2. Causal latent transformers
A causal latent transformer is a transformer that models sequences of latent states, using attention masks to enforce causality or planning structure, just like in autoregressive language models.
- It is a transformer that predicts (or refines) latent variables in a sequence, but only using allowed past or partial information, enforced through a mask.
Causal latent transformers appear in world-model RL, video generation, and planning-as-generation frameworks (implicit planning in neural networks).
- They are increasingly used as a replacement for recurrent neural network (RNN) and latent dynamics models, such as RSSM used in Dreamer-like systems (Note that Dreamer or recursive approaches are not covered in this Module).
3. Attention-based transformers
Transformers model sequences using self-attention, where each token computes weighted interactions with other tokens.
Key components:
Token embeddings and positional encodings
Self-attention:
\[ \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V \]
where \(Q=\mathit{queries}\), \(K=\mathit{keys}\), \(V=\mathit{values}\) and \(d_k\) is the dimension of the key vectors (layers of the network) and \(\top\) corresponds to matrix transpose.
- Multi-head attention, feed forward layers, residual connections
4. Causal masking
Causal masking is used in autoregressive transformers:
Each token attends only to past tokens
Enforces the autoregressive condition
\[ x_t \sim p(x_{t+1} \mid x_{1:t-1}) \]
where \(\sim\) means sampled from.
This condition means that when predicting the next token, the model is only allowed to use earlier tokens, not future ones.
This masked attention mechanism directly carries over to masked latent transformers used in world-model RL.
Masked latent transformers & languages
The original Transformer was introduced as an attention-only sequence model, initially for natural language translation
- but its broader importance is that it has became a general computational approach for modelling structured representations.
- Masked latent transformers have been shown to be Turing complete.
Importantly, transformers have shown that multiple languages can be treated as sequences or structured token streams
- For example English (or any language), Python, Markdown/LaTeX (these slides), mathematical proofs, SQL, even multi-modal representations.
Visualising attention-based transformers
A useful visualisation approach, for understanding attention-based transformers is as follows:
Horizontal Axis (Tokens/Sequence): Each position corresponds to a specific token (word or sub-word) in your input.
Vertical Axis (Layers/Depth): As you move “up” the network layers, the model is building more abstract, contextual meanings. The bottom layers might focus on simple syntax, while higher layers handle complex relationships and intent.
Attention Intersections: The “links” connecting horizontal tokens across the vertical layers represent the Attention Weights, showing which other words the model is looking at to understand the current word.
Visualising attention-based transformers (continued)
This 3Blue1Brown video demonstrates the horizontal (tokens) / vertical (layers) visualisation.
Example: Masked latent transformers in LLMs
For large language models (LLMs) in natural language processing
Tokens = words
Mask = causal mask (can’t see the future)
In masked latent transformers:
Tokens = latent states \(z_t\) (learned hidden representations)
Mask = ensures correct temporal, causal, or planning structure
Example: LLMs for PDDL
Pretrained LLMs can also harness PDDL for generalised planning by being prompted with a PDDL domain and a small number of training problem instances.
A solution to generalised planning finds a reusable policy that solves many related problem instances, not just one specific initial state.
GPT-4 is used to synthesise and output a Python generalised planner, the program is tested and if it fails, GPT-4 is re-prompted with debugging feedback.
GPT-4 outperforms one of the leading domain-independent classical planner Fast Downward on generalised planning on benchmarks
GPT-4 is shown to be 1-2 orders of magnitude faster during inference but does not quite match coverage for benchmarks (misses a few). Although subsequent models are likely to improve this.
Silver et al. (2024) Generalized planning in PDDL domains with pretrained large language models https://ojs.aaai.org/index.php/AAAI/article/view/30006
Serialization of Reasoning: Chain of Thought
Serialization of Chain of Thought Reasoning
Reasoning in attention-based transformers is implicit because reasoning happens inside a serialized, single-token stream
- Reasoning tokens to represent a reasoning process capturing chain of thought reasoning steps
- Self-Reflection & Correction: Serialization facilitates self-reflection via specialised tokens (e.g., “wait,” “but”) which act as triggers for error-checking and hypothesis reassessment.
- Computational State: Tokens serve as a persistent “scratchpad,” allowing the model to offload intermediate results and extend its computational budget.
- Backtracking: Enables the model to autonomously identify and pivot away from incorrect logical branches without external reward models.
Example: DeepSeek-R1 (Mathematics)
1. Initial Logic & Calculation
The model begins solving a geometry puzzle by mapping variables.
2. Self-Correction
A trigger word (“Wait”) signals model has detected a constraint violation.
3. Alternative Strategy Exploration
The model discards the old path and restarts with the correct data.
4. Final Verification
Before outputting, the model checks for internal consistency.
- GRPO: Shao et al. (2024), DeepSeekMath arXiv:2402.03300
- DeepSeek-R1 Guo et al. (2025) DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning, Nature https://www.nature.com/articles/s41586-025-09422-z
Example: GPT o1, Claude 3.7 Sonnet & DeepSeek R1
GPT-o1 and subsequent models utilise chain of thought as mechanisms within reasoning.
- o1 introduced chain of thought style reasoning at inference in 2024, which enables training using RL for teaching the model to reason before answering
Claude 3.7 Sonnet introduced chain of thought reasoning in 2025.
DeepSeek-R1 utilises chain of thought
- R1’s utilised Group Relative Policy Optimisation (GRPO), a memory efficient fine tuning technique harnessing chain of thought at training and inference (covered later in this Module).
Example: Gemini Deep Think
Gemini Deep Think is a reasoning mode designed for hard problems in mathematics, coding, planning, and science.
Rather than producing one immediate answer, it uses:
Extended inference-time compute
More “thinking time” before answering.Parallel thinking
Multiple hypotheses or solution paths are explored and compared.Critique and revision
Candidate answers can be checked, refined, or rejected.RL-shaped reasoning behaviour
Google describes novel reinforcement learning techniques that encourage the model to use longer, multi-step reasoning paths.
Example: Mathematics Gold Medal (Gemini Deep Think)

DeemMind’s Gemini Deep Think Model wins Gold Medal at International Mathematical Olympiad (IMO)
Example: Gemini Deep Think (RL analogy)
Deep Think can be viewed as moving from answer prediction toward policy improvement over reasoning traces.
| RL concept | Deep Think analogy |
|---|---|
| State | Current problem + partial reasoning context |
| Action | Generate, branch, critique, revise, verify |
| Reward | Correctness, coherence, problem-solving success |
| Policy | Model’s learned strategy for choosing reasoning steps |
| Search | Parallel exploration of possible solution paths |
Example: Chain of thought in Anthropic’s Mythos
Anthropic’s Mythos model utilises human supervised chain of thought
Mythos is useful for penetration testing in cyber security
Utilises human supervised reasoning traces from chains of thought as a training target
Human supervision steers the model toward which vulnerabilities would be genuinely serious, dangerous bug classes, and potential security flaws
The model is trained to learn meaningful exploit hypotheses
Classical planning was originally utilised for penetration testing, through the development of PDDL represented attack models.
Faithfulness: Human Supervised Chain of Thought
Human supervised chain of thought models can create a safety challenge, as the model is trained to generate reasoning traces that humans expect
The risk is that the model learns to produce reasoning that looks acceptable rather than reasoning that is fully faithful
This can lead to what appears from a human perspective to be deceptive behaviour by the model, even though it makes sense from the attention-based transformer’s loss perspective.
Faithfulness: Open Problem in LLMs
Faithfulness of chain-of-thought reasoning under human supervision remains an unresolved problem
- Most LLMs do not directly train on human-labelled reasoning traces as ground-truth targets (or do so only in limited ways)
There is the key challenge:
Performance: reasoning-trace supervision can improve capability (e.g. decomposition, verification, structure)
Faithfulness: the trace may not reflect the model’s true internal reasoning
A reasoning trace can be, useful, plausible and reward-optimised without being causally faithful, i.e. Improving reasoning \(\neq\) ensuring faithful explanations
Example: Edge Computing (Gemma 4: Open Multimodal Model)
DeepMind released Gemma 4 on 2 April, 2026.
Performance: The 31B model acts as a “server-grade” model for local use, while smaller variants (2B/4B) are optimised for mobile, edge and on-device.
Agentic Capabilities: Built for multi-step planning, autonomous actions, and tool use.
Multimodal: Natively supports text, images, and audio, along with dynamic vision resolution.
Large Context: Supports up to 256,000 token context windows
Open-Source: Released under the Apache 2.0 licence
Knowledge distillation is a key element in training Gemma 4.
Agentic (Multi-Agent) Transformers
Major paradigms
Major paradigms are emerging which reflect a convergence of RL with autoregression and transformer models.
Multi Actor-Critic Transformers (online, no dynamics).
World-Model RL (planning, includes dynamics)
Vision (multi-modal)
Agentic (roles)
Example: Actor-Critic Transformers
Actor-critic transformers are RL agents that use transformer architectures to parameterise
- the actor, critic, or both,
- while still relying on Bellman equations and policy gradients for learning.
Actor-Critic Transformers essentially outperform Long Short-Term Memory (LSTMs) networks in long-horizon Partially Obervable Markov Decision Processes (POMDPs)
Three Layer Agent Stack
| Layer | Role (e.g. of an Agent or Robot) | Typical tools |
|---|---|---|
| Cognition / Reasoning | Query answering, Programming, multi-step thinking & planning: goal decomposition, tool selection, self-reflection, backtracking & correction, safety checks etc. | LLM with serialized Chain of Thought (GPT, Gemini, Claude, DeepSeek - Thinking, Programming & Operator modes; OpenClaw, MiniMax, etc.) |
| Semantic Policy (Vision–Language–Action) | Grounds instructions & scene into actionable subgoals / waypoints | Vision-Language-Action (VLA) Transformers (RT-X / RT-2-X) |
| Control / Dynamics | Execute precise motions, stabilize, react to dynamic feedback | Generative Model-based RL (STORM) |
Agent decomposition: Serialization
Agent decomposition can itself be learned through chain of thought reasoning
Attention-based transformers can infer useful sub-agents, subgoals, or roles from by serialization of context in token stream
- Decomposes a difficult task into subproblems
- Assigns roles, tools, or expertise to sub-agents
- Coordinate their interaction through shared context
- Learn useful decompositions directly from experience or data
In this sense, agentic structure need not be engineered entirely by hand; it can be learned from human data, experience or self-supervision (simulation).
Example: OpenAI’s Codex Command Line Interface (CLI)
OpenAI’s Codex CLI serves as a primary example of how Chain-of-Thought (CoT) prompts can naturally transition into Agent Decomposition for software tasks.
- CLI Workflow: By interfacing via the command line, we treat the LLM as a “Manager Agent” that breaks down high-level requirements into executable sub-tasks.
Architectural Training:
Prompting by Example: Use few-shot prompting \(+\) examples of successful system architectures to “train” the model on how to delegate.
Reasoning-to-Execution: CoT allows the agent to think (e.g., “I need a database schema first, then the API agents”) before the CLI executes the file creation.
Feedback Loop:
Self-Correction: The agent can inspect CLI error outputs to refine its decomposition logic in real-time.
Modularity: High-level CoT ensures each module is built in isolation, mimicking a multi-agent environment within a single interface.
“The conversation with the CLI essentially orchestrates a system of sub-agents directed by a serialised chain of thought reasoning chain.”
Agent decomposition challenge: Multi-Agent Systems
The decomposition of agents into multiple sub-agents, together with their respective roles, coordination mechanisms, and communication protocols remains an important challenge in the agentic approach.
Research and development into multi-agent systems has a long and fertile history in artificial intelligence
- There are many solutions to agentic problems in the existing literature.
Example: OpenClaw: agent RL from live interaction

OpenClaw is a framework for training LLM-based agents, such as Codex CLI/Claude, online, from normal usage.
Core idea: after each action, the agent observes the next state:
- user reply
- tool output
- terminal result
- GUI change
These next states are treated as RL feedback signals.
OpenClaw: OpenClaw-RL
OpenClaw-RL allows for agents to learn and improve by simply talking to your agents
- OpenClaw uses a training objective based on a standard PPO-style clipped surrogate (covered later in this module).
OpenClaw: moltbook & ClawHub
OpenClaw has experienced record-breaking growth since its launch in November 2025
- OpenClaw is frequently used to write agents which run on moltbook
- ClawHub is a forum where you register and discover new skills.
OpenClaw: Sequential Markov Decision Problem (MDP)
OpenClaw frames agent as a sequential, Markov Decision Problem:
- state = current conversational / tool context
- action = generated response or tool-use step
- transition = what happens next in the environment
- reward = inferred from the resulting next state
OpenClaw reframes ordinary agent interaction as an MDP-like loop - rather than relying only on static preference datasets, it learns from what actually happens after the model acts.
OpenClaw: RL Techniques
Key RL idea: learn from experience, not just offline data. It utilises two kinds of signal from the next state:
- evaluative signal: how good the action was
- directive signal: how the action should be improved
OpenClaw’s emphasises is on a fully asynchronous setup:
- servers,
- rollout collection,
- judging / reward estimation, and
- training
all run concurrently
OpenClaw: RL Techniques (continued)
It connects LLM agents back to standard RL ideas
- but in open-ended, tool-using, partially observed environments
Limitation:
- reward estimation is indirect and depends on judges / heuristics
- stability and safety rely on multi-agent architecture techniques
OpenClaw uses tools, shells, and GUIs.
Example: MiniMax / Forge: large-scale RL for agents

MiniMax is an RL framework for training agentic foundation models at scale.
- Associated with MiniMax M2.5 is a mixture of experts (MoE) foundation model trained across many real-world task environments
- Forge is the RL framework/infrastructure, the trainer, specialises in connecting models with live environments (terminal, browsers, code repositories etc.)
Main concern: scaling RL while balancing three competing goals:
- throughput
- stability
- agent flexibility
MiniMax / Forge: large-scale RL for agents (continued)
The emphasis is less on a single elegant RL algorithm, and more on:
- scalable infrastructure
- asynchronous scheduling
- efficient rollout + training pipelines
- composite reward design
MiniMax / Forge: why it matters for RL
Reinforcement learning is being pushed into messy, long-horizon productivity tasks:
- coding
- search
- office workflows
- tool use
Key insight:
- frontier RL is increasingly about training agents in many realistic environments
MiniMax can be used by no-code agent platforms such as MindStudio
RL can optimise task decomposition and tool-use policies for MiniMax / Forge
- rewards may be composite, delayed, noisy, and environment-specific
- scaling requires attention to off-policy effects, variance, and system design
Comparison with classical RL:
- same basic loop: act \(\rightarrow\) observe \(\rightarrow\) evaluate \(\rightarrow\) update
- different regime: huge contexts, tool chains, heterogeneous tasks, expensive rollouts
Planning & RL for Transformer Models
Planning & RL techniques for Transformers?
Sequential decision-making is on the ascendant:
RL, model-based control, and planning-like reasoning are central to agents, robotics, and tools using tensor flow (transformer) architectures.
So “planning” and “RL” must live inside parallel, scalable neural network systems.
Limiting assumptions:
- Classical planning & Reinforcement Learning’s typical teaching setup (fully observable, deterministic, stationary, discrete) mismatches many modern AI settings.
Planning & RL techniques for Transformers (continued)
Compute & tooling:
Tree search doesn’t map cleanly onto GPU/TPU throughput the way dense tensor operators do
Differentiability matters for end-to-end training, credit assignment, and integration with deep stacks
Question: What is required to run RL & Planning GPUs?

Nvidia’s liquid cooled GB200 Grace Blackwell (Tensor Core) Superchip can connect up to 576 of Blackwell GPUs in a single domain with over 1 PB/s total bandwidth (image by 极客湾Geekerwan, CC BY 3.0, Link)
What is required to run RL & Planning on a GPU?
- During training?
- At inference (query time)?
Value & Policy Approximation: Soft Actor-Critic (SAC)
Soft Actor-Critic (SAC): Differentiable Planning
Greedy next-step choice using max
- defines the Bellman optimality operator used in Q-learning/DQN. \[ Q^*(s,a) = r(s,a) + \gamma \,\mathbb{E}_{s'\sim P}\!\left[\max_{a'} Q^*(s',a')\right] \]
Soft (Entropy-Regularized) Bellman Backup - “softmax” \[ \begin{align*} Q_{\text{soft}}(s,a) & = r(s,a) + \gamma \,\mathbb{E}_{s'\sim P}\!\left[ V_{\text{soft}}(s')\right] \quad \\[0pt] V_{\text{soft}}(s) & = \mathbb{E}_{a'\sim \pi(\cdot|s)}\!\left[ Q_{\text{soft}}(s,a') - \alpha \log \pi(a'|s)\right] \end{align*} \]
Adds entropy bonus (temperature \(\alpha\)) ⇒ softens the hard max.
As \(\alpha\!\to\!0\): \(V_{\text{soft}}(s)\!\to\!\max_{a'} Q(s,a')\) (recovers hard backup).
Soft Actor-Critic (SAC): Implementation note
A min over two critics, \(\min\{Q_{\theta_1},Q_{\theta_2}\}\), is often used to reduce overestimation bias (Double-Q trick), not as the backup operator.
- This relaxation allows gradients to flow through the planning step.
Example: Table Tennis Player using Soft Actor-Critic (SAC)
Sony-AI’s Ace player is trained entirely on single shots in simulation using the Soft Actor-Critic (SAC) algorithm to return ball with desired skill, beating elite players.
- Durr et al., (2026) Outplaying elite table tennis players with an autonomous robot, Nature https://www.nature.com/articles/s41586-026-10338-5
Imagination & Latent Rollouts: STORM
STORM (Stochastic Transformer World Models): Atari

Stochastic Transformer-based wORld Model (STORM) targets the Atari 100k benchmark games, like MuZero.
STORM: How it works
A world model-based RL method for learning in imagination (simulation inside a learned world model)
It combines:
- Categorical Variational Autoencoder (VAE): Encodes raw images into discrete latent “tokens.”
- Transformer Dynamics: An attention-based sequence model that predicts future states (imagination).
- Stochastic Latent Variable: Accounts for environmental uncertainty and randomness.
- Actor-Critic Agent: A separate neural network learns the optimal policy by “practising” entirely within the Transformer’s imagined rollouts.
STORM: Imagination & Latent Rollouts
Repeating the steps in imagination forms a trajectory.
This allows the agent to learn a policy (how to act) or a value function (how “good” a state is) purely by practising inside its own “head”.
Imagination is technically known as latent rollouts, a widely accepted approach for generative model-based reinforcement learning.
STORM: Motivation
Transformers are strong at long-range sequence modelling.
Stochastic latents help capture uncertainty and non-determinism
- i.e. a stochastic latent says, given this past, there may be several plausible next hidden states
STORM achieves 126.7% mean human-normalised score on the Atari 100k benchmark, a state-of-the-art result among methods without look-ahead search.
MuZero branches over actions with MCTS, whereas STORM rolls forward imagined latent sequences in a transformer world model
- This makes STORM more GPU-friendly and more naturally aligned with end-to-end differentiable learning.
Comparison: MuZero versus STORM
| Aspect | MuZero | STORM |
|---|---|---|
| Core idea | Combines learning + Monte-Carlo Tree Search (MCTS) in latent space | Learns a stochastic transformer world model and performs imagination rollouts |
| Planning form | Expands a search tree: (s_0 s_1, s_2, ) | Rolls out a latent sequence using a transformer world model: (z_t z_{t+1} z_{t+2} ) |
| Model components | Representation (h(o_t)), Dynamics (g(s,a)), Prediction (f(s)) | Tokenizer / latent encoder, stochastic transformer dynamics model, reward/value/policy heads |
| Computation | Search-based, often CPU-heavy, not end-to-end differentiable through search | GPU-friendly, transformer-based, trained through batched imagined trajectories |
| Learning loop | Tree search generates improved policies; network distils them via supervised losses | Actor–critic trained on imagined trajectories generated by the transformer world model |
| Search structure | Discrete branching, value backups | Sequential imagination, typically no explicit branching tree search |
| Output policy | Derived from visit counts in the search tree | Learned directly through gradient updates from imagined rollouts |
| Analogy | “Plan by explicit search” | “Learn through transformer imagination” |
Policy Gradient: Proximal Policy Optimisation (PPO)
PPO Fine tuning ChatGPT & human feedback (Revisited)
- Proximal policy optimisation (PPO) is used by GPT-3.5 onwards
PPO in Agentic AI
PPO is becoming popular for Agentic AI modes, and is used in
- GPT’s Operator, and
- Claude’s Computer Use modes
PPO is also becoming popular for World Models in robotics, and is used in
- Vision, Language & Action (VSA) attention-based transformers
- Robot RT-X transformers
Proximal Policy Optimisation (PPO) for LLMs
PPO treats the LLM as a policy over tokens and updates it using reward feedback.
The purpose is to make better responses more likely, but without letting the model drift too far in a single update. This makes reinforcement-learning-based fine-tuning much more stable.
- generate responses
- score them with a reward model or preference signal
- update the model toward higher-reward outputs
- constrain the update so the change remains small
PPO is a stability-oriented way to nudge an LLM toward higher-reward behaviour.
Proximal Policy Optimisation (PPO) for LLMs (continued)
Goal: Stable, sample-efficient policy improvement
Idea: Constrain how far the new policy moves from the old one at each update during actor-critic cycle
Policy Objective Functions (Revisited)
From a policy gradient perspective, the true objective function \(J^{{\pi}_{\theta}}(\theta)\) over policy \(\pi_{\theta}\) is defined as follows
\[ J^{{\pi}_{\theta}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [G_0]=V^{\pi \theta}(d_0) \]
where \(d_0\) is the distribution over initial states at the start of each episode \(\tau\)
However, there are two problems in the actor-critic setting:
1. Computing \(J^{{\pi}_{\theta}}(\theta)\) exactly would require integrating over all trajectories, \(\tau\), of the current policy, which is impractical
2. If we update the parameters \(\theta\), it will effect objective value during the optimisation process, leading to (circular) feedback
We therefore need a surrogate objective independent of the trajectory distribution under the new policy \(\color{blue}{\pi_{\theta}}\) we are building
Surrogate Objective
From the policy-gradient theorem, we can define the importance ratio
\[ r_t(\theta) \;=\; \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \qquad \]
We now define the surrogate objective \(L_{PG}\) for the true objective
\[ L_{PG}(\theta) \;=\; \mathbb{E}_t\!\big[\, r_t(\theta)\,\hat A_t \,\big] \]
Where \(\hat{A}_t\) captures how much better action \(a_t\) was than the state’s average
\(\hat{A}_t\) is an estimator of the true advantage function \(A^{\pi}\)
\(A^{\pi}(s_t,a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(S_t)\)
Clipped Surrogate (Core Idea)
Kullback-Leibler (KL) divergence theory tells us we want improvement without overly large steps in policy space, so we define
\[ L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\! \left[ \min\!\Big( r_t(\theta)\,\hat A_t,\; \mathrm{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat A_t \Big) \right] \]
If \(r_t(\theta)\) leaves the interval \([1-\epsilon,\,1+\epsilon]\), the objective is clipped.
Typical range for \(\epsilon \in [0.1,\,0.2]\).
Prevents destructive updates while preserving ascent direction
The clipped surrogate objective in PPO plays a similar stabilising role to compatible function approximation — both constrain policy updates so that gradient estimates remain accurate and unbiased with respect to the true policy improvement direction.
Complete PPO loss
\[ L^{\text{PPO}}(\theta) = \mathbb{E}_t\!\Big[{\color{blue}{L^{\text{CLIP}}(\theta)}} - c_1\,{\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}} + c_2\,\mathcal{H}\!\left[\pi_\theta(\cdot\mid s_t)\right] \Big] \] \(\;\;\;\;\;\;\;\;\;\;\;\;\;\;\)where \(c_1, c_2\) are coefficients
The actors policy gradient (surrogate objective) is \(\color{blue}{L^{CLIP}(\theta)}\)
- This encourages the policy to increase the probability of actions with positive advantage and decrease it for negative ones
The critics value function is \(\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}\)
- This trains the network to predict correct returns (mean-squared error).
PPO Entropy bonus
The entropy bonus \(\mathcal{H}\) encourages exploration
\[ \mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big] = - \sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t) \]
The entropy term encourages exploration by rewarding stochastic (uncertain) policies.
It’s high when the policy is uncertain or “spread out” (exploratory).
It’s low when the policy is confident or deterministic.
The dot “\(\cdot\)” in \(\pi_{\theta}(\cdot | s_t)\) means over all possible actions, i.e. the vector of probabilities \(\pi_{\theta}(a_1,s_t), \pi_{\theta}(a_2,s_t), \ldots\)
In practice this maintains stochasticity until policy becomes more confident or deterministic.
Generalised Advantage Estimation (GAE)
In practice, PPO uses a low-variance, low-bias estimate of the advantage \(A^\pi(s_t,a_t)\).
TD error: \[ \delta_t \;=\; r_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t) \]
GAE-\(\lambda\): \[ \hat A_t^{(\lambda)} \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l} \;=\; \delta_t + \gamma\lambda\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots \]
Return/target used for critic \[ \hat V_t^{\text{target}} \;=\; \hat A_t^{(\lambda)} + V_\phi(s_t) \]
- \(\lambda\in[0,1]\) trades bias \(\leftrightarrow\) variance, typical PPO: \(\gamma \approx 0.99\), \(\lambda \approx 0.95\).
PPO Algorithm
- Multiple epochs over the same batch are okay because clipping limits drift
Why PPO works (intuition)
First-order solution method
- No additional constraint solving required which can introduce second-order effects
Trust-region-like behaviour via clipping
- Optimisation within trust-region involves only taking steps that stay within a region where local approximation is reliable (founded on KL information theory)
Robust across discrete/continuous control
- Strong baseline performance in practice
Policy Optimisation Techniques
DeepSeek’s R1 uses Group Relative Policy Optimization (GRPO), forming the core of the reasoning process, and operates over the reasoning token stream.
Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv:2402.03300v3, pp1-30, 2024
DeepSeek-R1 uses a “pure RL” approach (DeepSeek-R1-Zero) that utilises self-supervised trial and error approach, without the need for human supervision, rather like MuZero.
GRPO can be used for knowledge distillation, where a student model mimics a teacher model via supervised fine-tuning (SFT). The “Teacher” model acts as the Reward Function.
Policy Optimisation Techniques (continued)
Direct Preference Optimisation (DPO)
Rafailov et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , arXiv:2305.18290v3, pp1-27, 2023
While GRPO & PPO dominate online RL for reasoning, DPO remains the industry standard for offline preference alignment
Offline data must include high-quality datasets, i.e. chosen vs. rejected pairs.
Policy optimisation utilisation
| Foundation Model | RL/Query Optimiser | Example |
|---|---|---|
| Attention-based transformer / LLM | PPO, DPO, GRPO / CoT | GPT, Gemini, Claude, DeepSeek-R1 - Thinking, Programming, Operator and Computer Use modes; OpenClaw-RL etc. |
| Attention-based transformer / Agentic & Vision + Language + Action (VLA) | PPO | OpenClaw-RL, MiniMax, VLA, RT-X, |
Chain of thought (CoT)
Multi Actor-Critic: IMPALA & V-trace
Distributed RL: IMPALA & V-trace
Importance Weighted Actor-Learner Architecture (IMPALA)
In production in DeepMind, OpenAI & Google DeepRL
Decoupled actor-learner: many CPU actors generate trajectories under behaviour policy \(\mu\); a central GPU learner updates \(\pi\).
High throughput via batched unrolls (e.g., length (n)); supports RNNs (LSTM) and multi-task.
Challenge: policy lag \(\rightarrow\) off-policy data.
Solution: V-trace targets for stable off-policy learning.
Off-policy with correction that handles policy lag without sacrificing throughput
IMPALA was developed at DeepMind
Distributed RL: V-trace essentials
Let importance ratios \(\displaystyle \rho_t=\min\!\left(\bar{\rho}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\)
\(\displaystyle c_t=\min\!\left(\bar{c}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\) with \(\bar{\rho}\ge \bar{c}\)
Value target (per time \(s\))
\[
\delta_t^{V} \;=\; \rho_t\Big(r_t + \gamma\,V(x_{t+1}) - V(x_t)\Big),\;\;v_s \;=\; V(x_s) \;+\; \sum_{t=s}^{s+n-1} \gamma^{\,t-s}
\!\left(\prod_{i=s}^{t-1} c_i\right)\! \delta_t^{V}
\]
Policy gradient with V-trace advantage
\[
A_t^{\text{V-trace}} \;=\; r_t + \gamma\,v_{t+1} - V(x_t), \qquad
\nabla_\theta J \;\propto\; \rho_t\,\nabla_\theta \log \pi_\theta(a_t|x_t)\,A_t^{\text{V-trace}}
\]
Loss (typical)
\[
\mathcal{L} \;=\; \underbrace{\mathbb{E}\big[(v_s - V(x_s))^2\big]}_{\text{value}}
\;-\; \beta\,\underbrace{\mathbb{E}\big[\rho_t \log \pi(a_t|x_t)\,A_t^{\text{V-trace}}\big]}_{\text{policy}}
\;-\; \eta\,\underbrace{\mathbb{E}\big[\mathcal{H}(\pi(\cdot|x_t))\big]}_{\text{entropy}}
\]
Why it works
Clipped IS ratios \((\rho_t, c_t)\) tame variance/bias;
Multi-step correction handles policy lag without sacrificing throughput.
Representative efficient actor-critic methods
| Category | Example algorithms | Key strengths |
|---|---|---|
| On-policy | PPO | Stable, parallelizable, easy; standard in LLM fine-tuning (RLHF) |
| Off-policy (stochastic) | SAC | Maximum-entropy objective → robust exploration; excellent data efficiency |
| Distributed | IMPALA, V-trace | Massive scalability; production in DeepMind, OpenAI, Google DeepRL |
Efficiency and Performance Comparison
| Dimension | MuZero / Sampled MuZero / EfficientZero | PPO / SAC / IMPALA |
|---|---|---|
| Sample efficiency | Excellent when planning can reuse a model (Atari, board games) | High for off-policy (SAC); moderate for PPO |
| Wall-clock / GPU efficiency | Poor (search is serial & CPU-bound) | Very good (fully parallel on GPU) |
| Robustness & stability | Sensitive to model errors / rollout length | Stable with tuned hyper-parameters |
| Scalability to real-time tasks | Hard (search latency) | Good; used in robotics, continuous control, large-scale RL (IMPALA, V-trace) |
| Best-case performance | Outstanding in structured domains (Go, Atari) | State-of-the-art in most continuous-control and real environments |
Vision & Robot Transformers: ViTs & RT-X
Example: Vision Transformers (ViTs)
Since ~2020, attention-based Transformers have started competing and often surpassing CNNs in large-scale vision benchmarks.
Image \(\rightarrow\) patches \(\rightarrow\) tokens \(\rightarrow\) transformer
Patchify the image: split an image of size (HW C) into non-overlapping patches (PP).
Number of tokens \(N=\frac{HW}{P^2}\).Linear patch embedding: flatten each patch \(x_i\in\mathbb{R}^{(P^2C)}\) and project
\(z_i^0 = W_E x_i + b_E \in \mathbb{R}^D\).
(Often implemented as a conv with kernel/stride \(P\).)
Vision Transformers - Tokens and Transformer Encoding
Add a class token and positions: prepend \([\mathrm{CLS}]\) and add learnable positions
\(\tilde{z}_i^0 = z_i^0 + p_i\), with sequence \([\tilde{z}_{\text{CLS}}^0, \tilde{z}_1^0,\ldots,\tilde{z}_N^0]\).Transformer encoder stack (repeated (L) times):
\(\text{SA}(X) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\) with multi-head self-attention,
then MLP; both with residuals + layer norm.Prediction head: take the final \([\mathrm{CLS}]\) (or pool all tokens) \(\to\) linear head () class probs.
Note: smaller \(P\) \(\Rightarrow\) more tokens (detail ↑, cost ↑); larger \(P\) \(\Rightarrow\) fewer tokens (detail ↓, cost ↓).
Variants like Swin use local windows with shifts for scalability; ViT uses global attention.
Example: RT-X Robot Transformers
RT-X:
Increasingly transformers are also being used for robotics (e.g. RT-1, RT-2, RT-X Google DeepMind)
- large-scale imitation across many robots.
“RT-family” includes hybrid attention across vision, language, and control.
- They utilises Visual, Language, Action (VLA) transformers
References
Transformers: Vaswani et al. (2017), Attention Is All You Need NeurIPS 2017 arXiv:1706.03762
PPO: Schulman et al. (2017), Proximal Policy Optimization Algorithms arXiv:1707.06347
DPO: Rafailov et al. (2023), Direct Preference Optimization NeurIPS 2023 arXiv:2305.18290
GRPO: Shao et al. (2024), DeepSeekMath arXiv:2402.03300
DeepSeek-R1 Guo et al. (2025) DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning, Nature https://www.nature.com/articles/s41586-025-09422-z
IMPALA: Espeholt et al. (2018), IMPALA ICML 2018 arXiv:1802.01561
STORM: Zhang et al. (2023), STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning NeurIPS 2023 arXiv:2310.09615
OpenClaw-RL: Wang et al. (2026), OpenClaw-RL: Train Any Agent Simply by Talking arXiv:2603.10165
Ace (Sony-AI) Durr et al., (2026) Outplaying elite table tennis players with an autonomous robot, Nature https://www.nature.com/articles/s41586-026-10338-5
PDDL with LLMs Silver et al. (2024) Generalized planning in PDDL domains with pretrained large language models https://ojs.aaai.org/index.php/AAAI/article/view/30006