12 Planning & RL for Transformers (Advanced Topic)

This Module

Last Module: Deep learning & Tree Search

This Module: Planning & RL for Transformers

What is covered? How to implement planning and RL on GPU/TPUs.

Introduction to Attention-based Transformers

Attention-based Transformers & Implicit Planning

Attention-based transformers can simulate multiple futures inside a hidden state.

Self-attention makes lookahead possible:

  • Attention learns which futures matter
  • Transformers can simulate “multiple futures” inside the hidden state

Implicit planning is an amortised form of planning, i.e. pay the cost of planning upfront during training, so that at test time the system can act without running a full planner or search each time.

  • Attention-based transformers have the capacity for this kind of implicit planning

The Elements of Attention-based Transformers

  1. Autoregression & joint distribution factorisation
  2. Causal (masked) latent transformers
  3. Attention-based transformers
  4. Causal masking

1. Autoregression & joint distribution factorisation

Autoregression models a sequence by predicting each element from all previous elements, i.e. \(x_t = f(x_{1:t-1}) + \epsilon\) where \(\epsilon\) is the noise or uncertainty. It factorises a joint distribution as follows:

\[ p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}) \]

where \(p(x_{1:T})\) is the probability of seeing the entire sequence.

  • Used in time series, sequence modelling, and autoregressive transformers.

  • At inference time, predictions are fed back step-by-step to generate sequences.
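The factorisation and the feed-back loop above can be sketched in a few lines of Python. This is a minimal illustration, not a trained model: the function `conditional` is a hypothetical hand-written stand-in for \(p(x_t \mid x_{1:t-1})\) over a two-token vocabulary, and for brevity it only looks at the last token of the prefix.

```python
import math

# Hypothetical toy conditional p(x_t | x_{1:t-1}) over the tokens {"a", "b"}.
# Assumption: it only inspects the last token (a first-order special case).
def conditional(prefix):
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    return {"a": 0.9, "b": 0.1} if prefix[-1] == "a" else {"a": 0.2, "b": 0.8}

def joint_log_prob(seq):
    """log p(x_{1:T}) = sum_t log p(x_t | x_{1:t-1})."""
    return sum(math.log(conditional(seq[:t])[seq[t]]) for t in range(len(seq)))

def greedy_generate(length):
    """Inference: each prediction is appended and fed back as context."""
    seq = []
    for _ in range(length):
        probs = conditional(seq)
        seq.append(max(probs, key=probs.get))
    return seq

# Joint probability of the sequence a, a, b: 0.5 * 0.9 * 0.1 (up to rounding).
p = math.exp(joint_log_prob(["a", "a", "b"]))
```

A real autoregressive transformer replaces `conditional` with a network that conditions on the whole prefix, but the product over conditionals and the generation loop keep the same shape.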

To enforce causal structure, transformers use masking (next step).

2. Causal latent transformers

A causal latent transformer is a transformer that models sequences of latent states, using attention masks to enforce causality or planning structure, just like in autoregressive language models.

  • It is a transformer that predicts (or refines) latent variables in a sequence, but only using allowed past or partial information, enforced through a mask.

Causal latent transformers appear in world-model RL, video generation, and planning-as-generation frameworks (implicit planning in neural networks).

  • They are increasingly used as a replacement for recurrent neural networks (RNNs) and latent dynamics models, such as the RSSM used in Dreamer-like systems (note that Dreamer and other recurrent approaches are not covered in this Module).

3. Attention-based transformers

Transformers model sequences using self-attention, where each token computes weighted interactions with other tokens.

Key components:

  • Token embeddings and positional encodings

  • Self-attention:

\[ \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V \]

where \(Q=\mathit{queries}\), \(K=\mathit{keys}\), \(V=\mathit{values}\), \(d_k\) is the dimension of the key vectors, and \(\top\) denotes the matrix transpose.

  • Multi-head attention, feed forward layers, residual connections
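As a rough sketch of the self-attention formula above, the following pure-Python function computes \(\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V\) for small lists of row vectors. The input values are illustrative assumptions; a practical implementation would use batched tensor operations on GPU rather than Python loops.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of QK^T / sqrt(d_k): similarity of this query to every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

# Toy example (illustrative values): two tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]  # identity values, so each output row equals its attention weights
out = attention(Q, K, V)
```

Because `V` is the identity here, each output row is simply the attention distribution of that token, which makes the weighting easy to inspect.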

4. Causal masking

Causal masking is used in autoregressive transformers:

  • Each token attends only to past tokens

  • Enforces the autoregressive condition

\[ x_t \sim p(x_t \mid x_{1:t-1}) \]

where \(\sim\) means sampled from.

This condition means that when predicting the next token, the model is only allowed to use earlier tokens, not future ones.

This masked attention mechanism directly carries over to masked latent transformers used in world-model RL.
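One common way to realise the causal mask is to set the attention scores of future positions to \(-\infty\) before the softmax, so that their weights become exactly zero. The sketch below assumes that approach; the token vectors in `X` are illustrative.

```python
import math

def causal_attention_weights(Q, K):
    """Row i holds the attention weights of token i over tokens 0..i (future masked out)."""
    d_k = len(K[0])
    rows = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Causal mask: positions j > i get -inf, so their softmax weight is exactly 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # illustrative token vectors, used as both Q and K
W = causal_attention_weights(X, X)
```

The resulting matrix `W` is lower triangular: the first token can only attend to itself, the second to the first two, and so on.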

Masked latent transformers & languages

The original Transformer was introduced as an attention-only sequence model, initially for natural language translation

  • but its broader importance is that it has become a general computational approach for modelling structured representations.
  • Masked latent transformers have been shown to be Turing complete.

Importantly, transformers have shown that multiple languages can be treated as sequences or structured token streams

  • For example English (or any language), Python, Markdown/LaTeX (these slides), mathematical proofs, SQL, even multi-modal representations.

Visualising attention-based transformers

A useful way to visualise attention-based transformers is as follows:

  • Horizontal Axis (Tokens/Sequence): Each position corresponds to a specific token (word or sub-word) in your input.

  • Vertical Axis (Layers/Depth): As you move “up” the network layers, the model is building more abstract, contextual meanings. The bottom layers might focus on simple syntax, while higher layers handle complex relationships and intent.

  • Attention Intersections: The “links” connecting horizontal tokens across the vertical layers represent the Attention Weights, showing which other words the model is looking at to understand the current word.

Visualising attention-based transformers (continued)

This 3Blue1Brown video demonstrates the horizontal (tokens) / vertical (layers) visualisation.

Example: Masked latent transformers in LLMs

For large language models (LLMs) in natural language processing

  • Tokens = words or sub-words

  • Mask = causal mask (can’t see the future)

In masked latent transformers:

  • Tokens = latent states \(z_t\) (learned hidden representations)

  • Mask = ensures correct temporal, causal, or planning structure

Example: LLMs for PDDL

Pretrained LLMs can also harness PDDL for generalised planning by being prompted with a PDDL domain and a small number of training problem instances.

  • A solution to generalised planning finds a reusable policy that solves many related problem instances, not just one specific initial state.

  • GPT-4 is used to synthesise and output a Python generalised planner; the program is tested and, if it fails, GPT-4 is re-prompted with debugging feedback.
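The synthesise, test and re-prompt loop described above can be sketched as follows. Everything here is a hypothetical stand-in: `fake_llm` returns canned source code instead of calling a real model (and ignores the prompt), the toy "blocks" domain and `is_valid` checker are invented for illustration, and none of this is the code from the cited paper.

```python
# Hypothetical toy domain: a problem is a stack of blocks listed bottom-to-top;
# a valid plan unstacks them top-to-bottom.
def is_valid(problem, plan):
    return plan == [("unstack", b) for b in reversed(problem)]

# Stand-in for the LLM (an assumption, not a real API): returns canned source code.
# The first candidate is deliberately buggy; the second is the "repaired" program.
CANDIDATES = [
    "def plan(problem):\n    return [('unstack', b) for b in problem]\n",
    "def plan(problem):\n    return [('unstack', b) for b in reversed(problem)]\n",
]

def fake_llm(prompt, attempt):
    return CANDIDATES[min(attempt, len(CANDIDATES) - 1)]

def synthesise_planner(domain_text, training_problems, max_attempts=4):
    prompt = domain_text
    for attempt in range(max_attempts):
        source = fake_llm(prompt, attempt)
        namespace = {}
        exec(source, namespace)          # load the candidate program
        planner = namespace["plan"]
        failures = [p for p in training_problems if not is_valid(p, planner(p))]
        if not failures:
            return planner, attempt + 1  # passes every training problem
        # Otherwise re-prompt with debugging feedback about a failing instance.
        prompt = f"{domain_text}\nYour program failed on: {failures[0]}"
    return None, max_attempts

planner, attempts = synthesise_planner("(toy blocks domain)", [["a", "b"], ["x", "y", "z"]])
```

The returned `planner` is an ordinary Python function, so it can be applied to new problem instances without querying the model again, which is the sense in which the planner is "generalised".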


GPT-4 outperforms one of the leading domain-independent classical planners, Fast Downward, on generalised planning benchmarks

  • GPT-4 is shown to be 1-2 orders of magnitude faster during inference but does not quite match coverage on the benchmarks (it misses a few), although subsequent models are likely to improve this.

  • Silver et al. (2024) Generalized planning in PDDL domains with pretrained large language models https://ojs.aaai.org/index.php/AAAI/article/view/30006

Serialization of Reasoning: Chain of Thought

Serialization of Chain of Thought Reasoning

Reasoning in attention-based transformers is implicit because reasoning happens inside a serialized, single-token stream

  • Reasoning Tokens: Represent the reasoning process as a sequence of chain-of-thought steps.
  • Self-Reflection & Correction: Serialization facilitates self-reflection via specialised tokens (e.g., “wait,” “but”) which act as triggers for error-checking and hypothesis reassessment.
  • Computational State: Tokens serve as a persistent “scratchpad,” allowing the model to offload intermediate results and extend its computational budget.
  • Backtracking: Enables the model to autonomously identify and pivot away from incorrect logical branches without external reward models.

Example: DeepSeek-R1 (Mathematics)

1. Initial Logic & Calculation
The model begins solving a geometry puzzle by mapping variables.

Reasoning Token Stream

“To find the area, I first need the radius. The circumference is \(20\pi\), so \(2\pi r = 20\pi\), which means \(r = 10\). Now, the area is \(\pi r^2\), so \(10 \times 10 = 100\)…”

2. Self-Correction
A trigger word (“Wait”) signals that the model has detected a constraint violation.

Reasoning Token Stream (continued)

“…Wait. I misread the prompt. The \(20\pi\) was the area of a different circle, not the circumference. Let me re-evaluate.”


3. Alternative Strategy Exploration
The model discards the old path and restarts with the correct data.

Reasoning Token Stream (continued)

“…If the area is \(20\pi\), then \(\pi r_1^2 = 20\pi\), so \(r_1 = \sqrt{20}\). The prompt says the second radius is double the first…”

4. Final Verification
Before outputting, the model checks for internal consistency.

Reasoning Token Stream (continued)

“…So \(r_2 = 2\sqrt{20}\). Area = \(\pi(2\sqrt{20})^2 = 80\pi\). Does this make sense? Yes, \(80\pi\) is \(4\times\) the original area, matching the square of the radius doubling. Final Answer: \(80\pi\).”

Example: OpenAI o1, Claude 3.7 Sonnet & DeepSeek R1

OpenAI’s o1 and subsequent models use chain of thought as a mechanism for reasoning.

  • o1 introduced chain-of-thought-style reasoning at inference time in 2024; the model is trained with RL to reason before answering

Claude 3.7 Sonnet introduced chain of thought reasoning in 2025.

DeepSeek-R1 utilises chain of thought

  • R1 utilised Group Relative Policy Optimisation (GRPO), a memory-efficient fine-tuning technique that harnesses chain of thought at training and inference (covered later in this Module).

Example: Gemini Deep Think

Gemini Deep Think is a reasoning mode designed for hard problems in mathematics, coding, planning, and science.

Rather than producing one immediate answer, it uses:

  • Extended inference-time compute
    More “thinking time” before answering.

  • Parallel thinking
    Multiple hypotheses or solution paths are explored and compared.

  • Critique and revision
    Candidate answers can be checked, refined, or rejected.

  • RL-shaped reasoning behaviour
    Google describes novel reinforcement learning techniques that encourage the model to use longer, multi-step reasoning paths.

Example: Mathematics Gold Medal (Gemini Deep Think)

DeepMind’s Gemini Deep Think model wins a Gold Medal at the International Mathematical Olympiad (IMO)

Example: Gemini Deep Think (RL analogy)

Deep Think can be viewed as moving from answer prediction toward policy improvement over reasoning traces.

| RL concept | Deep Think analogy |
|---|---|
| State | Current problem + partial reasoning context |
| Action | Generate, branch, critique, revise, verify |
| Reward | Correctness, coherence, problem-solving success |
| Policy | Model’s learned strategy for choosing reasoning steps |
| Search | Parallel exploration of possible solution paths |

Example: Chain of thought in Anthropic’s Mythos

Anthropic’s Mythos model utilises human supervised chain of thought

  • Mythos is useful for penetration testing in cyber security

  • Utilises human supervised reasoning traces from chains of thought as a training target

Human supervision steers the model toward which vulnerabilities would be genuinely serious, dangerous bug classes, and potential security flaws

  • The model is trained to learn meaningful exploit hypotheses

  • Classical planning was originally utilised for penetration testing, through the development of PDDL represented attack models.

Faithfulness: Human Supervised Chain of Thought

Human supervised chain of thought models can create a safety challenge, as the model is trained to generate reasoning traces that humans expect

  • The risk is that the model learns to produce reasoning that looks acceptable rather than reasoning that is fully faithful

  • This can lead to what appears from a human perspective to be deceptive behaviour by the model, even though it makes sense from the attention-based transformer’s loss perspective.

Faithfulness: Open Problem in LLMs

Faithfulness of chain-of-thought reasoning under human supervision remains an unresolved problem

  • Most LLMs do not directly train on human-labelled reasoning traces as ground-truth targets (or do so only in limited ways)

This creates a key trade-off:

  • Performance: reasoning-trace supervision can improve capability (e.g. decomposition, verification, structure)

  • Faithfulness: the trace may not reflect the model’s true internal reasoning

A reasoning trace can be useful, plausible and reward-optimised without being causally faithful, i.e. improving reasoning \(\neq\) ensuring faithful explanations

Example: Edge Computing (Gemma 4: Open Multimodal Model)

DeepMind released Gemma 4 on 2 April, 2026.

Performance: The 31B model acts as a “server-grade” model for local use, while smaller variants (2B/4B) are optimised for mobile, edge and on-device.

Agentic Capabilities: Built for multi-step planning, autonomous actions, and tool use.

Multimodal: Natively supports text, images, and audio, along with dynamic vision resolution.

Large Context: Supports up to 256,000 token context windows

Open-Source: Released under the Apache 2.0 licence

Knowledge distillation is a key element in training Gemma 4.

Agentic (Multi-Agent) Transformers

Major paradigms

Major paradigms are emerging which reflect a convergence of RL with autoregression and transformer models.

  • Multi Actor-Critic Transformers (online, no dynamics).

  • World-Model RL (planning, includes dynamics)

  • Vision (multi-modal)

  • Agentic (roles)

Example: Actor-Critic Transformers

Actor-critic transformers are RL agents that use transformer architectures to parameterise

  • the actor, critic, or both,
  • while still relying on Bellman equations and policy gradients for learning.

Actor-critic transformers generally outperform Long Short-Term Memory (LSTM) networks in long-horizon Partially Observable Markov Decision Processes (POMDPs)

Three Layer Agent Stack

| Layer | Role (e.g. of an Agent or Robot) | Typical tools |
|---|---|---|
| Cognition / Reasoning | Query answering, Programming, multi-step thinking & planning: goal decomposition, tool selection, self-reflection, backtracking & correction, safety checks etc. | LLM with serialized Chain of Thought (GPT, Gemini, Claude, DeepSeek - Thinking, Programming & Operator modes; OpenClaw, MiniMax, etc.) |
| Semantic Policy (Vision–Language–Action) | Grounds instructions & scene into actionable subgoals / waypoints | Vision-Language-Action (VLA) Transformers (RT-X / RT-2-X) |
| Control / Dynamics | Execute precise motions, stabilize, react to dynamic feedback | Generative Model-based RL (STORM) |

Agent decomposition: Serialization

Agent decomposition can itself be learned through chain of thought reasoning

Attention-based transformers can infer useful sub-agents, subgoals, or roles through the serialization of context in the token stream. They can:

  • Decompose a difficult task into subproblems
  • Assign roles, tools, or expertise to sub-agents
  • Coordinate their interaction through shared context
  • Learn useful decompositions directly from experience or data

In this sense, agentic structure need not be engineered entirely by hand; it can be learned from human data, experience or self-supervision (simulation).

Example: OpenAI’s Codex Command Line Interface (CLI)

OpenAI’s Codex CLI serves as a primary example of how Chain-of-Thought (CoT) prompts can naturally transition into Agent Decomposition for software tasks.

  • CLI Workflow: By interfacing via the command line, we treat the LLM as a “Manager Agent” that breaks down high-level requirements into executable sub-tasks.

Architectural Training:

  • Prompting by Example: Use few-shot prompting \(+\) examples of successful system architectures to “train” the model on how to delegate.

  • Reasoning-to-Execution: CoT allows the agent to think (e.g., “I need a database schema first, then the API agents”) before the CLI executes the file creation.


Feedback Loop:

  • Self-Correction: The agent can inspect CLI error outputs to refine its decomposition logic in real-time.

  • Modularity: High-level CoT ensures each module is built in isolation, mimicking a multi-agent environment within a single interface.

“The conversation with the CLI essentially orchestrates a system of sub-agents directed by a serialised chain of thought.”

Agent decomposition challenge: Multi-Agent Systems

The decomposition of agents into multiple sub-agents, together with their respective roles, coordination mechanisms, and communication protocols remains an important challenge in the agentic approach.

Research and development into multi-agent systems has a long and fertile history in artificial intelligence

  • There are many solutions to agentic problems in the existing literature.

Example: OpenClaw: agent RL from live interaction

OpenClaw is a framework for training LLM-based agents, such as Codex CLI/Claude, online, from normal usage.

Core idea: after each action, the agent observes the next state:

  • user reply
  • tool output
  • terminal result
  • GUI change

These next states are treated as RL feedback signals.

OpenClaw: OpenClaw-RL

OpenClaw-RL allows agents to learn and improve simply from your conversations with them

  • OpenClaw uses a training objective based on a standard PPO-style clipped surrogate (covered later in this module).

OpenClaw: moltbook & ClawHub

OpenClaw has experienced record-breaking growth since its launch in November 2025

  • OpenClaw is frequently used to write agents which run on moltbook
  • ClawHub is a forum where you register and discover new skills.

OpenClaw: Sequential Markov Decision Process (MDP)

OpenClaw frames agent interaction as a sequential Markov Decision Process:

  • state = current conversational / tool context
  • action = generated response or tool-use step
  • transition = what happens next in the environment
  • reward = inferred from the resulting next state

OpenClaw reframes ordinary agent interaction as an MDP-like loop - rather than relying only on static preference datasets, it learns from what actually happens after the model acts.
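The state, action, transition and reward mapping above can be illustrated with a small interaction loop. This is a sketch under stated assumptions, not OpenClaw's code or API: `toy_environment` is a hypothetical lookup table standing in for the user, tools and terminal, and `infer_reward` is a crude keyword heuristic standing in for a learned judge.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # current conversational / tool context
    action: str      # generated response or tool-use step
    next_state: str  # what the environment returned
    reward: float    # inferred from next_state

def infer_reward(next_state):
    """Hypothetical judge: keyword heuristic in place of a learned reward estimator."""
    text = next_state.lower()
    if "error" in text:
        return -1.0
    if "thanks" in text or "passed" in text:
        return 1.0
    return 0.0

def toy_environment(action):
    """Hypothetical environment: maps an action to the observation that follows it."""
    replies = {
        "run tests": "tool output: 3 passed",
        "rm build/": "terminal: error: no such directory",
    }
    return replies.get(action, "user: ok")

def collect(actions, initial_state="user: please fix the build"):
    state, log = initial_state, []
    for action in actions:
        next_state = toy_environment(action)
        log.append(Transition(state, action, next_state, infer_reward(next_state)))
        state = next_state  # the next state becomes the new context
    return log

log = collect(["rm build/", "run tests", "summarise"])
```

Each `Transition` is the kind of tuple an RL learner consumes; the point of the sketch is that the reward is read off the next state rather than taken from a static preference dataset.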

OpenClaw: RL Techniques

Key RL idea: learn from experience, not just offline data. It utilises two kinds of signal from the next state:

  • evaluative signal: how good the action was
  • directive signal: how the action should be improved

OpenClaw’s emphasis is on a fully asynchronous setup:

  • servers,
  • rollout collection,
  • judging / reward estimation, and
  • training

all run concurrently

OpenClaw: RL Techniques (continued)

It connects LLM agents back to standard RL ideas

  • but in open-ended, tool-using, partially observed environments

Limitation:

  • reward estimation is indirect and depends on judges / heuristics
  • stability and safety rely on multi-agent architecture techniques

OpenClaw uses tools, shells, and GUIs.

Example: MiniMax / Forge: large-scale RL for agents

MiniMax is an RL framework for training agentic foundation models at scale.

  • The associated MiniMax M2.5 is a mixture-of-experts (MoE) foundation model trained across many real-world task environments
  • Forge is the RL framework/infrastructure (the trainer), which specialises in connecting models with live environments (terminals, browsers, code repositories etc.)

Main concern: scaling RL while balancing three competing goals:

  • throughput
  • stability
  • agent flexibility

MiniMax / Forge: large-scale RL for agents (continued)

The emphasis is less on a single elegant RL algorithm, and more on:

  • scalable infrastructure
  • asynchronous scheduling
  • efficient rollout + training pipelines
  • composite reward design

MiniMax / Forge: why it matters for RL

Reinforcement learning is being pushed into messy, long-horizon productivity tasks:

  • coding
  • search
  • office workflows
  • tool use

Key insight:

  • frontier RL is increasingly about training agents in many realistic environments

MiniMax can be used by no-code agent platforms such as MindStudio


RL can optimise task decomposition and tool-use policies for MiniMax / Forge

  • rewards may be composite, delayed, noisy, and environment-specific
  • scaling requires attention to off-policy effects, variance, and system design

Comparison with classical RL:

  • same basic loop: act \(\rightarrow\) observe \(\rightarrow\) evaluate \(\rightarrow\) update
  • different regime: huge contexts, tool chains, heterogeneous tasks, expensive rollouts

Planning & RL for Transformer Models

Planning & RL techniques for Transformers?

Sequential decision-making is on the ascendant:

  • RL, model-based control, and planning-like reasoning are central to agents, robotics, and tools built on tensor-based (transformer) architectures.

  • So “planning” and “RL” must live inside parallel, scalable neural network systems.

Limiting assumptions:

  • The typical teaching setup for classical planning & reinforcement learning (fully observable, deterministic, stationary, discrete) does not match many modern AI settings.

Planning & RL techniques for Transformers (continued)

Compute & tooling:

  • Tree search doesn’t map cleanly onto GPU/TPU throughput the way dense tensor operators do

  • Differentiability matters for end-to-end training, credit assignment, and integration with deep stacks

Question: What is required to run RL & Planning on GPUs?

Nvidia’s liquid-cooled GB200 Grace Blackwell (Tensor Core) Superchip can connect up to 576 Blackwell GPUs in a single domain with over 1 PB/s total bandwidth (image by 极客湾Geekerwan, CC BY 3.0)

What is required to run RL & Planning on a GPU?

  • During training?

  • At inference (query time)?

Value & Policy Approximation: Soft Actor-Critic (SAC)

Soft Actor-Critic (SAC): Differentiable Planning

Greedy next-step choice using max

  • defines the Bellman optimality operator used in Q-learning/DQN. \[ Q^*(s,a) = r(s,a) + \gamma \,\mathbb{E}_{s'\sim P}\!\left[\max_{a'} Q^*(s',a')\right] \]

Soft (Entropy-Regularized) Bellman Backup - “softmax” \[ \begin{align*} Q_{\text{soft}}(s,a) & = r(s,a) + \gamma \,\mathbb{E}_{s'\sim P}\!\left[ V_{\text{soft}}(s')\right] \quad \\[0pt] V_{\text{soft}}(s) & = \mathbb{E}_{a'\sim \pi(\cdot|s)}\!\left[ Q_{\text{soft}}(s,a') - \alpha \log \pi(a'|s)\right] \end{align*} \]

  • Adds entropy bonus (temperature \(\alpha\)) ⇒ softens the hard max.

  • As \(\alpha\!\to\!0\): \(V_{\text{soft}}(s)\!\to\!\max_{a'} Q(s,a')\) (recovers hard backup).
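The limit above can be checked numerically. The sketch below assumes discrete actions and the Boltzmann policy \(\pi(a \mid s) \propto \exp(Q(s,a)/\alpha)\), which is the policy that maximises the entropy-regularised objective; under that assumption \(V_{\text{soft}}\) equals \(\alpha \log \sum_a \exp(Q(s,a)/\alpha)\). The Q-values are illustrative.

```python
import math

def soft_value(q_values, alpha):
    """V_soft(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for pi proportional to exp(Q/alpha)."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)
    pi = [e / z for e in exps]
    # Skip zero-probability actions (p * log p tends to 0 as p tends to 0).
    return sum(p * (q - alpha * math.log(p)) for p, q in zip(pi, q_values) if p > 0.0)

q = [1.0, 2.0, 3.0]  # illustrative Q(s, a) values for three discrete actions
for alpha in (1.0, 0.1, 0.01):
    # The soft value approaches max(q) = 3.0 from above as alpha shrinks.
    print(alpha, soft_value(q, alpha))
```

For large \(\alpha\) the entropy bonus lifts the soft value well above the hard max, and the policy stays spread over all actions; as \(\alpha\) shrinks the policy concentrates on the best action and the hard backup is recovered.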

Soft Actor-Critic (SAC): Implementation note

A min over two critics, \(\min\{Q_{\theta_1},Q_{\theta_2}\}\), is often used to reduce overestimation bias (Double-Q trick), not as the backup operator.

  • The soft (entropy-regularised) relaxation of the hard max allows gradients to flow through the planning step.
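To make the role of the two critics concrete, here is a minimal sketch of a single-sample SAC critic target: the minimum of the two critic estimates replaces \(Q_{\text{soft}}\) inside the soft value, while the backup itself remains the soft Bellman backup. The `done` flag for terminal transitions and the numeric inputs are assumptions added for illustration.

```python
def sac_target(reward, gamma, q1_next, q2_next, log_prob_next, alpha, done=False):
    """y = r + gamma * (min(Q1, Q2)(s', a') - alpha * log pi(a'|s')), with a' sampled from pi."""
    if done:
        return reward  # assumption: no bootstrapping past a terminal state
    soft_v = min(q1_next, q2_next) - alpha * log_prob_next  # one-sample estimate of V_soft(s')
    return reward + gamma * soft_v

# Illustrative numbers: the pessimistic critic (4.0) is used, not the optimistic one (5.0).
y = sac_target(reward=1.0, gamma=0.99, q1_next=5.0, q2_next=4.0, log_prob_next=-1.2, alpha=0.2)
```

Both critics are then regressed towards the same target `y`, which is how the min reduces overestimation without changing the backup operator.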

Example: Table Tennis Player using Soft Actor-Critic (SAC)

Sony-AI’s Ace player is trained entirely on single shots in simulation using the Soft Actor-Critic (SAC) algorithm to return the ball with the desired skill, beating elite players.

Imagination & Latent Rollouts: STORM

STORM (Stochastic Transformer World Models): Atari

Stochastic Transformer-based wORld Model (STORM) targets the Atari 100k benchmark games, like MuZero.

STORM: How it works

A world model-based RL method for learning in imagination (simulation inside a learned world model)

It combines:

  • Categorical Variational Autoencoder (VAE): Encodes raw images into discrete latent “tokens.”
  • Transformer Dynamics: An attention-based sequence model that predicts future states (imagination).
  • Stochastic Latent Variable: Accounts for environmental uncertainty and randomness.
  • Actor-Critic Agent: A separate neural network learns the optimal policy by “practising” entirely within the Transformer’s imagined rollouts.

STORM: Imagination & Latent Rollouts

Imagination: Latent rollouts in STORM
  1. Start: Use a latent token (from the VAE) as the initial “imagined” state.
  2. Action: The Actor selects an action for this state.
  3. Dynamics: The Transformer predicts a stochastic (probabilistic) distribution for the next state and reward.
  4. Judgement: The Critic estimates the long-term value of this new state.
  5. Loop: Sample a token from the distribution and feed it back into Step 2

Repeating the steps in imagination forms a trajectory.

This allows the agent to learn a policy (how to act) or a value function (how “good” a state is) purely by practising inside its own “head”.

Imagination is technically known as latent rollouts, a widely accepted approach for generative model-based reinforcement learning.
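The five rollout steps can be sketched as a short loop. This is a toy illustration, not STORM's implementation: `actor`, `dynamics` and `critic` are hypothetical hand-written stand-ins for the learned networks, the latent space has only four tokens, and the stand-in dynamics looks only at the current token, whereas a transformer world model attends over the whole history.

```python
import random

NUM_TOKENS, NUM_ACTIONS = 4, 2  # hypothetical sizes for a toy latent space

def actor(z, rng):
    """Stand-in policy: picks an action uniformly at random."""
    return rng.randrange(NUM_ACTIONS)

def dynamics(z, a):
    """Stand-in world model: a distribution over next tokens plus a predicted reward."""
    target = (z + 1 + a) % NUM_TOKENS
    probs = [0.8 if k == target else 0.2 / (NUM_TOKENS - 1) for k in range(NUM_TOKENS)]
    reward = 1.0 if target == 0 else 0.0
    return probs, reward

def critic(z):
    """Stand-in value estimate for a latent token."""
    return 0.5 if z == 0 else 0.0

def imagine(z0, horizon, seed=0):
    rng = random.Random(seed)
    z, trajectory = z0, []                 # 1. Start from an initial latent token
    for _ in range(horizon):
        a = actor(z, rng)                  # 2. Action
        probs, r = dynamics(z, a)          # 3. Dynamics: stochastic next-state distribution
        z_next = rng.choices(range(NUM_TOKENS), weights=probs)[0]  # 5. Sample the next token
        trajectory.append((z, a, r, z_next, critic(z_next)))       # 4. Judgement of the new state
        z = z_next                         # feed the sample back into step 2
    return trajectory

traj = imagine(z0=0, horizon=5)
```

No environment is touched inside `imagine`: every transition comes from the model, which is why many such rollouts can be batched and run in parallel on an accelerator.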

STORM: Motivation

Transformers are strong at long-range sequence modelling.

Stochastic latents help capture uncertainty and non-determinism

  • i.e. a stochastic latent says, given this past, there may be several plausible next hidden states

STORM achieves 126.7% mean human-normalised score on the Atari 100k benchmark, a state-of-the-art result among methods without look-ahead search.

MuZero branches over actions with MCTS, whereas STORM rolls forward imagined latent sequences in a transformer world model

  • This makes STORM more GPU-friendly and more naturally aligned with end-to-end differentiable learning.

Comparison: MuZero versus STORM

| Aspect | MuZero | STORM |
|---|---|---|
| Core idea | Combines learning + Monte-Carlo Tree Search (MCTS) in latent space | Learns a stochastic transformer world model and performs imagination rollouts |
| Planning form | Expands a search tree: \(s_0 \rightarrow s_1, s_2, \ldots\) | Rolls out a latent sequence using a transformer world model: \(z_t \rightarrow z_{t+1} \rightarrow z_{t+2} \rightarrow \cdots\) |
| Model components | Representation \(h(o_t)\), Dynamics \(g(s,a)\), Prediction \(f(s)\) | Tokenizer / latent encoder, stochastic transformer dynamics model, reward/value/policy heads |
| Computation | Search-based, often CPU-heavy, not end-to-end differentiable through search | GPU-friendly, transformer-based, trained through batched imagined trajectories |
| Learning loop | Tree search generates improved policies; network distils them via supervised losses | Actor–critic trained on imagined trajectories generated by the transformer world model |
| Search structure | Discrete branching, value backups | Sequential imagination, typically no explicit branching tree search |
| Output policy | Derived from visit counts in the search tree | Learned directly through gradient updates from imagined rollouts |
| Analogy | Plan by explicit search | Learn through transformer imagination |

Policy Gradient: Proximal Policy Optimisation (PPO)

PPO Fine tuning ChatGPT & human feedback (Revisited)

  • Proximal policy optimisation (PPO) is used by GPT-3.5 onwards

PPO in Agentic AI

PPO is becoming popular for Agentic AI modes, and is used in

  • GPT’s Operator, and
  • Claude’s Computer Use modes

PPO is also becoming popular for World Models in robotics, and is used in

  • Vision, Language & Action (VLA) attention-based transformers
  • Robot RT-X transformers

Proximal Policy Optimisation (PPO) for LLMs

PPO treats the LLM as a policy over tokens and updates it using reward feedback.

The purpose is to make better responses more likely, but without letting the model drift too far in a single update. This makes reinforcement-learning-based fine-tuning much more stable.

  • generate responses
  • score them with a reward model or preference signal
  • update the model toward higher-reward outputs
  • constrain the update so the change remains small

PPO is a stability-oriented way to nudge an LLM toward higher-reward behaviour.

Proximal Policy Optimisation (PPO) for LLMs (continued)


Goal: Stable, sample-efficient policy improvement

Idea: Constrain how far the new policy moves from the old one at each update during the actor-critic cycle

Policy Objective Functions (Revisited)

From a policy gradient perspective, the true objective function \(J^{{\pi}_{\theta}}(\theta)\) over policy \(\pi_{\theta}\) is defined as follows

\[ J^{{\pi}_{\theta}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [G_0]=V^{\pi_{\theta}}(d_0) \]

where \(d_0\) is the distribution over initial states at the start of each episode \(\tau\)

However, there are two problems in the actor-critic setting:

1. Computing \(J^{{\pi}_{\theta}}(\theta)\) exactly would require integrating over all trajectories, \(\tau\), of the current policy, which is impractical

2. If we update the parameters \(\theta\), it will affect the objective value during the optimisation process, leading to (circular) feedback

We therefore need a surrogate objective independent of the trajectory distribution under the new policy \(\color{blue}{\pi_{\theta}}\) we are building

Surrogate Objective

From the policy-gradient theorem, we can define the importance ratio

\[ r_t(\theta) \;=\; \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \qquad \]

We now define the surrogate objective \(L_{PG}\) for the true objective

\[ L_{PG}(\theta) \;=\; \mathbb{E}_t\!\big[\, r_t(\theta)\,\hat A_t \,\big] \]

  • Where \(\hat{A}_t\) captures how much better action \(a_t\) was than the state’s average

  • \(\hat{A}_t\) is an estimator of the true advantage function \(A^{\pi}\)

  • \(A^{\pi}(s_t,a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)\)

Clipped Surrogate (Core Idea)

Kullback-Leibler (KL) divergence theory tells us we want improvement without overly large steps in policy space, so we define

\[ L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\! \left[ \min\!\Big( r_t(\theta)\,\hat A_t,\; \mathrm{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat A_t \Big) \right] \]

  • If \(r_t(\theta)\) leaves the interval \([1-\epsilon,\,1+\epsilon]\), the objective is clipped.

  • Typical range: \(\epsilon \in [0.1,\,0.2]\).

  • Prevents destructive updates while preserving ascent direction
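A minimal sketch of the clipped surrogate, assuming we already have per-timestep log-probabilities under the new and old policies and advantage estimates (the argument names are illustrative):

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Monte Carlo estimate of L^CLIP from per-timestep log-probabilities and advantages."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # r_t(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r_t, 1 - eps, 1 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Three single-sample cases with eps = 0.2:
up_good = clipped_surrogate([math.log(1.5)], [0.0], [1.0])    # ratio 1.5, A > 0: capped at 1.2
up_bad = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])    # ratio 1.5, A < 0: not capped, -1.5
down_bad = clipped_surrogate([math.log(0.5)], [0.0], [-1.0])  # ratio 0.5, A < 0: capped at -0.8
```

The asymmetry is the design choice: the `min` removes the incentive to push the ratio further once a move has already been rewarded, but it never hides a penalty for a move that made things worse.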

The clipped surrogate objective in PPO plays a stabilising role similar to that of compatible function approximation: both aim to keep policy updates in a regime where the gradient estimate remains a reliable guide to the true policy improvement direction.


Complete PPO loss

\[ L^{\text{PPO}}(\theta) = \mathbb{E}_t\!\Big[{\color{blue}{L^{\text{CLIP}}(\theta)}} - c_1\,{\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}} + c_2\,\mathcal{H}\!\left[\pi_\theta(\cdot\mid s_t)\right] \Big] \]

where \(c_1, c_2\) are coefficients

The actor’s policy-gradient term (surrogate objective) is \(\color{blue}{L^{CLIP}(\theta)}\)

  • This encourages the policy to increase the probability of actions with positive advantage and decrease it for negative ones

The critic’s value-function loss is \(\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}\)

  • This trains the network to predict correct returns (mean-squared error).
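The sign conventions in the complete objective are easy to get wrong, so here is a small sketch that combines the three terms per timestep. The default coefficients `c1=0.5` and `c2=0.01` are illustrative assumptions, and the inputs are assumed to be precomputed per-timestep quantities.

```python
def ppo_objective(clip_terms, values, value_targets, entropies, c1=0.5, c2=0.01):
    """Batch estimate of L^PPO = E_t[L^CLIP - c1 * (V - V_target)^2 + c2 * H] (to be maximised)."""
    total = 0.0
    for l_clip, v, v_target, h in zip(clip_terms, values, value_targets, entropies):
        total += l_clip - c1 * (v - v_target) ** 2 + c2 * h
    return total / len(clip_terms)

# Two timesteps with illustrative numbers.
objective = ppo_objective(
    clip_terms=[1.2, -0.8],
    values=[1.0, 0.0],
    value_targets=[2.0, 0.0],
    entropies=[0.7, 0.7],
)
loss = -objective  # gradient-descent optimisers minimise the negated objective
```

The value error enters with a minus sign because the objective is maximised, while the entropy bonus enters with a plus sign.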

PPO Entropy bonus

The entropy bonus \(\mathcal{H}\) encourages exploration

\[ \mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big] = - \sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t) \]

The entropy term encourages exploration by rewarding stochastic (uncertain) policies.

  • It’s high when the policy is uncertain or “spread out” (exploratory).

  • It’s low when the policy is confident or deterministic.

  • The dot “\(\cdot\)” in \(\pi_{\theta}(\cdot \mid s_t)\) means over all possible actions, i.e. the vector of probabilities \(\pi_{\theta}(a_1 \mid s_t), \pi_{\theta}(a_2 \mid s_t), \ldots\)

In practice this maintains stochasticity until policy becomes more confident or deterministic.
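A short numerical illustration of the entropy formula for a discrete policy follows; the probability vectors are made up for the example.

```python
import math

def entropy(probs):
    """H[pi(.|s)] = -sum_a pi(a|s) log pi(a|s), with 0 * log 0 treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal for 4 actions: log(4)
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])   # lower: the policy is nearly deterministic
h_det = entropy([1.0, 0.0, 0.0, 0.0])          # zero: fully deterministic
```

With a positive coefficient \(c_2\), the uniform policy earns the largest bonus and the deterministic one earns none, which is what keeps the policy exploratory early in training.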

Generalised Advantage Estimation (GAE)

In practice, PPO uses a low-variance, low-bias estimate of the advantage \(A^\pi(s_t,a_t)\).

TD error: \[ \delta_t \;=\; r_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t) \]

GAE-\(\lambda\): \[ \hat A_t^{(\lambda)} \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l} \;=\; \delta_t + \gamma\lambda\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots \]

Return/target used for critic \[ \hat V_t^{\text{target}} \;=\; \hat A_t^{(\lambda)} + V_\phi(s_t) \]

  • \(\lambda\in[0,1]\) trades bias \(\leftrightarrow\) variance, typical PPO: \(\gamma \approx 0.99\), \(\lambda \approx 0.95\).
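In code, the GAE sum is usually computed with a backward recursion, \(\hat A_t = \delta_t + \gamma\lambda\,\hat A_{t+1}\), over a finite trajectory segment. The sketch below assumes the infinite sum is truncated at the end of the segment and that `last_value` supplies the bootstrap value \(V_\phi(s_T)\); the function and argument names are illustrative.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, value targets) for one trajectory segment.

    values[t] is V(s_t); last_value is V(s_T), the bootstrap value after the final step.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error delta_t
        running = delta + gamma * lam * running              # A_t = delta_t + gamma * lam * A_{t+1}
        advantages[t] = running
        next_value = values[t]
    targets = [a + v for a, v in zip(advantages, values)]    # V_target = A_t + V(s_t)
    return advantages, targets

# lam = 1 gives Monte Carlo-style advantages; lam = 0 gives one-step TD errors.
adv_mc, _ = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
adv_td, _ = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0, gamma=1.0, lam=0.0)
```

The two extremes show the trade-off: \(\lambda = 1\) sums all future TD errors (low bias, high variance), while \(\lambda = 0\) keeps only the immediate one (low variance, more bias from the critic).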

PPO Algorithm

PPO Algorithm

Repeat

\(\;\;\;\) Collect trajectories with \(\pi_{\theta_{\text{old}}}\)

\(\;\;\;\) Compute returns and advantages using GAE-\(\lambda\)

\(\;\;\;\) Optimise \(L^{\text{CLIP}}\) for \(K\) epochs over mini-batches

\(\;\;\;\) Update old params: \(\theta_{\text{old}} \leftarrow \theta\)

Until a stop condition holds (e.g., total timesteps \(\geq T\), or moving-average return \(\geq R_{\text{target}}\), or max iterations reached)

  • Multiple epochs over the same batch are okay because clipping limits drift
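To see the loop run end to end, here is a deliberately tiny sketch: PPO-style clipped updates on a two-armed bandit with a softmax policy. Several simplifications are assumptions made for brevity and are not part of the algorithm above: there is no state, no critic network and no GAE (the batch-mean reward serves as the baseline), the arm rewards are deterministic, and the gradient of the clipped objective is written out by hand instead of using automatic differentiation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ppo_bandit(iterations=50, batch=64, epochs=4, eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]    # policy logits (the actor)
    rewards = [0.0, 1.0]  # hypothetical deterministic arm rewards: arm 1 is better
    for _ in range(iterations):
        # Collect a batch with the old policy.
        old_probs = softmax(theta)
        actions = [0 if rng.random() < old_probs[0] else 1 for _ in range(batch)]
        # Advantage = reward minus batch-mean baseline (a stand-in for GAE).
        baseline = sum(rewards[a] for a in actions) / batch
        advs = [rewards[a] - baseline for a in actions]
        # K epochs of gradient ascent on the clipped surrogate over the same batch.
        for _ in range(epochs):
            probs = softmax(theta)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advs):
                ratio = probs[a] / old_probs[a]
                # When the min selects the clipped term, the gradient is zero.
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue
                for i in range(2):
                    # d ratio / d theta_i = ratio * (1[i == a] - probs[i])
                    grad[i] += adv * ratio * ((1.0 if i == a else 0.0) - probs[i]) / batch
            theta = [t + lr * g for t, g in zip(theta, grad)]
        # theta_old <- theta happens implicitly when old_probs is recomputed.
    return softmax(theta)

final_probs = ppo_bandit()
```

After the first couple of epochs in each iteration the ratios leave \([1-\epsilon, 1+\epsilon]\) and the remaining epochs contribute no gradient, which illustrates why reusing a batch for several epochs is safe.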

Why PPO works (intuition)

First-order solution method

  • No additional constraint solving is required, which would otherwise introduce second-order computations

Trust-region-like behaviour via clipping

  • Optimisation within a trust region takes only steps that stay within a region where the local approximation is reliable (founded on KL divergence theory)

Robust across discrete/continuous control

  • Strong baseline performance in practice

Policy Optimisation Techniques

DeepSeek-R1 is trained with Group Relative Policy Optimization (GRPO), which forms the core of its reasoning training and operates over the reasoning token stream.

  • Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv:2402.03300v3, pp. 1-30, 2024

  • DeepSeek-R1-Zero uses a “pure RL” approach: self-supervised trial and error, without the need for human supervision, rather like MuZero.

  • GRPO can be used for knowledge distillation, where a student model mimics a teacher model via supervised fine-tuning (SFT). The “Teacher” model acts as the Reward Function.
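The “group relative” part of GRPO can be illustrated with a small sketch. Several responses are sampled for the same prompt, and each response's advantage is its reward normalised by the group's mean and standard deviation, so no separate critic network is needed. The function name `group_relative_advantages`, the use of the population standard deviation, and the small `eps` stabiliser are assumptions made for this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise the rewards of one group of sampled responses.

    A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption here)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1 = correct).
group_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive positive advantages and the rest negative ones; if every response in the group gets the same reward, all advantages are zero and the group provides no learning signal.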

Policy Optimisation Techniques (continued)

Direct Preference Optimisation (DPO)

Policy optimisation utilisation

Foundation Model | RL/Query Optimiser | Example
--- | --- | ---
Attention-based transformer / LLM | PPO, DPO, GRPO / CoT | GPT, Gemini, Claude, DeepSeek-R1 - Thinking, Programming, Operator and Computer Use modes; OpenClaw-RL etc.
Attention-based transformer / Agentic & Vision + Language + Action (VLA) | PPO | OpenClaw-RL, MiniMax, VLA, RT-X

Chain of thought (CoT)

Multi Actor-Critic: IMPALA & V-trace

Distributed RL: IMPALA & V-trace

Importance Weighted Actor-Learner Architecture (IMPALA)

  • In production in DeepMind, OpenAI & Google DeepRL

  • Decoupled actor-learner: many CPU actors generate trajectories under behaviour policy \(\mu\); a central GPU learner updates \(\pi\).

  • High throughput via batched unrolls (e.g., length \(n\)); supports RNNs (LSTM) and multi-task.

  • Challenge: policy lag \(\rightarrow\) off-policy data.

  • Solution: V-trace targets for stable off-policy learning.

Off-policy with correction that handles policy lag without sacrificing throughput

IMPALA was developed at DeepMind

Distributed RL: V-trace essentials

Let the truncated importance ratios be \(\displaystyle \rho_t=\min\!\left(\bar{\rho}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\) and
\(\displaystyle c_t=\min\!\left(\bar{c}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\), with \(\bar{\rho}\ge \bar{c}\)

Value target (per time \(s\))
\[ \delta_t^{V} \;=\; \rho_t\Big(r_t + \gamma\,V(x_{t+1}) - V(x_t)\Big),\;\;v_s \;=\; V(x_s) \;+\; \sum_{t=s}^{s+n-1} \gamma^{\,t-s} \!\left(\prod_{i=s}^{t-1} c_i\right)\! \delta_t^{V} \]

Policy gradient with V-trace advantage
\[ A_t^{\text{V-trace}} \;=\; r_t + \gamma\,v_{t+1} - V(x_t), \qquad \nabla_\theta J \;\propto\; \rho_t\,\nabla_\theta \log \pi_\theta(a_t|x_t)\,A_t^{\text{V-trace}} \]

Loss (typical)
\[ \mathcal{L} \;=\; \underbrace{\mathbb{E}\big[(v_s - V(x_s))^2\big]}_{\text{value}} \;-\; \beta\,\underbrace{\mathbb{E}\big[\rho_t \log \pi(a_t|x_t)\,A_t^{\text{V-trace}}\big]}_{\text{policy}} \;-\; \eta\,\underbrace{\mathbb{E}\big[\mathcal{H}(\pi(\cdot|x_t))\big]}_{\text{entropy}} \]


Why it works

  • Clipped importance-sampling (IS) ratios \((\rho_t, c_t)\) bound the variance of the off-policy correction, at the cost of some bias.

  • Multi-step correction handles policy lag without sacrificing throughput.
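The V-trace value targets can be computed with one backward pass, because the sum in \(v_s\) satisfies \(v_s - V(x_s) = \delta_s^{V} + \gamma\,c_s\,\big(v_{s+1} - V(x_{s+1})\big)\). The sketch below is a plain-Python illustration under stated assumptions: the sum runs to the end of one unroll, there are no episode terminations inside it, and the raw ratios \(\pi/\mu\) are supplied by the caller. The function name `vtrace_targets` and its argument layout are hypothetical.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Return (vs, advantages) for one unroll of length n.

    rewards, values, ratios: lists of length n (values[t] = V(x_t),
    ratios[t] = pi(a_t|x_t) / mu(a_t|x_t)); bootstrap_value = V(x_n).
    """
    n = len(rewards)
    rhos = [min(rho_bar, w) for w in ratios]  # truncated rho_t
    cs = [min(c_bar, w) for w in ratios]      # truncated c_t
    next_values = values[1:] + [bootstrap_value]
    # delta_t^V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = [rho * (r + gamma * nv - v)
              for rho, r, nv, v in zip(rhos, rewards, next_values, values)]
    vs = [0.0] * n
    acc = 0.0  # holds v_{s+1} - V(x_{s+1}) while iterating backwards
    for s in reversed(range(n)):
        acc = deltas[s] + gamma * cs[s] * acc
        vs[s] = values[s] + acc
    # V-trace advantage: r_t + gamma * v_{t+1} - V(x_t)
    next_vs = vs[1:] + [bootstrap_value]
    advantages = [r + gamma * nv - v
                  for r, nv, v in zip(rewards, next_vs, values)]
    return vs, advantages


# On-policy check: all ratios are 1, so vs equals the n-step return.
vs_on, adv_on = vtrace_targets([1.0, 1.0, 1.0], [0.5, 0.2, 0.1], 0.0,
                               [1.0, 1.0, 1.0], gamma=1.0)
```

In the policy-gradient term each advantage is additionally weighted by \(\rho_t\), as in the formula above. When the behaviour policy assigns an action far more probability than the target policy (ratio near zero), the correction vanishes and \(v_s\) falls back to \(V(x_s)\).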

Representative efficient actor-critic methods

Category | Example algorithms | Key strengths
--- | --- | ---
On-policy | PPO | Stable, parallelizable, easy; standard in LLM fine-tuning (RLHF)
Off-policy (stochastic) | SAC | Maximum-entropy objective → robust exploration; excellent data efficiency
Distributed | IMPALA, V-trace | Massive scalability; production in DeepMind, OpenAI, Google DeepRL

Efficiency and Performance Comparison

Dimension | MuZero / Sampled MuZero / EfficientZero | PPO / SAC / IMPALA
--- | --- | ---
Sample efficiency | Excellent when planning can reuse a model (Atari, board games) | High for off-policy (SAC); moderate for PPO
Wall-clock / GPU efficiency | Poor (search is serial & CPU-bound) | Very good (fully parallel on GPU)
Robustness & stability | Sensitive to model errors / rollout length | Stable with tuned hyper-parameters
Scalability to real-time tasks | Hard (search latency) | Good; used in robotics, continuous control, large-scale RL (IMPALA, V-trace)
Best-case performance | Outstanding in structured domains (Go, Atari) | State-of-the-art in most continuous-control and real environments

Vision & Robot Transformers: ViTs & RT-X

Example: Vision Transformers (ViTs)

Since ~2020, attention-based Transformers have been competing with, and often surpassing, CNNs on large-scale vision benchmarks.

Image \(\rightarrow\) patches \(\rightarrow\) tokens \(\rightarrow\) transformer

  • Patchify the image: split an image of size \(H \times W \times C\) into non-overlapping patches of size \(P \times P\).
    Number of tokens \(N=\frac{HW}{P^2}\).

  • Linear patch embedding: flatten each patch \(x_i\in\mathbb{R}^{(P^2C)}\) and project
    \(z_i^0 = W_E x_i + b_E \in \mathbb{R}^D\).
    (Often implemented as a conv with kernel/stride \(P\).)
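The patchify step can be sketched in plain Python on a nested-list image. This is only an illustration of the token bookkeeping: the function name `patchify` and the toy image are assumptions, the learned projection \(W_E x_i + b_E\) is omitted, and a real model would use array operations or a strided convolution instead.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened P x P patches.

    Returns a list of N = (H * W) / P^2 tokens, each of length P * P * C,
    ordered row by row across the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size P")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch: rows, then pixels in the row, then channels.
            flat = [channel
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for channel in pixel]
            tokens.append(flat)
    return tokens


# Toy image: H=4, W=6, C=3, with every value unique so patches are easy to inspect.
H, W, C, P = 4, 6, 3, 2
image = [[[(r * W + c) * C + k for k in range(C)] for c in range(W)]
         for r in range(H)]
tokens = patchify(image, P)  # 6 tokens, each of dimension P*P*C = 12
```

The same arithmetic gives the familiar figure for a 224 × 224 image with \(P = 16\): \(224 \cdot 224 / 16^2 = 196\) patch tokens.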

Vision Transformers - Tokens and Transformer Encoding

  • Add a class token and positions: prepend \([\mathrm{CLS}]\) and add learnable positions
    \(\tilde{z}_i^0 = z_i^0 + p_i\), with sequence \([\tilde{z}_{\text{CLS}}^0, \tilde{z}_1^0,\ldots,\tilde{z}_N^0]\).

  • Transformer encoder stack (repeated \(L\) times):
    \(\text{SA}(X) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\) with multi-head self-attention,
    then MLP; both with residuals + layer norm.

  • Prediction head: take the final \([\mathrm{CLS}]\) (or pool all tokens) \(\to\) linear head \(\to\) class probs.

  • Note: smaller \(P\) \(\Rightarrow\) more tokens (detail ↑, cost ↑); larger \(P\) \(\Rightarrow\) fewer tokens (detail ↓, cost ↓).
    Variants like Swin use local windows with shifts for scalability; ViT uses global attention.
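The self-attention formula used in the encoder stack can be written out for a single head as a small sketch. It assumes the queries, keys and values have already been computed (the learned projections, multi-head split, residuals and layer norm are left out), and the function names `softmax` and `attention` are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: list of query vectors, K: list of key vectors (dimension d_k),
    V: list of value vectors (one per key).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# A zero query attends uniformly, so the output is the mean of the values.
mixed = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [4.0, 8.0]])
```

Because every token attends to every other token, the cost grows quadratically with the number of tokens \(N\), which is the cost side of the patch-size trade-off noted above.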

Example: RT-X Robot Transformers

RT-X:

Increasingly, transformers are also being used for robotics (e.g. RT-1, RT-2, RT-X from Google DeepMind)

  • Large-scale imitation learning across many robots.

The “RT-family” includes hybrid attention across vision, language, and control.

  • These models utilise Vision-Language-Action (VLA) transformers

References