Last Module: Deep learning & Tree Search
This Module: Planning & RL for Transformers
What is covered? How to implement planning and RL on GPU/TPUs.
Attention-based transformers can simulate multiple futures inside a hidden state.
Self-attention makes lookahead possible:
Implicit planning is an amortised form of planning, i.e. pay the cost of planning upfront during training, so that at test time the system can act without running a full planner or search each time.
Autoregression models a sequence by predicting each element from all previous elements, i.e. \(x_t = f(x_{1:t-1}) + \epsilon\) where \(\epsilon\) is the noise or uncertainty. It factorises a joint distribution as follows:
\[ p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}) \]
where \(p(x_{1:T})\) is the probability of seeing the entire sequence.
Used in time series, sequence modelling, and autoregressive transformers.
At inference time, predictions are fed back step-by-step to generate sequences.
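A minimal sketch of this step-by-step feedback loop, assuming a placeholder `next_token_distribution` function standing in for a trained autoregressive model:

```python
import numpy as np

def next_token_distribution(prefix, vocab_size=5):
    # Placeholder for a trained model p(x_t | x_{1:t-1});
    # here we simply return a uniform distribution for illustration.
    return np.ones(vocab_size) / vocab_size

def generate(prefix, steps, rng=np.random.default_rng(0)):
    """Sample a sequence autoregressively: each new token is drawn from
    p(x_t | x_{1:t-1}) and appended to the context before the next step."""
    seq = list(prefix)
    for _ in range(steps):
        probs = next_token_distribution(seq)
        x_t = rng.choice(len(probs), p=probs)   # sample x_t ~ p(. | x_{1:t-1})
        seq.append(int(x_t))                    # feed the prediction back in
    return seq

print(generate([0], steps=10))
```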
To enforce causal structure, transformers use masking (next step).
A causal latent transformer is a transformer that models sequences of latent states, using attention masks to enforce causality or planning structure, just like in autoregressive language models.
Causal latent transformers appear in world-model RL, video generation, and planning-as-generation frameworks (implicit planning in neural networks).
Transformers model sequences using self-attention, where each token computes weighted interactions with other tokens.
Key components:
Token embeddings and positional encodings
Self-attention:
\[ \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V \]
where \(Q=\mathit{queries}\), \(K=\mathit{keys}\), \(V=\mathit{values}\), \(d_k\) is the dimension of the key vectors, and \(\top\) denotes the matrix transpose.
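As an illustrative numpy sketch of this formula (random matrices, single head, no batching):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

T, d_k, d_v = 4, 8, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v))
print(attention(Q, K, V).shape)               # (4, 8)
```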
Causal masking is used in autoregressive transformers:
Each token attends only to past tokens
Enforces the autoregressive condition
\[ x_t \sim p(x_t \mid x_{1:t-1}) \]
where \(\sim\) means sampled from.
This condition means that when predicting the next token, the model is only allowed to use earlier tokens, not future ones.
This masked attention mechanism directly carries over to masked latent transformers used in world-model RL.
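A minimal sketch of the causal mask itself (self-contained, for illustration only): positions above the diagonal are set to \(-\infty\) before the softmax, so each token can only attend to itself and earlier tokens.

```python
import numpy as np

def causal_masked_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask: position t may only
    attend to positions <= t, enforcing the autoregressive condition."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above diagonal = future
    scores = np.where(mask, -np.inf, scores)           # future positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) = 0, so no leakage
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(causal_masked_attention(Q, K, V).shape)           # (4, 8)
```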
The original Transformer was introduced as an attention-only sequence model, initially for natural language translation
Importantly, transformers have shown that multiple languages can be treated as sequences or structured token streams
A useful visualisation approach for understanding attention-based transformers is as follows:
Horizontal Axis (Tokens/Sequence): Each position corresponds to a specific token (word or sub-word) in your input.
Vertical Axis (Layers/Depth): As you move “up” the network layers, the model is building more abstract, contextual meanings. The bottom layers might focus on simple syntax, while higher layers handle complex relationships and intent.
Attention Intersections: The “links” connecting horizontal tokens across the vertical layers represent the Attention Weights, showing which other words the model is looking at to understand the current word.
This 3Blue1Brown video demonstrates the horizontal (tokens) / vertical (layers) visualisation.
For large language models (LLMs) in natural language processing
Tokens = words
Mask = causal mask (can’t see the future)
In masked latent transformers:
Tokens = latent states \(z_t\) (learned hidden representations)
Mask = ensures correct temporal, causal, or planning structure
Pretrained LLMs can also harness PDDL for generalised planning by being prompted with a PDDL domain and a small number of training problem instances.
A solution to generalised planning finds a reusable policy that solves many related problem instances, not just one specific initial state.
GPT-4 is used to synthesise and output a Python generalised planner; the program is tested, and if it fails, GPT-4 is re-prompted with debugging feedback.
GPT-4 outperforms Fast Downward, one of the leading domain-independent classical planners, on generalised-planning benchmarks
GPT-4 is shown to be 1-2 orders of magnitude faster at inference but does not quite match benchmark coverage (it misses a few problems), although subsequent models are likely to improve on this.
Silver et al. (2024) Generalized planning in PDDL domains with pretrained large language models https://ojs.aaai.org/index.php/AAAI/article/view/30006
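A hypothetical sketch of the generate-test-debug loop described above; `ask_llm`, `extract_python`, and `solves` are placeholder callables, not the actual implementation from Silver et al. (2024):

```python
# Hypothetical sketch of the generate-test-debug loop for LLM generalised planning.
def synthesise_generalised_planner(domain_pddl, train_problems, ask_llm,
                                   extract_python, solves, max_rounds=4):
    prompt = f"Write a Python generalised planner for this PDDL domain:\n{domain_pddl}"
    for _ in range(max_rounds):
        planner_code = extract_python(ask_llm(prompt))
        failures = [p for p in train_problems if not solves(planner_code, p)]
        if not failures:
            return planner_code                   # passes all training instances
        # Re-prompt with debugging feedback about a failing instance.
        prompt += f"\nThe planner failed on: {failures[0]}. Please fix it."
    return None
```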
Reasoning in attention-based transformers is implicit because it happens inside a single, serialised token stream
1. Initial Logic & Calculation
The model begins solving a geometry puzzle by mapping variables.
Reasoning Token Stream
“To find the area, I first need the radius. The circumference is \(20\pi\), so \(2\pi r = 20\pi\), which means \(r = 10\). Now, the area is \(\pi r^2\), so \(\pi \times 10 \times 10 = 100\pi\)…”
2. Self-Correction
A trigger word (“Wait”) signals that the model has detected a constraint violation.
Reasoning Token Stream (continued)
“…Wait. I misread the prompt. The \(20\pi\) was the area of a different circle, not the circumference. Let me re-evaluate.”
3. Alternative Strategy Exploration
The model discards the old path and restarts with the correct data.
Reasoning Token Stream (continued)
“…If the area is \(20\pi\), then \(\pi r_1^2 = 20\pi\), so \(r_1 = \sqrt{20}\). The prompt says the second radius is double the first…”
4. Final Verification
Before outputting, the model checks for internal consistency.
Reasoning Token Stream (continued)
“…So \(r_2 = 2\sqrt{20}\). Area = \(\pi(2\sqrt{20})^2 = 80\pi\). Does this make sense? Yes, \(80\pi\) is \(4\times\) the original area, matching the square of the radius doubling. Final Answer: \(80\pi\).”
OpenAI's o1 and subsequent models utilise chain of thought as a core mechanism within reasoning.
Claude 3.7 Sonnet introduced extended-thinking (visible chain-of-thought) reasoning in 2025.
DeepSeek-R1 utilises chain of thought
Gemini Deep Think is a reasoning mode designed for hard problems in mathematics, coding, planning, and science.
Rather than producing one immediate answer, it uses:
Extended inference-time compute
More “thinking time” before answering.
Parallel thinking
Multiple hypotheses or solution paths are explored and compared.
Critique and revision
Candidate answers can be checked, refined, or rejected.
RL-shaped reasoning behaviour
Google describes novel reinforcement learning techniques that encourage the model to use longer, multi-step reasoning paths.
DeepMind’s Gemini Deep Think model wins Gold Medal at the International Mathematical Olympiad (IMO)
Deep Think can be viewed as moving from answer prediction toward policy improvement over reasoning traces.
| RL concept | Deep Think analogy |
|---|---|
| State | Current problem + partial reasoning context |
| Action | Generate, branch, critique, revise, verify |
| Reward | Correctness, coherence, problem-solving success |
| Policy | Model’s learned strategy for choosing reasoning steps |
| Search | Parallel exploration of possible solution paths |
Anthropic’s Mythos model utilises human supervised chain of thought
Mythos is useful for penetration testing in cyber security
Utilises human supervised reasoning traces from chains of thought as a training target
Human supervision steers the model toward which vulnerabilities would be genuinely serious, dangerous bug classes, and potential security flaws
The model is trained to learn meaningful exploit hypotheses
Classical planning was originally utilised for penetration testing, through the development of PDDL represented attack models.
Human supervised chain of thought models can create a safety challenge, as the model is trained to generate reasoning traces that humans expect
The risk is that the model learns to produce reasoning that looks acceptable rather than reasoning that is fully faithful
This can lead to what appears from a human perspective to be deceptive behaviour by the model, even though it makes sense from the attention-based transformer’s loss perspective.
Faithfulness of chain-of-thought reasoning under human supervision remains an unresolved problem
There is a key tension:
Performance: reasoning-trace supervision can improve capability (e.g. decomposition, verification, structure)
Faithfulness: the trace may not reflect the model’s true internal reasoning
A reasoning trace can be useful, plausible and reward-optimised without being causally faithful, i.e. improving reasoning \(\neq\) ensuring faithful explanations
DeepMind released Gemma 4 on 2 April, 2026.
Performance: The 31B model acts as a “server-grade” model for local use, while smaller variants (2B/4B) are optimised for mobile, edge and on-device.
Agentic Capabilities: Built for multi-step planning, autonomous actions, and tool use.
Multimodal: Natively supports text, images, and audio, along with dynamic vision resolution.
Large Context: Supports up to 256,000 token context windows
Open-Source: Released under the Apache 2.0 licence
Knowledge distillation is a key element in training Gemma 4.
Major paradigms are emerging which reflect a convergence of RL with autoregression and transformer models.
Multi Actor-Critic Transformers (online, no dynamics).
World-Model RL (planning, includes dynamics)
Vision (multi-modal)
Agentic (roles)
Actor-critic transformers are RL agents that use transformer architectures to parameterise the actor (policy) and the critic (value function).
Actor-critic transformers essentially outperform Long Short-Term Memory (LSTM) networks in long-horizon Partially Observable Markov Decision Processes (POMDPs)
| Layer | Role (e.g. of an Agent or Robot) | Typical tools |
|---|---|---|
| Cognition / Reasoning | Query answering, Programming, multi-step thinking & planning: goal decomposition, tool selection, self-reflection, backtracking & correction, safety checks etc. | LLM with serialized Chain of Thought (GPT, Gemini, Claude, DeepSeek - Thinking, Programming & Operator modes; OpenClaw, MiniMax, etc.) |
| Semantic Policy (Vision–Language–Action) | Grounds instructions & scene into actionable subgoals / waypoints | Vision-Language-Action (VLA) Transformers (RT-X / RT-2-X) |
| Control / Dynamics | Execute precise motions, stabilize, react to dynamic feedback | Generative Model-based RL (STORM) |
Agent decomposition can itself be learned through chain of thought reasoning
Attention-based transformers can infer useful sub-agents, subgoals, or roles from the serialisation of context in the token stream
In this sense, agentic structure need not be engineered entirely by hand; it can be learned from human data, experience or self-supervision (simulation).
OpenAI’s Codex CLI serves as a primary example of how Chain-of-Thought (CoT) prompts can naturally transition into Agent Decomposition for software tasks.
Architectural Training:
Prompting by Example: Use few-shot prompting \(+\) examples of successful system architectures to “train” the model on how to delegate.
Reasoning-to-Execution: CoT allows the agent to think (e.g., “I need a database schema first, then the API agents”) before the CLI executes the file creation.
Feedback Loop:
Self-Correction: The agent can inspect CLI error outputs to refine its decomposition logic in real-time.
Modularity: High-level CoT ensures each module is built in isolation, mimicking a multi-agent environment within a single interface.
“The conversation with the CLI essentially orchestrates a system of sub-agents directed by a serialised chain-of-thought reasoning process.”
The decomposition of agents into multiple sub-agents, together with their respective roles, coordination mechanisms, and communication protocols remains an important challenge in the agentic approach.
Research and development into multi-agent systems has a long and fertile history in artificial intelligence
OpenClaw is a framework for training LLM-based agents, such as Codex CLI/Claude, online, from normal usage.
Core idea: after each action, the agent observes the next state:
These next states are treated as RL feedback signals.
OpenClaw-RL allows agents to learn and improve simply by talking to them
OpenClaw has experienced record-breaking growth since its launch in November 2025
OpenClaw frames agent interaction as a sequential Markov Decision Process (MDP):
OpenClaw reframes ordinary agent interaction as an MDP-like loop - rather than relying only on static preference datasets, it learns from what actually happens after the model acts.
Key RL idea: learn from experience, not just offline data. It utilises two kinds of signal from the next state:
OpenClaw’s emphasis is on a fully asynchronous setup:
all run concurrently
It connects LLM agents back to standard RL ideas
Limitation:
OpenClaw uses tools, shells, and GUIs.
MiniMax is an RL framework for training agentic foundation models at scale.
Main concern: scaling RL while balancing three competing goals:
The emphasis is less on a single elegant RL algorithm, and more on:
Reinforcement learning is being pushed into messy, long-horizon productivity tasks:
Key insight:
MiniMax can be used by no-code agent platforms such as MindStudio
RL can optimise task decomposition and tool-use policies for MiniMax / Forge
Comparison with classical RL:
Sequential decision-making is in the ascendant:
RL, model-based control, and planning-like reasoning are central to agents, robotics, and tools built on tensor-based (transformer) architectures.
So “planning” and “RL” must live inside parallel, scalable neural network systems.
Limiting assumptions:
Compute & tooling:
Tree search doesn’t map cleanly onto GPU/TPU throughput the way dense tensor operators do
Differentiability matters for end-to-end training, credit assignment, and integration with deep stacks
Nvidia’s liquid cooled GB200 Grace Blackwell (Tensor Core) Superchip can connect up to 576 Blackwell GPUs in a single domain with over 1 PB/s total bandwidth (image by 极客湾Geekerwan, CC BY 3.0, Link)
What is required to run RL & Planning on a GPU?
- During training?
- At inference (query time)?
Greedy next-step choice using max
Soft (Entropy-Regularized) Bellman Backup - “softmax” \[ \begin{align*} Q_{\text{soft}}(s,a) & = r(s,a) + \gamma \,\mathbb{E}_{s'\sim P}\!\left[ V_{\text{soft}}(s')\right] \quad \\[0pt] V_{\text{soft}}(s) & = \mathbb{E}_{a'\sim \pi(\cdot|s)}\!\left[ Q_{\text{soft}}(s,a') - \alpha \log \pi(a'|s)\right] \end{align*} \]
Adds entropy bonus (temperature \(\alpha\)) ⇒ softens the hard max.
As \(\alpha\!\to\!0\): \(V_{\text{soft}}(s)\!\to\!\max_{a'} Q(s,a')\) (recovers hard backup).
A min over two critics, \(\min\{Q_{\theta_1},Q_{\theta_2}\}\), is often used to reduce overestimation bias (Double-Q trick), not as the backup operator.
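An illustrative numpy sketch of the soft value for a discrete action set (with the Boltzmann policy \(\pi(a|s)\propto\exp(Q(s,a)/\alpha)\)), showing it approaches the hard max as \(\alpha\to 0\):

```python
import numpy as np

def soft_value(Q, alpha):
    """Soft (entropy-regularised) state value for a discrete action set:
    V_soft(s) = E_{a ~ pi}[ Q(s,a) - alpha * log pi(a|s) ],
    with pi the Boltzmann policy pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    logits = Q / alpha
    logits -= logits.max()                       # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()
    return float(np.sum(pi * (Q - alpha * np.log(pi))))

Q = np.array([1.0, 2.0, 3.0])                    # Q(s, a) for three actions
for alpha in [1.0, 0.1, 0.01]:
    print(alpha, soft_value(Q, alpha))           # approaches max(Q) = 3.0 as alpha -> 0
print(Q.max())
```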
Sony-AI’s Ace player is trained entirely on single shots in simulation using the Soft Actor-Critic (SAC) algorithm to return the ball with a desired skill, beating elite players.
Stochastic Transformer-based wORld Model (STORM) targets the Atari 100k benchmark games, like MuZero.
A world model-based RL method for learning in imagination (simulation inside a learned world model)
It combines:
Imagination: Latent rollouts in STORM
Repeating the steps in imagination forms a trajectory.
This allows the agent to learn a policy (how to act) or a value function (how “good” a state is) purely by practising inside its own “head”.
Imagination is technically known as latent rollouts, a widely accepted approach for generative model-based reinforcement learning.
Transformers are strong at long-range sequence modelling.
Stochastic latents help capture uncertainty and non-determinism
STORM achieves 126.7% mean human-normalised score on the Atari 100k benchmark, a state-of-the-art result among methods without look-ahead search.
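To make the imagination loop above concrete, here is a hypothetical sketch of a latent rollout; `world_model` and `policy` are placeholder objects with assumed interfaces (`encode`, `step`, `sample_action`), not STORM's actual API:

```python
# Hypothetical sketch of "learning in imagination" via latent rollouts.
def imagine_trajectory(world_model, policy, first_obs, horizon):
    """Roll the learned dynamics forward in latent space: starting from a real
    observation, repeatedly sample an action and predict the next latent state
    and reward, producing an imagined trajectory for actor-critic training."""
    z = world_model.encode(first_obs)           # latent state z_t
    trajectory = []
    for _ in range(horizon):
        a = policy.sample_action(z)             # act inside the model's "head"
        z_next, r = world_model.step(z, a)      # predicted next latent and reward
        trajectory.append((z, a, r))
        z = z_next
    return trajectory
```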
MuZero branches over actions with MCTS, whereas STORM rolls forward imagined latent sequences in a transformer world model
| Aspect | MuZero | STORM |
|---|---|---|
| Core idea | Combines learning + Monte-Carlo Tree Search (MCTS) in latent space | Learns a stochastic transformer world model and performs imagination rollouts |
| Planning form | Expands a search tree: \(s_0 \to s_1, s_2, \ldots\) | Rolls out a latent sequence using a transformer world model: \(z_t \to z_{t+1} \to z_{t+2} \to \cdots\) |
| Model components | Representation \(h(o_t)\), Dynamics \(g(s,a)\), Prediction \(f(s)\) | Tokenizer / latent encoder, stochastic transformer dynamics model, reward/value/policy heads |
| Computation | Search-based, often CPU-heavy, not end-to-end differentiable through search | GPU-friendly, transformer-based, trained through batched imagined trajectories |
| Learning loop | Tree search generates improved policies; network distils them via supervised losses | Actor–critic trained on imagined trajectories generated by the transformer world model |
| Search structure | Discrete branching, value backups | Sequential imagination, typically no explicit branching tree search |
| Output policy | Derived from visit counts in the search tree | Learned directly through gradient updates from imagined rollouts |
| Analogy | “Plan by explicit search” | “Learn through transformer imagination” |
PPO is becoming popular for Agentic AI modes, and is used in
PPO is also becoming popular for World Models in robotics, and is used in
PPO treats the LLM as a policy over tokens and updates it using reward feedback.
The purpose is to make better responses more likely, but without letting the model drift too far in a single update. This makes reinforcement-learning-based fine-tuning much more stable.
PPO is a stability-oriented way to nudge an LLM toward higher-reward behaviour.
Goal: Stable, sample-efficient policy improvement
Idea: Constrain how far the new policy moves from the old one at each update of the actor-critic cycle
From a policy gradient perspective, the true objective function \(J^{{\pi}_{\theta}}(\theta)\) over policy \(\pi_{\theta}\) is defined as follows
\[ J^{{\pi}_{\theta}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} [G_0]=V^{\pi_{\theta}}(d_0) \]
where \(d_0\) is the distribution over initial states at the start of each episode \(\tau\)
However, there are two problems in the actor-critic setting:
1. Computing \(J^{{\pi}_{\theta}}(\theta)\) exactly would require integrating over all trajectories, \(\tau\), of the current policy, which is impractical
2. If we update the parameters \(\theta\), it will affect the objective value during the optimisation process, leading to (circular) feedback
We therefore need a surrogate objective independent of the trajectory distribution under the new policy \(\color{blue}{\pi_{\theta}}\) we are building
From the policy-gradient theorem, we can define the importance ratio
\[ r_t(\theta) \;=\; \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \qquad \]
We now define the surrogate objective \(L_{PG}\) for the true objective
\[ L_{PG}(\theta) \;=\; \mathbb{E}_t\!\big[\, r_t(\theta)\,\hat A_t \,\big] \]
Where \(\hat{A}_t\) captures how much better action \(a_t\) was than the state’s average
\(\hat{A}_t\) is an estimator of the true advantage function \(A^{\pi}\)
\(A^{\pi}(s_t,a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)\)
Kullback-Leibler (KL) divergence theory tells us we want improvement without overly large steps in policy space, so we define
\[ L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\! \left[ \min\!\Big( r_t(\theta)\,\hat A_t,\; \mathrm{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat A_t \Big) \right] \]
If \(r_t(\theta)\) leaves the interval \([1-\epsilon,\,1+\epsilon]\), the objective is clipped.
Typical range for \(\epsilon \in [0.1,\,0.2]\).
Prevents destructive updates while preserving ascent direction
The clipped surrogate objective in PPO plays a similar stabilising role to compatible function approximation — both constrain policy updates so that gradient estimates remain accurate and unbiased with respect to the true policy improvement direction.
\[ L^{\text{PPO}}(\theta) = \mathbb{E}_t\!\Big[{\color{blue}{L^{\text{CLIP}}(\theta)}} - c_1\,{\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}} + c_2\,\mathcal{H}\!\left[\pi_\theta(\cdot\mid s_t)\right] \Big] \] \(\;\;\;\;\;\;\;\;\;\;\;\;\;\;\)where \(c_1, c_2\) are coefficients
The actor’s policy gradient (surrogate objective) is \(\color{blue}{L^{CLIP}(\theta)}\)
The critic’s value-function loss is \(\color{red}{\big(V_\theta(s_t)-V_t^{\text{target}}\big)^2}\)
The entropy bonus \(\mathcal{H}\) encourages exploration
\[ \mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big] = - \sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t) \]
The entropy term encourages exploration by rewarding stochastic (uncertain) policies.
It’s high when the policy is uncertain or “spread out” (exploratory).
It’s low when the policy is confident or deterministic.
The dot “\(\cdot\)” in \(\pi_{\theta}(\cdot | s_t)\) means over all possible actions, i.e. the vector of probabilities \(\pi_{\theta}(a_1 \mid s_t), \pi_{\theta}(a_2 \mid s_t), \ldots\)
In practice this maintains stochasticity until policy becomes more confident or deterministic.
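An illustrative numpy sketch of the combined PPO objective (clipped surrogate, value-error penalty, entropy bonus); coefficient values are illustrative, and in practice this is computed with an autodiff framework on batched tensors:

```python
import numpy as np

def ppo_objective(logp_new, logp_old, advantages, values, value_targets,
                  action_probs, eps=0.2, c1=0.5, c2=0.01):
    """PPO objective (to be maximised): clipped surrogate, minus the critic's
    squared value error, plus an entropy bonus. Inputs are per-timestep arrays;
    action_probs has shape (T, num_actions)."""
    ratio = np.exp(logp_new - logp_old)                   # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    l_clip = np.minimum(ratio * advantages, clipped * advantages).mean()
    value_loss = ((values - value_targets) ** 2).mean()   # critic term
    entropy = -(action_probs * np.log(action_probs + 1e-12)).sum(axis=-1).mean()
    return l_clip - c1 * value_loss + c2 * entropy
```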
In practice, PPO uses a low-variance, low-bias estimate of the advantage \(A^\pi(s_t,a_t)\).
TD error: \[ \delta_t \;=\; r_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t) \]
GAE-\(\lambda\): \[ \hat A_t^{(\lambda)} \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l} \;=\; \delta_t + \gamma\lambda\,\delta_{t+1} + (\gamma\lambda)^2\,\delta_{t+2} + \cdots \]
Return/target used for critic \[ \hat V_t^{\text{target}} \;=\; \hat A_t^{(\lambda)} + V_\phi(s_t) \]
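A minimal numpy sketch of GAE-\(\lambda\) and the critic targets, using the backward recursion \(\hat A_t = \delta_t + \gamma\lambda\,\hat A_{t+1}\) (illustrative values only):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation: a discounted sum of TD errors
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), weighted by (gamma*lam)^l.
    `values` contains one extra entry for the bootstrap value V(s_T)."""
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                 # backward recursion
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    value_targets = advantages + values[:-1]     # critic regression targets
    return advantages, value_targets

rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.2])          # includes bootstrap V(s_T)
adv, targets = gae(rewards, values)
```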
PPO Algorithm
Repeat
\(\;\;\;\) Collect trajectories with \(\pi_{\theta_{\text{old}}}\)
\(\;\;\;\) Compute returns and advantages using GAE-\(\lambda\)
\(\;\;\;\) Optimise \(L^{\text{CLIP}}\) for \(K\) epochs over mini-batches
\(\;\;\;\) Update old params: \(\theta_{\text{old}} \leftarrow \theta\)
Until a stop condition holds (e.g., total timesteps \(\geq T\), or moving-average return \(\geq R_{\text{target}}\), or max iterations reached)
First-order solution method
Trust-region-like behaviour via clipping
Robust across discrete/continuous control
DeepSeek’s R1 uses Group Relative Policy Optimization (GRPO), which forms the core of the reasoning process and operates over the reasoning token stream.
Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv:2402.03300v3, pp1-30, 2024
DeepSeek-R1 uses a “pure RL” approach (DeepSeek-R1-Zero) that relies on self-supervised trial and error, without the need for human supervision, rather like MuZero.
GRPO can be used for knowledge distillation, where a student model mimics a teacher model via supervised fine-tuning (SFT). The “Teacher” model acts as the Reward Function.
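A minimal sketch of the group-relative advantage used by GRPO, as described in Shao et al. (2024): sample a group of responses for the same prompt and normalise each reward against the group statistics, so no learned critic is needed (illustrative rewards below):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: score each sampled response relative to the
    group mean, normalised by the group standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. 4 sampled reasoning traces for one maths problem, reward = 1 if correct
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```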
Direct Preference Optimisation (DPO)
Rafailov et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , arXiv:2305.18290v3, pp1-27, 2023
While GRPO & PPO dominate online RL for reasoning, DPO remains the industry standard for offline preference alignment
Offline training requires high-quality preference datasets, i.e. chosen vs. rejected response pairs.
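An illustrative sketch of the DPO loss from Rafailov et al. (2023) on a batch of chosen/rejected pairs, given per-response log-probabilities under the policy and a frozen reference model (array inputs are assumptions for illustration):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: increase the policy's reference-relative log-probability margin
    for chosen responses over rejected ones; -log(sigmoid(x)) is computed stably
    as log(1 + exp(-x))."""
    chosen_margin = np.asarray(logp_chosen) - np.asarray(ref_logp_chosen)
    rejected_margin = np.asarray(logp_rejected) - np.asarray(ref_logp_rejected)
    logits = beta * (chosen_margin - rejected_margin)
    return float(np.mean(np.logaddexp(0.0, -logits)))
```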
| Foundation Model | RL/Query Optimiser | Example |
|---|---|---|
| Attention-based transformer / LLM | PPO, DPO, GRPO / CoT | GPT, Gemini, Claude, DeepSeek-R1 - Thinking, Programming, Operator and Computer Use modes; OpenClaw-RL etc. |
| Attention-based transformer / Agentic & Vision + Language + Action (VLA) | PPO | OpenClaw-RL, MiniMax, VLA, RT-X |
Chain of thought (CoT)
Importance Weighted Actor-Learner Architecture (IMPALA)
In production in DeepMind, OpenAI & Google DeepRL
Decoupled actor-learner: many CPU actors generate trajectories under behaviour policy \(\mu\); a central GPU learner updates \(\pi\).
High throughput via batched unrolls (e.g., length \(n\)); supports RNNs (LSTM) and multi-task.
Challenge: policy lag \(\rightarrow\) off-policy data.
Solution: V-trace targets for stable off-policy learning.
Off-policy with correction that handles policy lag without sacrificing throughput
IMPALA was developed at DeepMind
Let importance ratios \(\displaystyle \rho_t=\min\!\left(\bar{\rho}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\)
\(\displaystyle c_t=\min\!\left(\bar{c}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\) with \(\bar{\rho}\ge \bar{c}\)
Value target (per time \(s\))
\[
\delta_t^{V} \;=\; \rho_t\Big(r_t + \gamma\,V(x_{t+1}) - V(x_t)\Big),\;\;v_s \;=\; V(x_s) \;+\; \sum_{t=s}^{s+n-1} \gamma^{\,t-s}
\!\left(\prod_{i=s}^{t-1} c_i\right)\! \delta_t^{V}
\]
Policy gradient with V-trace advantage
\[
A_t^{\text{V-trace}} \;=\; r_t + \gamma\,v_{t+1} - V(x_t), \qquad
\nabla_\theta J \;\propto\; \rho_t\,\nabla_\theta \log \pi_\theta(a_t|x_t)\,A_t^{\text{V-trace}}
\]
Loss (typical)
\[
\mathcal{L} \;=\; \underbrace{\mathbb{E}\big[(v_s - V(x_s))^2\big]}_{\text{value}}
\;-\; \beta\,\underbrace{\mathbb{E}\big[\rho_t \log \pi(a_t|x_t)\,A_t^{\text{V-trace}}\big]}_{\text{policy}}
\;-\; \eta\,\underbrace{\mathbb{E}\big[\mathcal{H}(\pi(\cdot|x_t))\big]}_{\text{entropy}}
\]
Why it works
Clipped IS ratios \((\rho_t, c_t)\) tame variance/bias;
Multi-step correction handles policy lag without sacrificing throughput.
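A minimal numpy sketch of the V-trace targets defined above; in IMPALA this is computed on the learner over batched unrolls, so the scalar loop here is purely illustrative:

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_s for an n-step unroll: clipped importance
    ratios rho_t and c_t weight the TD errors and their backward accumulation,
    correcting for the actor-learner policy lag."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)         # V(x_s), ..., V(x_{s+n-1})
    rhos = np.minimum(rho_bar, np.exp(log_rhos))     # clipped rho_t
    cs = np.minimum(c_bar, np.exp(log_rhos))         # clipped c_t
    values_ext = np.append(values, bootstrap_value)  # append V(x_{s+n})
    deltas = rhos * (rewards + gamma * values_ext[1:] - values_ext[:-1])
    vs = values.copy()
    acc = 0.0
    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] += acc
    return vs
```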
| Category | Example algorithms | Key strengths |
|---|---|---|
| On-policy | PPO | Stable, parallelizable, easy; standard in LLM fine-tuning (RLHF) |
| Off-policy (stochastic) | SAC | Maximum-entropy objective → robust exploration; excellent data efficiency |
| Distributed | IMPALA, V-trace | Massive scalability; production in DeepMind, OpenAI, Google DeepRL |
| Dimension | MuZero / Sampled MuZero / EfficientZero | PPO / SAC / IMPALA |
|---|---|---|
| Sample efficiency | Excellent when planning can reuse a model (Atari, board games) | High for off-policy (SAC); moderate for PPO |
| Wall-clock / GPU efficiency | Poor (search is serial & CPU-bound) | Very good (fully parallel on GPU) |
| Robustness & stability | Sensitive to model errors / rollout length | Stable with tuned hyper-parameters |
| Scalability to real-time tasks | Hard (search latency) | Good; used in robotics, continuous control, large-scale RL (IMPALA, V-trace) |
| Best-case performance | Outstanding in structured domains (Go, Atari) | State-of-the-art in most continuous-control and real environments |
Since ~2020, attention-based Transformers have started competing with, and often surpassing, CNNs in large-scale vision benchmarks.
Image \(\rightarrow\) patches \(\rightarrow\) tokens \(\rightarrow\) transformer
Patchify the image: split an image of size \(H \times W \times C\) into non-overlapping patches of size \(P \times P\).
Number of tokens \(N=\frac{HW}{P^2}\).
Linear patch embedding: flatten each patch \(x_i\in\mathbb{R}^{P^2 C}\) and project
\(z_i^0 = W_E x_i + b_E \in \mathbb{R}^D\).
(Often implemented as a conv with kernel/stride \(P\).)
Add a class token and positions: prepend \([\mathrm{CLS}]\) and add learnable positions
\(\tilde{z}_i^0 = z_i^0 + p_i\), with sequence \([\tilde{z}_{\text{CLS}}^0, \tilde{z}_1^0,\ldots,\tilde{z}_N^0]\).
Transformer encoder stack (repeated \(L\) times):
\(\text{SA}(X) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\) with multi-head self-attention,
then MLP; both with residuals + layer norm.
Prediction head: take the final \([\mathrm{CLS}]\) (or pool all tokens) \(\to\) linear head \(\to\) class probabilities.
Note: smaller \(P\) \(\Rightarrow\) more tokens (detail ↑, cost ↑); larger \(P\) \(\Rightarrow\) fewer tokens (detail ↓, cost ↓).
Variants like Swin use local windows with shifts for scalability; ViT uses global attention.
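A minimal numpy sketch of the patchify step described above (the illustrative 224×224×3 image and \(P=16\) are assumptions, not tied to any specific ViT checkpoint):

```python
import numpy as np

def patchify(image, P):
    """Split an H x W x C image into N = HW/P^2 non-overlapping P x P patches
    and flatten each into a vector of length P*P*C (the ViT tokenisation step)."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    patches = image.reshape(H // P, P, W // P, P, C)       # block the pixel grid
    patches = patches.transpose(0, 2, 1, 3, 4)             # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)                  # (N, P^2 * C)

image = np.random.default_rng(0).normal(size=(224, 224, 3))
tokens = patchify(image, P=16)
print(tokens.shape)        # (196, 768): N = 224*224 / 16^2 = 196 tokens
# A learned linear projection W_E then maps each flattened patch to R^D.
```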
RT-X:
Increasingly transformers are also being used for robotics (e.g. RT-1, RT-2, RT-X Google DeepMind)
“RT-family” includes hybrid attention across vision, language, and control.
Transformers: Vaswani et al. (2017), Attention Is All You Need NeurIPS 2017 arXiv:1706.03762
PPO: Schulman et al. (2017), Proximal Policy Optimization Algorithms arXiv:1707.06347
DPO: Rafailov et al. (2023), Direct Preference Optimization NeurIPS 2023 arXiv:2305.18290
GRPO: Shao et al. (2024), DeepSeekMath arXiv:2402.03300
DeepSeek-R1 Guo et al. (2025) DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning, Nature https://www.nature.com/articles/s41586-025-09422-z
IMPALA: Espeholt et al. (2018), IMPALA ICML 2018 arXiv:1802.01561
STORM: Zhang et al. (2023), STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning NeurIPS 2023 arXiv:2310.09615
OpenClaw-RL: Wang et al. (2026), OpenClaw-RL: Train Any Agent Simply by Talking arXiv:2603.10165
Ace (Sony-AI) Durr et al., (2026) Outplaying elite table tennis players with an autonomous robot, Nature https://www.nature.com/articles/s41586-026-10338-5
PDDL with LLMs Silver et al. (2024) Generalized planning in PDDL domains with pretrained large language models https://ojs.aaai.org/index.php/AAAI/article/view/30006