12 Foundation, World and Transformer Models (Advanced Topic)
Slides
This module is also available in the following versions
Model-Based Reinforcement Learning
Last Module: Actor-critic (policy gradient)
Previous Modules: Value function approximation using deep learning
This Module: Combining deep learning, tree search and transformer models
AlphaGo, AlphaZero & MuZero
AlphaGo - Core Idea
First system to combine deep learning and tree search for superhuman play.
Domain: Go only.
Pipeline integrates:
- Supervised learning from expert games of human players
- Reinforcement learning through self-play
- Monte Carlo Tree Search (MCTS) guided by neural networks
AlphaGo - Neural networks & losses
| Neural Network | Description |
|---|---|
| Policy Network (\(\pi_1\)) | Trained by supervised learning on human expert moves; 13-layer CNN (Go board 19×19 × 48 planes) |
| Policy Network (\(\pi_2\)) | Refined by self-play RL (same architecture) |
| Value Network (\(v\)) | 13-layer CNN + 2 fully connected layers; outputs scalar win probability \(v(s)\) |
Training objectives: \[ \mathcal{L}_\pi = -\log \pi_\theta(a^\ast|s), \qquad \mathcal{L}_v = (v_\phi(s)-z)^2 \] where \(z \in \{-1,+1\}\) is the game outcome.
AlphaGo - Planning integration
MCTS uses:
Policy prior \(\pi_\theta(a|s)\) from policy network \(\pi_1\) (parameters \(\theta\)) \(\rightarrow\) biases search toward likely moves
Value estimate \(v_\phi(s)\) from value network \(v\) (parameters \(\phi\)) \(\rightarrow\) evaluate leaves
Move selection at the root (visit count \(N(s_0,a)\), temperature \(\tau\)):
\[ \pi_{\text{MCTS}}(a|s_0)\propto N(s_0,a)^{1/\tau} \]
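As an illustrative sketch of this rule (not AlphaGo's actual code), the temperature-scaled selection from root visit counts can be implemented as follows; `select_root_move` and its example inputs are hypothetical:

```python
import numpy as np

def select_root_move(visit_counts, tau=1.0, rng=None):
    """Sample a move from MCTS root visit counts, pi(a) proportional to N(s0, a)^(1/tau).

    visit_counts: array of N(s0, a) for each legal move (illustrative input).
    tau: temperature; tau -> 0 approaches greedy argmax, tau = 1 is proportional.
    """
    if rng is None:
        rng = np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau <= 1e-3:                      # near-zero temperature: play the most-visited move
        probs = np.zeros_like(counts)
        probs[np.argmax(counts)] = 1.0
    else:
        scaled = counts ** (1.0 / tau)
        probs = scaled / scaled.sum()
    return int(rng.choice(len(counts), p=probs)), probs

# Example: four legal moves with visit counts from 800 simulations
move, probs = select_root_move([500, 200, 80, 20], tau=1.0)
```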
Once \(\pi_2\) is trained through reinforcement learning, AlphaGo uses it to play millions of games against itself.
- Each game produces pairs \((s_t, z_t)\), where \(z_t \in \{-1,+1\}\) is the game outcome.
These are used to train the value network \(v_{\phi}(s_t)\) via regression:
\[ \min_{\phi} \; \bigl(v_{\phi}(s_t) - z_t\bigr)^2 \]
- So the value network learns to predict who will win from any board position that strong play (i.e., \(\pi_2\)) would reach.
Achieved 4–1 win vs Lee Sedol (2016)
AlphaZero - Unified Self-Play RL
Extends AlphaGo \(\rightarrow\) Go, Chess, Shogi
Removes human data and hand-crafted rollout policy
Fully self-play training loop
AlphaZero - Neural network and objective
Single residual CNN shared by policy + value
20 or 40 ResNet blocks, 256 filters, BatchNorm + Rectified Linear Unit (ReLU)
Input: stack of board planes (19×19×N)
Heads:
Policy head: 1 conv + 1 FC \(\rightarrow\) softmax over legal moves
Value head: 1 conv + 2 FC \(\rightarrow\) scalar \(v_\theta(s)\)
Loss: \[ \mathcal{L}(\theta)= (z-v_\theta(s))^2 -\pi_{\text{MCTS}}^\top\!\log\pi_\theta +c\|\theta\|^2 \]
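A minimal sketch of this objective for a single position, assuming the value prediction, policy logits, MCTS visit distribution and parameters are supplied as NumPy values (names and shapes are illustrative, not AlphaZero's implementation):

```python
import numpy as np

def alphazero_loss(v_pred, z, policy_logits, pi_mcts, params, c=1e-4):
    """Illustrative AlphaZero loss for one position:
    (z - v)^2 - pi_MCTS^T log pi_theta + c * ||theta||^2."""
    # Value term: squared error between predicted value and game outcome z in {-1, +1}
    value_loss = (z - v_pred) ** 2
    # Policy term: cross-entropy between MCTS visit distribution and network policy
    shifted = policy_logits - np.max(policy_logits)          # numerically stable log-softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    policy_loss = -np.dot(pi_mcts, log_probs)
    # L2 regularisation over all parameter arrays
    l2 = c * sum(np.sum(p ** 2) for p in params)
    return value_loss + policy_loss + l2

# Example with hypothetical values for a 3-move position
loss = alphazero_loss(v_pred=0.3, z=1.0,
                      policy_logits=np.array([2.0, 0.5, 0.1]),
                      pi_mcts=np.array([0.7, 0.2, 0.1]),
                      params=[np.ones((3, 3))])
```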
AlphaZero - Learning & Planning Loop
\[ \text{Network} \Rightarrow \text{MCTS} \Rightarrow \text{Self-play games} \Rightarrow \text{Network update} \]
MCTS: ~800 simulations per move
Network: ResNet trained via SGD on MCTS targets
Unified architecture simplified training → superhuman performance across games
MuZero - Learning to Plan Without Rules
AlphaZero still needed explicit game rules
MuZero learns a latent model of dynamics for planning without knowing rules
Same MCTS framework, but search happens in latent space
MuZero - Neural networks (3)
| Neural Network | Function | Notes |
|---|---|---|
| Representation \(h_\theta\) | Observation \(\rightarrow\) latent state \(s_0\) (learns (latent) state representation) | 6 ResNet blocks for Atari (pixels \(\rightarrow\) latent) |
| Dynamics \(g_\theta\) | Predicts \(s_{t+1},r_{t+1}\) from \((s_t,a_t)\) (learns model) | Small conv stack + reward head |
| Prediction \(f_\theta\) | Outputs policy \(p_t\) and value \(v_t\) from \(s_t\) | Two heads (softmax policy, scalar value) |
MuZero — TD-Style Learning (Bootstrapped Returns)
Unlike AlphaZero (which uses full-episode Monte Carlo targets),
MuZero trains its value network using n-step bootstrapped (TD) returns
For each step \(t\), the target value is: \[ \hat{v}_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n v_\theta(s_{t+n}) \]
Combines observed rewards and bootstrapped value from the predicted future state
Allows credit assignment across long horizons without waiting for episode termination
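A minimal sketch of this target computation, assuming the \(n\) observed rewards and the bootstrap value are already available from a stored trajectory (function and argument names are illustrative):

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma=0.997, n=5):
    """Compute the MuZero-style n-step target
    v_hat_t = sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n * V(s_{t+n}).

    rewards: the observed rewards r_t, ..., r_{t+n-1}.
    bootstrap_value: the network's value estimate V(s_{t+n}) at the cut-off state.
    """
    rewards = np.asarray(rewards[:n], dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards) + gamma ** len(rewards) * bootstrap_value)

# Example: 5 rewards followed by a bootstrapped value of 0.8
target = n_step_return([0.0, 0.0, 1.0, 0.0, 0.0], bootstrap_value=0.8)
```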
MuZero minimises a combined loss: \[ \mathcal{L} = \sum_k \Big[ (v_k - \hat v_k)^2 + (r_k - \hat r_k)^2 - \pi_k^\top \log p_k \Big] \]
Value loss: TD-style bootstrapped error
Reward loss: immediate reward prediction
Policy loss: cross-entropy with MCTS visit-count distribution
TD bootstrapping makes MuZero more sample efficient than AlphaZero
Planning (MCTS) provides strong policy/value targets; TD updates keep learning continuous
MuZero - Training and Integration
MCTS operates within the learned model:
\[s_{t+1},r_{t+1}=g_\theta(s_t,a_t)\]
Targets from MCTS train all three nets end-to-end
Loss: \[ \mathcal{L} =\sum_k\! \big[ (v_k-\hat v_k)^2 +(r_k-\hat r_k)^2 -\pi_k^\top\!\log p_k \big] \]
- Achieves AlphaZero-level play in Go/Chess/Shogi and strong Atari results from pixels
MuZero - Results and Significance
| Domain | Training time to superhuman level | Benchmark / Opponent | Notes |
|---|---|---|---|
| Chess | \(\approx\) 4 hours (on 8 TPUv3 pods) | Stockfish | Surpassed world-champion chess engine performance |
| Shogi | \(\approx\) 2 hours | Elmo | Surpassed leading professional Shogi engine |
| Go | \(\approx\) 9 hours | AlphaZero / KataGo | Matched AlphaZero’s superhuman play using only learned dynamics |
| Atari (57 games) | ~200M frames | Rainbow / IMPALA | Exceeded or matched best model-free RL baselines across games |
Key insights
- No rules given: MuZero learned dynamics, value, and policy purely from experience
- Unified algorithm: Same architecture and hyperparameters across all domains
- Planning efficiency: Performed Monte Carlo Tree Search (MCTS) entirely in latent space
- Sample efficiency: Achieved AlphaZero-level play within hours of self-play training
AlphaGo, AlphaZero & MuZero Comparison
| System | # of NNs | Architecture | Uses known rules? | Learns model? | Planning |
|---|---|---|---|---|---|
| AlphaGo | 2 (policy + value) | 13-layer CNNs | \(\checkmark\) | \(\text{✗}\) | MCTS with rules |
| AlphaZero | 1 (shared policy-value ResNet) | 20–40 ResNet blocks | \(\checkmark\) | \(\text{✗}\) | MCTS with rules |
| MuZero | 3 (\(h,g,f\) modules) | ResNet latent model | \(\text{✗}\) | \(\checkmark\) | MCTS in latent space |
Foundation, World and Transformer Models
Three Layer Agent Stack
| Layer | Role (e.g. of an Agent or Robot) | Typical tools |
|---|---|---|
| Cognition / Reasoning | Query answering, Programming, Multi-step thinking: goal parsing, task decomposition, tool selection, hypothesis & plan revision, safety checks | LLM w/ CoT / ToT/GoT, value-guided decoding, process rewards |
| Semantic Policy (Vision–Language–Action) | Ground instructions & scene into actionable subgoals / waypoints | RT-X / RT-2-X (VLA Transformer), affordance & object-centric models |
| Control / Dynamics | Execute precise motions, stabilize, react to feedback | Dreamer V3 / TD-MPC2 / Diffusion Policy (model-based or policy learning) |
CoT: Chain of Thought
ToT: Tree of Thought
GoT: Graph of Thought
VLA: Vision–Language–Action Transformer
MPC: Model Predictive Control
Modern Trends
Sequential decision-making is on the ascendant: RL, model-based control, and planning-like reasoning are central to agents, robotics, and tools built on transformer architectures, so “planning” and “RL” must live inside differentiable, scalable systems.
Limiting assumptions: the typical teaching setup of classical planning and Reinforcement Learning (fully observable, deterministic, stationary, discrete) is a poor match for many modern AI settings.
Compute & tooling: Tree search doesn’t map cleanly onto GPU/TPU throughput the way dense tensor ops do; differentiability matters for end-to-end training, credit assignment, and integration with deep stacks.
Requirements of RL for Foundation and World Models
| Requirements of RL (additive) | Approach / System |
|---|---|
| Non-deterministic (Stochastic) | Policy Gradient |
| Non-stationary (Generalisation) | Value Approximation |
| Partially observable (Epistemic) | Actor-Critic |
| Differentiable (Nnet/GPU integration) | SAC, Dreamer V3 |
| Distributed (Industrial Scaling) | IMPALA, V-trace |
| Agentic (Layers) | RT-X, LLM, World Models |
We explore the requirements of differentiable, distributed, and agentic RL needed for foundation and world models
Differentiable Planning: Soft Actor Critic (SAC)
Greedy next-step choice using max; defines the Bellman optimality operator used in Q-learning/DQN. \[ Q^*(s,a) = r(s,a) + \gamma \,\mathbb{E}_{s'\sim P}\!\left[\max_{a'} Q^*(s',a')\right] \]
Soft (Entropy-Regularized) Bellman Backup - “softmax” \[ \begin{align*} Q_{\text{soft}}(s,a) & = r(s,a) + \gamma \,\mathbb{E}_{s'\sim P}\!\left[ V_{\text{soft}}(s')\right] \quad \\[0pt] V_{\text{soft}}(s) & = \mathbb{E}_{a'\sim \pi(\cdot|s)}\!\left[ Q_{\text{soft}}(s,a') - \alpha \log \pi(a'|s)\right] \end{align*} \]
Adds entropy bonus (temperature \(\alpha\)) ⇒ softens the hard max.
As \(\alpha\!\to\!0\): \(V_{\text{soft}}(s)\!\to\!\max_{a'} Q(s,a')\) (recovers hard backup).
Implementation note (SAC): a min over two critics, \(\min\{Q_{\theta_1},Q_{\theta_2}\}\), is often used to reduce overestimation bias (Double-Q trick), not as the backup operator.
This relaxation allows gradients to flow through the planning step.
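For intuition, a small discrete-action sketch of the soft backup, assuming the policy is the softmax induced by \(Q_{\text{soft}}\) so that \(V_{\text{soft}}(s) = \alpha \log\sum_a \exp(Q(s,a)/\alpha)\). SAC itself handles continuous actions with sampled expectations; this is illustrative only:

```python
import numpy as np

def soft_value(q_values, alpha=0.1):
    """Entropy-regularised state value for discrete actions:
    V_soft(s) = alpha * log sum_a exp(Q(s, a) / alpha),
    i.e. E_pi[Q - alpha * log pi] under the induced softmax policy.
    As alpha -> 0 this recovers max_a Q(s, a)."""
    q = np.asarray(q_values, dtype=np.float64)
    m = q.max()                                  # stabilise the log-sum-exp
    return m + alpha * np.log(np.sum(np.exp((q - m) / alpha)))

def soft_backup(reward, next_q_values, gamma=0.99, alpha=0.1):
    """One soft Bellman backup: Q_soft(s, a) = r + gamma * V_soft(s')."""
    return reward + gamma * soft_value(next_q_values, alpha)

# As alpha shrinks, the soft value approaches the hard max
print(soft_value([1.0, 2.0, 3.0], alpha=1.0))    # ~3.41, noticeably above the hard max
print(soft_value([1.0, 2.0, 3.0], alpha=0.01))   # ~3.00, recovers max_a Q
```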
Differentiable Planning: Dreamer V3
| Era | Key system | What it added | Influence on Dreamer V3 |
|---|---|---|---|
| 2019 | PlaNet (Hafner et al.) | Latent dynamics model (RSSM) + Cross-Entropy Method (CEM) planning in latent space | Showed model-based imagination from pixels works |
| 2020-22 | Dreamer V1–V2 | Replaced CEM with actor-critic training in imagination (no explicit planning), making everything differentiable | More efficient, easier to train on GPU |
| 2022-24 | TD-MPC / TD-MPC 2 (Hansen et al.) | Combined short-horizon latent Model Predictive Control (MPC) with TD learning, strong continuous-control results | Reinforced ideas of gradient-based MPC and temporal-difference consistency |
| 2023-24 | Dreamer V3 | Unified architecture: one RSSM world model + imagination-based actor-critic + robust scaling across domains | Synthesizes both model-based planning and policy-gradient RL advantages |
Dreamer V3 is developed by DeepMind.
Recurrent State Space Model (RSSM): Dreamer V3
Dreamer V3 uses a learned latent world model and imagination (planning during training) and is fully differentiable.
It uses a recurrent state space (RSSM) world model (both stochastic and recurrent) to imagine trajectories for policy/value learning.
Think of the RSSM as a latent recurrent simulator
An RSSM is implemented as a recurrent neural network (RNN) and unrolled through time, both during training and during “imagination”, to simulate trajectories
Once the RSSM is trained, Dreamer can roll out future trajectories purely in its imagination for policy and value learning
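A toy, purely illustrative sketch of imagination in a recurrent latent model; the random weights, dimensions and update rules below are placeholders, not Dreamer V3's actual RSSM:

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, A = 8, 4, 2          # deterministic size, stochastic size, action size (toy dims)

# Toy parameters standing in for learned RSSM weights (illustrative only)
W_h = rng.normal(scale=0.1, size=(H, H + Z + A))   # recurrent update
W_mu = rng.normal(scale=0.1, size=(Z, H))          # prior mean over next stochastic state
w_r = rng.normal(scale=0.1, size=H + Z)            # reward predictor

def imagine(h, z, policy, steps=5):
    """Roll out a trajectory purely in latent space: at each step the policy picks an
    action from (h, z), the toy 'dynamics' update h and sample z, and a reward is
    predicted. No environment interaction occurs, mirroring Dreamer-style imagination."""
    trajectory = []
    for _ in range(steps):
        a = policy(h, z)
        h = np.tanh(W_h @ np.concatenate([h, z, a]))          # deterministic recurrent state
        z = W_mu @ h + 0.1 * rng.normal(size=Z)               # stochastic latent sample
        r = float(w_r @ np.concatenate([h, z]))               # predicted reward
        trajectory.append((h, z, a, r))
    return trajectory

random_policy = lambda h, z: rng.normal(size=A)
traj = imagine(np.zeros(H), np.zeros(Z), random_policy)
```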
Comparison of MuZero with Dreamer V3
| Aspect | MuZero | Dreamer V3 |
|---|---|---|
| Core idea | Combines learning + Monte-Carlo Tree Search (MCTS) in latent space | Learns a Recurrent State-Space Model (RSSM) and performs differentiable imagination rollouts |
| Planning form | Expands a search tree: \(s_0 \rightarrow s_1, s_2, \dots\) | Rolls out a latent sequence: \((h,z)_t \rightarrow (h,z)_{t+1} \rightarrow (h,z)_{t+2} \dots\) |
| Model components | Representation \(h(o_t)\), Dynamics \(g(s,a)\), Prediction \(f(s)\) | RSSM with deterministic \(h_t\) and stochastic \(z_t\) states |
| Computation | Search-based, CPU-heavy, not fully differentiable | GPU-friendly, fully differentiable RNN unrolled through time |
| Learning loop | Tree search generates improved policies; network distils them via supervised losses | Actor–critic trained entirely on imagined trajectories from the RSSM |
| Search structure | Discrete branching, value backups | Sequential imagination, no branching |
| Output policy | Derived from visit counts in the search tree | Learned directly through gradient updates in imagination |
| Analogy | “Plan by explicit search” | “Plan by differentiable imagination” |
\(\textbf{MuZero: explicit look-ahead search} \quad\Longleftrightarrow\quad \textbf{Dreamer: {\color{blue}{continuous latent imagination}}}\)
Take-away summary
For most continuous-control, robotics, or fine-tuning tasks: actor-critic / policy-optimisation (PPO, SAC) are easier, faster, and competitive.
For structured combinatorial or look-ahead-heavy tasks: planning-based hybrids (MuZero, Sampled MuZero, EfficientZero) can still outperform, but with higher engineering and compute cost.
Trend: research is moving toward differentiable world models (DreamerV3) that keep MuZero’s model-based benefits while retaining the simplicity and efficiency of actor–critic learning—essentially bridging the two families.
Distributed RL: IMPALA & V-trace
Importance Weighted Actor-Learner Architecture (IMPALA)
In production in DeepMind, OpenAI & Google DeepRL
Decoupled actor-learner: many CPU actors generate trajectories under behaviour policy \(\mu\); a central GPU learner updates \(\pi\).
High throughput via batched unrolls (e.g., length \(n\)); supports RNNs (LSTM) and multi-task.
Challenge: policy lag \(\rightarrow\) off-policy data.
Solution: V-trace targets for stable off-policy learning.
Off-policy with correction that handles policy lag without sacrificing throughput
IMPALA was developed at DeepMind
Distributed RL: V-trace essentials
Let importance ratios \(\displaystyle \rho_t=\min\!\left(\bar{\rho}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\)
\(\displaystyle c_t=\min\!\left(\bar{c}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)\) with \(\bar{\rho}\ge \bar{c}\)
Value target (per time \(s\))
\[
\delta_t^{V} \;=\; \rho_t\Big(r_t + \gamma\,V(x_{t+1}) - V(x_t)\Big),\;\;v_s \;=\; V(x_s) \;+\; \sum_{t=s}^{s+n-1} \gamma^{\,t-s}
\!\left(\prod_{i=s}^{t-1} c_i\right)\! \delta_t^{V}
\]
Policy gradient with V-trace advantage
\[
A_t^{\text{V-trace}} \;=\; r_t + \gamma\,v_{t+1} - V(x_t), \qquad
\nabla_\theta J \;\propto\; \rho_t\,\nabla_\theta \log \pi_\theta(a_t|x_t)\,A_t^{\text{V-trace}}
\]
Loss (typical)
\[
\mathcal{L} \;=\; \underbrace{\mathbb{E}\big[(v_s - V(x_s))^2\big]}_{\text{value}}
\;-\; \beta\,\underbrace{\mathbb{E}\big[\rho_t \log \pi(a_t|x_t)\,A_t^{\text{V-trace}}\big]}_{\text{policy}}
\;-\; \eta\,\underbrace{\mathbb{E}\big[\mathcal{H}(\pi(\cdot|x_t))\big]}_{\text{entropy}}
\]
Why it works
Clipped IS ratios \((\rho_t, c_t)\) tame variance/bias;
Multi-step correction handles policy lag without sacrificing throughput.
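A minimal sketch of the V-trace target computation over one unroll, assuming rewards, value estimates and raw importance ratios are given. The names and the backward recursion follow the equations above; this is not the production IMPALA code:

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos, gamma=0.99,
                   rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets v_s for a length-n unroll.

    rewards, values: r_t and V(x_t) for t = s, ..., s+n-1.
    bootstrap_value: V(x_{s+n}) used to bootstrap the last step.
    rhos: raw importance ratios pi(a_t|x_t) / mu(a_t|x_t).
    """
    rewards, values, rhos = (np.asarray(x, dtype=np.float64)
                             for x in (rewards, values, rhos))
    values_next = np.append(values[1:], bootstrap_value)
    clipped_rho = np.minimum(rho_bar, rhos)
    clipped_c = np.minimum(c_bar, rhos)
    deltas = clipped_rho * (rewards + gamma * values_next - values)

    # Backward recursion: v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    acc = 0.0
    corrections = np.zeros_like(values)
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * clipped_c[t] * acc
        corrections[t] = acc
    return values + corrections

# Example with a hypothetical 3-step unroll
targets = vtrace_targets(rewards=[0.0, 1.0, 0.0], values=[0.2, 0.5, 0.4],
                         bootstrap_value=0.3, rhos=[1.3, 0.8, 1.1])
```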
Representative efficient actor-critic methods
| Category | Example algorithms | Key strengths |
|---|---|---|
| On-policy | PPO | Stable, parallelizable, easy; standard in LLM fine-tuning (RLHF) |
| Off-policy (stochastic) | SAC | Maximum-entropy objective → robust exploration; excellent data efficiency |
| Distributed | IMPALA, V-trace | Massive scalability; production in DeepMind, OpenAI, Google DeepRL |
Efficiency and Performance Comparison
| Dimension | MuZero / Sampled MuZero / EfficientZero | PPO / SAC / IMPALA |
|---|---|---|
| Sample efficiency | Excellent when planning can reuse a model (Atari, board games) | High for off-policy (SAC); moderate for PPO |
| Wall-clock / GPU efficiency | Poor (search is serial & CPU-bound) | Very good (fully parallel on GPU) |
| Robustness & stability | Sensitive to model errors / rollout length | Stable with tuned hyper-parameters |
| Scalability to real-time tasks | Hard (search latency) | Good; used in robotics, continuous control, large-scale RL (IMPALA, V-trace) |
| Best-case performance | Outstanding in structured domains (Go, Atari) | State-of-the-art in most continuous-control and real environments |
RL for Frontier & World Models
For training:
- PPO, DPO & GRPO
For querying:
Self-consistency
Tree of Thought (beam-style)
Maintain and progress frontier nodes in parallel
Value-Guided Decoding
Self-Consistency for LLM Reasoning (on Chains of Thought)
Idea (Wang et al., 2023):
Instead of trusting a single Chain-of-Thought (CoT), sample many diverse CoTs and aggregate their final answers.
- Majority (or verifier-weighted) agreement \(\approx\) more reliable reasoning.
Why it helps
Diversity \(\rightarrow\) reduces single-path errors/hallucinations.
Voting/verification \(\rightarrow\) filters spurious but fluent chains.
How paths are progressed
CoT: each path is a linear sampled chain.
ToT (Tree-of-Thought): expand multiple partial chains (branching), keep top beams.
GoT (Graph-of-Thought): allow branches to merge/reuse subresults; select best subgraph.
In all cases, progress = decode next step (sample or beam), prune with a heuristic/verifier, repeat until a stopping rule.
Practical settings
\(K\): \(10\!-\!40\) (cost vs. accuracy).
Extractor: robust regex/templates for the final answer.
Verifier: separate model or rules (units, constraints, tests).
Failure mode: consistent but wrong consensus \(\rightarrow\) add tools/checks (calculator, code, retrieval).
Self-consistency = ensemble of CoTs + aggregation; ToT/GoT generalize progression by branching/merging before voting or verification.
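A minimal sketch of the procedure, where `sample_cot` and `extract_answer` are hypothetical placeholders for an LLM sampled at non-zero temperature and a robust answer extractor:

```python
import random
from collections import Counter

def self_consistency(sample_cot, extract_answer, k=20):
    """Self-consistency sketch: sample K diverse chains of thought, extract each
    final answer, and return the majority answer with its vote share."""
    answers = []
    for _ in range(k):
        chain = sample_cot()                 # one sampled Chain-of-Thought
        answers.append(extract_answer(chain))
    counts = Counter(a for a in answers if a is not None)
    best, votes = counts.most_common(1)[0]
    return best, votes / k

# Example with a stubbed sampler: most sampled chains agree on "42"
answer, share = self_consistency(
    sample_cot=lambda: random.choice(
        ["... so the answer is 42"] * 8 + ["... the answer is 41"] * 2),
    extract_answer=lambda text: text.rsplit(" ", 1)[-1],
)
```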
Vision Transformers (ViTs)
Since ~2020, attention-based Transformers have started competing with and often surpassing CNNs on large-scale vision benchmarks.
Image \(\rightarrow\) patches \(\rightarrow\) tokens \(\rightarrow\) transformer
Patchify the image: split an image of size \(H\times W\times C\) into non-overlapping \(P\times P\) patches; the number of tokens is \(N=\frac{HW}{P^2}\).
Linear patch embedding: flatten each patch \(x_i\in\mathbb{R}^{P^2 C}\) and project \(z_i^0 = W_E x_i + b_E \in \mathbb{R}^D\) (often implemented as a conv with kernel size and stride \(P\)).
Vision Transformers - Tokens and Transformer Encoding
Add a class token and positions: prepend \([\mathrm{CLS}]\) and add learnable positions \(\tilde{z}_i^0 = z_i^0 + p_i\), giving the sequence \([\tilde{z}_{\text{CLS}}^0, \tilde{z}_1^0,\ldots,\tilde{z}_N^0]\).
Transformer encoder stack (repeated \(L\) times): multi-head self-attention \(\text{SA}(X) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\), then an MLP, both with residual connections and layer norm.
Prediction head: take the final \([\mathrm{CLS}]\) token (or pool all tokens) \(\to\) linear head \(\to\) class probabilities.
Note: smaller \(P\) \(\Rightarrow\) more tokens (detail ↑, cost ↑); larger \(P\) \(\Rightarrow\) fewer tokens (detail ↓, cost ↓).
Variants like Swin use local windows with shifts for scalability; ViT uses global attention.
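Putting the patchify, linear embedding, \([\mathrm{CLS}]\) and position steps above together, a minimal NumPy sketch (random matrices stand in for the learned \(W_E\) and positions; the encoder stack itself is omitted):

```python
import numpy as np

def patchify_and_embed(image, P=16, D=64, rng=None):
    """Turn an H x W x C image into N = HW / P^2 patch tokens, project each flattened
    patch (length P^2 * C) to a D-dimensional embedding, prepend [CLS] and add positions.
    The random projection stands in for the learned W_E, b_E (illustrative only)."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    # Reshape into (N, P*P*C): non-overlapping P x P patches, flattened
    patches = (image.reshape(H // P, P, W // P, P, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, P * P * C))
    W_E = rng.normal(scale=0.02, size=(P * P * C, D))
    b_E = np.zeros(D)
    tokens = patches @ W_E + b_E                                 # (N, D) patch embeddings
    cls = np.zeros((1, D))                                       # prepended [CLS] token
    pos = rng.normal(scale=0.02, size=(tokens.shape[0] + 1, D))  # positions (random stand-in)
    return np.vstack([cls, tokens]) + pos

tokens = patchify_and_embed(np.zeros((224, 224, 3)), P=16, D=64)   # -> (197, 64)
```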
RT-X Transformers in Robotics
RT-X:
Increasingly transformers are also being used for robotics (e.g. RT-1, RT-2, RT-X Google DeepMind)
- large-scale imitation across many robots.
“RT-family” includes hybrid attention across vision, language, and control.
- They utilise Vision–Language–Action (VLA) transformers
RL Fine-Tuning & Query Optimisation
| Foundation Model | RL/Query Optimiser | Example |
|---|---|---|
| Attention-based transformer / LLM | PPO, DPO, GRPO / CoT / ToT / GoT | Gemini 2.5, ChatGPT 3.5-5.0, ChatGPT Operator, Claude Computer Use, DeepSeek R1 |
| Attention-based transformer / Vision + Language + Action (VLA) | PPO | RT-X |
| RSSM / Control | Diffusion Policy | Dreamer V3 |
Chain of thought (CoT) / Tree of thought (ToT) & Graph of Thought (GoT)
Convergence of RL, Autoregression & Transformers
Implicit planning through attention inside world models
Transformers can simulate “multiple futures” inside the hidden state
- Self-attention makes “lookahead” possible
- Attention learns which futures matter
- Implicit planning is amortised planning
- The policy uses the world model’s implicit planning
Autoregression & joint distribution factorisation
Autoregression models a sequence by predicting each element from all previous elements:
\[ x_t = f(x_{1:t-1}) + \epsilon \]
Used in time series, sequence modelling, and autoregressive transformers.
Factorises a joint distribution as:
\[ p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}) \]
At inference time, predictions are fed back step-by-step to generate sequences.
Masking in transformers enforces this causal structure.
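A minimal sketch of the autoregressive generation loop, where `next_token_dist` is a hypothetical stand-in for the model's conditional \(p(x_t \mid x_{1:t-1})\):

```python
import numpy as np

def generate(next_token_dist, x_init, steps=10, rng=None):
    """Autoregressive generation: repeatedly sample x_t ~ p(x_t | x_{1:t-1}) and feed
    it back into the context, realising p(x_{1:T}) = prod_t p(x_t | x_{1:t-1})."""
    if rng is None:
        rng = np.random.default_rng(0)
    context = list(x_init)
    for _ in range(steps):
        probs = next_token_dist(context)
        x_t = rng.choice(len(probs), p=probs)
        context.append(int(x_t))            # feed the prediction back in
    return context

# Toy model over a 4-symbol vocabulary that slightly favours repeating the last symbol
def toy_model(context):
    probs = np.full(4, 0.1)
    probs[context[-1] % 4] += 0.6
    return probs / probs.sum()

sequence = generate(toy_model, x_init=[0])
```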
Masked latent transformers
A masked latent transformer is a transformer that models sequences of latent states, using attention masks to enforce causality or planning structure just like in autoregressive language models.
- It is a transformer that predicts (or refines) latent variables in a sequence, but only using allowed past or partial information, enforced through a mask.
Masked latent transformers appear in world-model RL, video generation, and planning-as-generation frameworks.
- They are increasingly used as a replacement for RNN/RSSM latent dynamics models in Dreamer-like systems.
Attention-based transformers (recap)
Transformers model sequences using self-attention, where each token computes weighted interactions with other tokens.
Key components:
Token embeddings and positional encodings
Self-attention:
\[ \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V \]
- Multi-head attention, feedforward layers, residual connections
Causal masking
Causal masking is used in autoregressive transformers:
Each token attends only to past tokens
Enforces the autoregressive condition
\[ x_t \sim p(x_t \mid x_{1:t-1}) \]
This masked attention mechanism directly carries over to masked latent transformers used in world-model RL.
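A minimal single-head sketch showing how a causal mask enforces this condition; the weights are random placeholders, not a trained model:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention with a causal mask: token t can only attend to
    tokens 1..t, enforcing x_t ~ p(x_t | x_{1:t-1})."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)   # True above the diagonal = future
    scores[mask] = -np.inf                               # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
T, D = 5, 8
X = rng.normal(size=(T, D))
out = causal_self_attention(X, *(rng.normal(size=(D, D)) for _ in range(3)))
```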
Masked latent transformers: LLMs versus World Models
In NLP transformers in large language models:
Tokens = words
Mask = causal mask (can’t see the future)
In masked latent transformers:
Tokens = latent states \(z_t\) (learned hidden representations)
Mask = ensures correct temporal, causal, or planning structure
Three major paradigms
Three major paradigms reflect the convergence of RL, autoregression and transformer models:
Decision Transformer (offline, no planning),
World-Model RL (planning, includes dynamics), and
Actor-Critic Transformers (online, no dynamics).
Case Study: Actor-Critic Transformers
Actor-critic transformers are RL agents that use transformer architectures to parameterize the actor, critic, or both, while still relying on Bellman equations and policy gradients for learning.
- Actor-critic transformers typically outperform LSTMs in long-horizon POMDPs
Comparison
| Dimension | Decision Transformer | World-Model RL | Actor–Critic Transformers |
|---|---|---|---|
| Core Idea | Offline sequence modelling of trajectories; imitate high-return behaviour | Learn a predictive dynamics model and plan or imagine futures inside it | Classical actor–critic RL but with transformer networks for policy/critic |
| Learns dynamics model? | ❌ No | ✔ Yes (explicit or latent dynamics) | ❌ No |
| Does planning? | ❌ No explicit planning | ✔ Yes (explicit or implicit) | ❌ No planning (just TD + policy gradient) |
| Uses Bellman equations? | ❌ Never | ✔ Often (Dreamer, MuZero) | ✔ Yes (critic learning) |
| Uses TD learning? | ❌ No | ✔ Yes (usually; except pure planners like Trajectory Transformer planning mode) | ✔ Yes |
| Training regime | Offline only | Typically online, but can combine offline + online | Online RL |
| Policy improvement | ❌ None (no search, no DP) | ✔ Yes (via planning or imagined rollouts) | ✔ Yes (policy gradient) |
| Value function learned? | ❌ No | ✔ Yes (in most models) | ✔ Yes (critic) |
| Reward used | Only to compute return-to-go (RTG) labels | Used in Bellman updates + imagination | Used in TD error for critic |
| Dimension | Decision Transformer | World-Model RL | Actor–Critic Transformers |
|---|---|---|---|
| Primary transformer role | Sequence → next action predictor (GPT-like) | Dynamics model + future predictor + latent planner | Memory encoder for policy/value |
| Handles partial observability? | ✔ Through long context window | ✔ Through latent state + prediction | ✔ Through transformer memory |
| Long-horizon credit assignment | Weak (depends on data quality) | Strong (via imagined rollouts or implicit planning) | Moderate (TD propagation + attention) |
| Required data | High-quality offline trajectories | Real interactions + possibly offline data | Real interactions (on-policy or off-policy) |
| Exploration | ❌ None | ✔ Yes (through policy learning or planning) | ✔ Yes (inherent to actor–critic) |
| Generalises beyond dataset? | ❌ Mostly no | ✔ Yes (model-based planning) | ✔ Yes (online improvement) |
| Analogy | “GPT for actions” | “Agent learns a simulator and plans inside it” | “LSTM actor–critic upgraded to a transformer” |
| Example algorithms | Decision Transformer; Upside-Down RL | DreamerV2/V3, MuZero, Sampled MuZero, Trajectory Transformer (planning), AWM | GTrXL, Transformer-PPO, Transformer-SAC |
| Main strength | Simple, powerful offline learning | Efficient long-horizon reasoning and planning | Strong online learning with rich temporal representations |
| Main weakness | Cannot improve beyond dataset; no true RL | Dynamics learning is hard; model bias | No planning; no model; can be sample-inefficient |