COMP90054 Examinable Material (Semester 2, 2025)
The following topics are either examinable (✅) or not examinable (❌).
7.1 Introduction to Reinforcement Learning
- ✅ Learning and Planning
- ✅ Atari Example: Planning
- ✅ Atari Example (without emulator): Reinforcement Learning
- ✅ Characteristics of Reinforcement Learning
Examples of Reinforcement Learning
- ✅ Examples of Reinforcement Learning
- ❌ Example: Make a humanoid robot walk
- ❌ Example: Fine-tuning LLMs using human/AI feedback
- ❌ Example: Optimising operating system routines
- ❌ Example: Compete in the International Mathematical Olympiad
- ❌ Evolution of RL in the Era of Experience
Rewards
- ✅ Rewards
- ✅ Example of Rewards
- ✅ Sequential Decision Making
- ✅ Agent and Environment
- ✅ History and State
- ✅ Environment State
- ✅ Agent State
- ✅ Markov State
- ✅ Rat Example
Components of an RL Agent
- ✅ Components of an RL Agent
- ✅ Policy
- ✅ Value Function
- ✅ Model
- ✅ Maze Example: Policy
- ✅ Maze Example: Value Function
- ✅ Maze Example: Model
- ✅ Categorizing RL agents
- ✅ Taxonomy
- ✅ Exploration and Exploitation
- ✅ Examples
- ✅ Prediction and Control
7.2 Introduction to MDPs
- ✅ Markov Property
- ✅ State Transition Matrix
Markov Process
- ✅ Example: Student Markov Chain
- ✅ Example: Student Markov Chain Episodes & Transition Matrix
Markov Reward Processes
- ✅ Example: Student MRP
- ✅ Return
- ✅ Why discount?
- ✅ Value Function
- ✅ Example: Student MRP Returns
- ✅ Example: State-Value Function for Student MRP
- ✅ Bellman Equation for MRPs
- ✅ Example: Bellman Equation for Student MRP
- ❌ Bellman Equation in Matrix Form
- ✅ Solving the Bellman Equation
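For revision, the key relationship behind the Bellman-equation items above can be sketched as follows (in the standard MRP notation, where \(\mathcal{R}_s\) is the expected immediate reward and \(\mathcal{P}_{ss'}\) the transition probability; these symbols are assumed rather than defined in this list):
\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots,
\qquad
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s').
\]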
Markov Decision Processes
- ✅ Markov Decision Process
- ✅ Example: Student MDP
- ✅ Policies & Recovering a Markov Reward Process from an MDP
- ✅ Value Function
- ✅ Example: State-Value Function for Student MDP
- ✅ Bellman Expectation Equation
- ✅ Bellman Expectation Equation for \(v_{\pi}\) (look-ahead)
- ✅ Bellman Expectation Equation for \(q_{\pi}\) (look-ahead)
- ✅ Bellman Expectation Equation for \(v_{\pi}\) (2)
- ✅ Bellman Expectation Equation for \(q_{\pi}\) (2)
- ✅ Example: Bellman Expectation Equation in Student MDP
- ❌ Bellman Expectation Equation (Matrix Form)
Optimal Value Function
- ✅ Optimal Value Function (Finding the best behaviour)
- ✅ Example: Optimal Value Function for Student MDP
- ✅ Example: Optimal Action-Value Function for Student MDP
- ✅ Optimal Policy
- ✅ Finding an Optimal Policy
- ✅ Example: Optimal Policy for Student MDP
- ✅ Bellman Optimality Equation for \(v_*\)
- ✅ Bellman Optimality Equation for \(q_*\) (look-ahead)
- ✅ Bellman Optimality Equation for \(v_*\) (2)
- ✅ Bellman Optimality Equation for \(q_*\) (2)
- ✅ Example: Bellman Optimality Equation in Student MDP
- ✅ Solving the Bellman Optimality Equation
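As a revision sketch of the optimality items above (same assumed MDP notation, with \(\mathcal{R}_s^a\) and \(\mathcal{P}_{ss'}^a\) conditioned on the action):
\[
v_*(s) = \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_*(s') \Big),
\qquad
q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} q_*(s',a').
\]
Unlike the Bellman expectation equation, these are non-linear because of the \(\max\), which is why the block above ends with iterative solution methods.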
8 Integrating Learning and Planning (MCTS)
Model-Based and Model-Free RL
- ✅ Model-Based Reinforcement Learning
- ✅ Model-Free RL
- ✅ Model-Based RL
- ✅ Advantages of Model-Based RL
Learning a Model
- ✅ What is a Model?
- ✅ Model Learning
- ✅ Examples of Models
- ✅ Table Lookup Model
- ✅ AB Example (Revisited) - Building a Model
Planning with a Model
- ✅ Planning with a Model
- ✅ Sample-Based Planning
- ✅ Back to the AB Example
- ✅ Planning with an Inaccurate Model
- ✅ Real and Simulated Experience
- ✅ Integrating Learning and Planning
Simulation-Based Search
- ✅ Simulation-Based Search
- ✅ Simple Monte-Carlo Search
- ✅ Simple Monte-Carlo Tree Search (Evaluation)
- ✅ Simple Monte-Carlo Tree Search (MCTS)
Example: The Game of Go
- ❌ Example: The Game of Go
- ❌ Rules of Go
- ❌ Position Evaluation in Go
- ❌ Monte-Carlo Evaluation in Go
- ❌ Applying Monte-Carlo Tree Search
- ❌ Advantages of MC Tree Search
9 Model-Free Prediction and Control (MC and TD Learning, Sarsa and Q-Learning)
- ✅ Model-Free Reinforcement Learning
Monte-Carlo Learning
- ✅ Monte-Carlo Reinforcement Learning
- ✅ Monte-Carlo Policy Evaluation
- ✅ First-Visit Monte-Carlo Policy Evaluation
- ✅ Every-Visit Monte-Carlo Policy Evaluation
- ✅ First-Visit versus Every-Visit Monte-Carlo Policy Evaluation?
- ❌ Blackjack Example
- ❌ Blackjack Value Function after Monte-Carlo Learning
- ❌ Incremental Mean (Refresher)
- ❌ Incremental Monte-Carlo Updates (Same idea)
Temporal-Difference Learning
- ✅ Temporal-Difference (TD) Learning
- ✅ MC versus TD
- ✅ Driving Home Example: MC versus TD
- ✅ Advantages & Disadvantages of MC versus TD
- ✅ Bias/Variance Trade-Off
- ✅ Random Walk Example
- ✅ Random Walk: MC versus TD
- ✅ Batch MC and TD
- ✅ AB Example
- ✅ Certainty Equivalence
- ✅ Advantages and Disadvantages of MC versus TD (3)
- ✅ Monte-Carlo Backup
- ✅ Temporal-Difference Backup
- ✅ Tree Search/Dynamic Programming Backup
- ✅ Bootstrapping and Sampling
- ❌ Unified View of Reinforcement Learning
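The MC-versus-TD items above compare two update rules for the same prediction problem (sketch, with step size \(\alpha\)):
\[
\text{MC: } V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big),
\qquad
\text{TD(0): } V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big).
\]
MC waits for the full return \(G_t\) (unbiased, higher variance); TD bootstraps from \(V(S_{t+1})\) (biased, lower variance), which is the bias/variance trade-off listed above.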
TD(\(\lambda\))
- ✅ \(n\)-Step Prediction
- ✅ \(n\)-Step Return
- ✅ Large Random Walk Example
- ✅ Averaging \(n\)-Step Returns
- ✅ \(\lambda\)-return
- ✅ TD(\(\lambda\)) Weighting Function
- ✅ Forward View of TD(\(\lambda\))
- ✅ Forward View of TD(\(\lambda\)) on Large Random Walk
- ✅ Backward View of TD(\(\lambda\))
- ✅ Eligibility Traces
- ✅ Backward View TD(\(\lambda\))
- ✅ TD(\(\lambda\)) and TD(\(0\))
- ✅ TD(\(\lambda\)) and MC
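The TD(\(\lambda\)) items above rest on the \(n\)-step and \(\lambda\)-returns (forward view) and eligibility traces (backward view); a compact sketch:
\[
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\]
\[
E_t(s) = \gamma\lambda\, E_{t-1}(s) + \mathbf{1}(S_t = s),
\qquad
V(s) \leftarrow V(s) + \alpha\, \delta_t\, E_t(s) \text{ for all } s,
\]
where \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is the TD error. Setting \(\lambda = 0\) recovers TD(0), and \(\lambda = 1\) corresponds (offline) to MC, matching the last two items above.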
Temporal-Difference Search for MCTS
- ❌ Temporal-Difference Search
- ❌ MC versus TD search
- ❌ TD search
- ❌ Results of TD search in Go
Model-Free Reinforcement Learning
- ✅ Uses of Model-Free Control
- ✅ On and Off-Policy Learning
On-Policy Monte-Carlo Control
- ✅ Generalised Policy Iteration
- ✅ Principle of Optimality
- ✅ Generalised Policy Iteration with Monte-Carlo Evaluation
- ✅ Model-Free Policy Iteration Using Action-Value Function
- ✅ Generalised Policy Iteration with Action-Value Function
- ✅ Example of Greedy Action Selection (Bandit problem)
- ✅ \(\varepsilon\)-Greedy Exploration
- ✅ \(\varepsilon\)-Greedy Policy Improvement
- ✅ Monte-Carlo Policy Iteration
- ✅ Monte-Carlo Control
- ❌ Greedy in the Limit with Infinite Exploration (GLIE)
- ❌ GLIE Monte-Carlo Control
- ❌ Back to the Blackjack Example
- ❌ Monte-Carlo Control in Blackjack
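The \(\varepsilon\)-greedy items above can be summarised as (sketch, for \(m\) actions):
\[
\pi(a \mid s) =
\begin{cases}
\varepsilon/m + 1 - \varepsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s,a'), \\
\varepsilon/m & \text{otherwise,}
\end{cases}
\]
i.e. every action is tried with non-zero probability (exploration) while the greedy action is preferred (exploitation); the "\(\varepsilon\)-Greedy Policy Improvement" item states that acting \(\varepsilon\)-greedily with respect to \(q_{\pi}\) is guaranteed not to make the policy worse.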
On-Policy Temporal-Difference Learning
- ✅ MC versus TD Control (Gain efficiency by Bootstrapping)
- ✅ Updating Action-Value Functions with Sarsa
- ✅ On-Policy Control with Sarsa
- ✅ Sarsa Algorithm for On-Policy Control
- ✅ Convergence of Sarsa (Stochastic optimisation theory)
- ✅ Windy Gridworld Example
- ✅ Sarsa on the Windy Gridworld
- ✅ \(n\)-Step Sarsa (Bias-variance trade-off)
- ✅ Forward View Sarsa(\(\lambda\))
- ✅ Backward View Sarsa(\(\lambda\))
- ✅ Sarsa(\(\lambda\)) Algorithm
- ✅ Sarsa(\(\lambda\)) Gridworld Example
Off-Policy Learning
- ✅ Off-Policy Learning
- ✅ Importance Sampling for Off-Policy Monte-Carlo
- ✅ Importance Sampling for Off-Policy TD
- ✅ Q-Learning
- ✅ Off-Policy Control with Q-Learning
- ✅ Q-Learning Control Algorithm
- ✅ Q-Learning Algorithm for Off-Policy Control
- ✅ Cliff Walking Example (Sarsa versus Q-Learning)
- ✅ Relationship between Tree Backup and TD
- ✅ Relationship between DP and TD
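The contrast behind the Sarsa and Q-Learning items above (e.g. the Cliff Walking example) is in the bootstrap target (sketch):
\[
\text{Sarsa (on-policy): } Q(S,A) \leftarrow Q(S,A) + \alpha\big(R + \gamma\, Q(S',A') - Q(S,A)\big),
\]
\[
\text{Q-Learning (off-policy): } Q(S,A) \leftarrow Q(S,A) + \alpha\big(R + \gamma \max_{a'} Q(S',a') - Q(S,A)\big),
\]
where Sarsa uses the action \(A'\) actually taken by the behaviour policy, while Q-Learning evaluates the greedy target policy regardless of how the data was generated.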
Convergence & Contraction Mapping Theorem (Advanced Topic)
- ❌ Convergence?
- ❌ Value Function Space
- ❌ Value Function \(\infty\)-Norm
- ❌ Bellman Expectation Backup is a Contraction
- ❌ Contraction Mapping Theorem
- ❌ Convergence of Policy Evaluation and Policy Iteration
- ❌ Bellman Optimality Backup is a Contraction
- ❌ Convergence of Value Iteration
Relationship Between Forward and Backward TD (Advanced Topic)
- ❌ TD(\(\lambda\)) and TD(\(0\))
- ❌ TD(\(\lambda\)) and MC
- ❌ MC and TD(\(1\))
- ❌ Telescoping in TD(\(1\))
- ❌ TD(\(\lambda\)) and TD(\(1\))
- ❌ Telescoping in TD(\(\lambda\))
- ❌ Forwards and Backwards TD(\(\lambda\))
- ❌ Offline Equivalence of Forward and Backward TD
- ❌ Online Equivalence of Forward and Backward TD
- ❌ Summary of Forward and Backward TD(\(\lambda\))
10 Value Function Approximation (Deep RL)
Value Function Approximation
- ✅ Large-Scale Reinforcement Learning
- ✅ Value Function Approximation
- ✅ Types of Value Function Approximation
- ✅ Which Function Approximator?
Incremental Methods - Theory
- ✅ Gradient Descent
- ✅ Value Function Approximation by Stochastic Gradient Descent
- ✅ Feature Vectors
- ✅ Linear Value Function Approximation
- ✅ Table Lookup Features
Incremental Methods - Algorithms
- ✅ Incremental Prediction Algorithms
- ✅ Monte-Carlo with Value Function Approximation
- ✅ TD Learning with Value Function Approximation
- ✅ TD(\(\lambda\)) with Value Function Approximation
- ✅ Control with Value Function Approximation
- ✅ Action-Value Function Approximation
- ✅ Linear Action-Value Function Approximation
- ✅ Incremental Control Algorithms
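The incremental prediction and control items above all instantiate the same stochastic-gradient update (sketch, with parameters \(\mathbf{w}\) and feature vector \(\mathbf{x}(S)\) as assumed notation):
\[
\Delta\mathbf{w} = \alpha\big(\text{target} - \hat{v}(S_t,\mathbf{w})\big)\, \nabla_{\mathbf{w}} \hat{v}(S_t,\mathbf{w}),
\]
with target \(G_t\) for MC, \(R_{t+1} + \gamma\, \hat{v}(S_{t+1},\mathbf{w})\) for TD(0), and \(G_t^{\lambda}\) for TD(\(\lambda\)); in the linear case \(\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)^{\top}\mathbf{w}\), so the gradient is simply \(\mathbf{x}(S)\). The control items replace \(\hat{v}\) with \(\hat{q}(S,A,\mathbf{w})\).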
Examples (Incremental)
- ❌ Linear Sarsa with Coarse Coding in Mountain Car
- ❌ Linear Sarsa with Radial Basis Functions in Mountain Car
- ❌ Study of \(\lambda\): Should We Bootstrap?
- ❌ Baird's Counterexample: TD does not always converge
- ❌ Parameter Divergence in Baird’s Counterexample
- ❌ Convergence of Prediction Algorithms
- ❌ Gradient Temporal-Difference Learning
- ❌ Convergence of Control Algorithms
Batch Methods
- ✅ Batch Reinforcement Learning
- ✅ Least Squares Prediction
- ✅ Stochastic Gradient Descent with Experience Replay
- ✅ Experience Replay in Deep Q-Networks (DQN)
- ✅ DQN in Atari (model-free learning from images)
- ✅ DQN Results in Atari Games (above and below human level)
- ✅ How much does DQN help?
Linear Least Squares Methods
- ❌ Linear Least Squares Prediction
- ❌ Linear Least Squares Prediction Algorithms
- ❌ Convergence of Linear Least Squares Prediction Algorithms
- ❌ Least Squares Policy Iteration
- ❌ Least Squares Action-Value Function Approximation
- ❌ Least Squares Control
- ❌ Least Squares Q-Learning
- ❌ Least Squares Policy Iteration Algorithm
- ❌ Convergence of Control Algorithms
- ❌ Chain Walk Example (More complicated random walk)
- ❌ LSPI in Chain Walk: Action-Value Function
- ❌ LSPI in Chain Walk: Policy
11 Policy Gradient (Actor Critic)
- ✅ Policy-Based Reinforcement Learning
- ✅ Value-Based and Policy-Based RL
- ✅ Advantages of Policy-Based RL
- ✅ Example: Rock-Paper-Scissors
- ✅ Example: Aliased Gridworld
- ✅ Policy Objective Functions
- ✅ Policy Optimisation
Finite Difference Policy Gradient
- ✅ Policy Gradient
- ✅ Computing Gradients By Finite Differences
- ✅ Retro Example: Training AIBO to Walk by Finite Difference Policy Gradient
Monte-Carlo Policy Gradient
- ✅ Score Function
- ✅ Softmax (Primer)
- ✅ Softmax Policy
- ✅ Gaussian Policy
- ✅ One-Step MDPs
- ✅ Policy Gradient Theorem
- ✅ Monte-Carlo Policy Gradient (REINFORCE)
- ✅ Puck World Example (REINFORCE)
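The examinable core of this block is the policy gradient theorem and the REINFORCE update it licenses (sketch, in the \(\pi_\theta(s,a)\) notation of the score-function items above):
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a) \right],
\qquad
\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(S_t,A_t)\, G_t,
\]
where REINFORCE uses the sampled return \(G_t\) as an unbiased estimate of \(Q^{\pi_\theta}(S_t,A_t)\).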
Actor-Critic Policy Gradient
- ✅ Reducing Variance Using a Critic
- ✅ Estimating the Action-Value Function
- ✅ Action-Value Actor-Critic
- ✅ Bias in Actor-Critic Algorithms
- ❌ Compatible Function Approximation Theorem
- ❌ Proof of Compatible Function Approximation Theorem
- ✅ Reducing Variance Using a Baseline
- ✅ Advantage Function
- ❌ Estimating the Advantage Function (Option 1)
- ✅ Estimating the Advantage Function (Easier Option 2)
- ❌ Critics at Different Time-Scales
- ❌ Actors at Different Time-Scales (Same Idea)
- ❌ Policy Gradient with Eligibility Traces
- ❌ Summary of Policy Gradient Algorithms
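For the examinable actor-critic items, the advantage-based form (the "Easier Option 2" above) can be sketched as:
\[
A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s),
\qquad
\delta = R + \gamma\, V_{\mathbf{v}}(S') - V_{\mathbf{v}}(S),
\qquad
\Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(S,A)\, \delta,
\]
where the TD error \(\delta\), computed from a learned critic \(V_{\mathbf{v}}\), serves as a sample estimate of the advantage, combining the variance reduction of a baseline with a single set of critic parameters.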
Natural Actor-Critic (Advanced Topic)
- ❌ Alternative Policy Gradient Directions
- ❌ Compatible Function Approximation (CFA)
- ❌ Compatibility Conditions for Critic
- ❌ Compatibility Conditions—A Beautiful Result
- ❌ Natural Policy Gradient—Fisher Information Matrix
- ❌ Natural Actor-Critic—Compatible Function Approximation
- ❌ Natural Policy Gradient—Parameterisation Invariant
- ❌ Natural Actor Critic in Snake Domain
Proximal Policy Optimisation (PPO) (Advanced Topic)
- ❌ Proximal Policy Optimisation (PPO)
- ❌ Policy Objective Functions (Revisited)
- ❌ Surrogate Objective
- ❌ Clipped Surrogate (Core Idea)
- ❌ Complete PPO loss
- ❌ Generalised Advantage Estimation (GAE)
- ❌ PPO Algorithm
- ❌ Why PPO works (intuition)
- ❌ PPO Fine-tuning of ChatGPT with human/AI feedback (Revisited)
- ❌ PPO in Agentic AI, Robotics & World Models
- ❌ Recent Policy Optimisation Techniques
12 Foundation, World and Transformer Models (Advanced Topic)
AlphaGo, AlphaZero & MuZero
- ❌ AlphaGo - Core Idea
- ❌ AlphaGo - Neural networks & losses
- ❌ AlphaGo - Planning integration
- ❌ AlphaZero - Unified Self-Play RL
- ❌ AlphaZero - Neural network and objective
- ❌ AlphaZero - Learning & Planning Loop
- ❌ MuZero - Learning to Plan Without Rules
- ❌ MuZero - Neural networks (3)
- ❌ MuZero - TD-Style Learning (Bootstrapped Returns)
- ❌ MuZero - Training and Integration
- ❌ MuZero - Results and Significance
- ❌ AlphaGo, AlphaZero & MuZero Comparison
Foundation, World and Transformer Models
- ❌ Three Layer Agent Stack
- ❌ Modern Trends
- ❌ Requirements of RL for Foundation and World Models
- ❌ Differentiable Planning: Soft Actor-Critic (SAC)
- ❌ Differentiable Planning: Dreamer V3
- ❌ Recurrent State Space Model (RSSM): Dreamer V3
- ❌ Comparison of MuZero with Dreamer V3
- ❌ Take-away summary
- ❌ Distributed RL: IMPALA & V-trace
- ❌ Representative efficient actor-critic methods
- ❌ Efficiency and Performance Comparison
- ❌ RL for Frontier & World Models
- ❌ Self-Consistency for LLM Reasoning (on Chains of Thought)
- ❌ Vision Transformers (ViTs)
- ❌ Vision Transformers - Tokens and Transformer Encoding
- ❌ RT-X Transformers in Robotics
- ❌ RL Fine-Tuning & Query Optimisation