COMP90054 Examinable Material (Semester 2, 2025)

The following topics are either examinable (✅) or not examinable (❌).

7.1 Introduction to Reinforcement Learning

  • ✅ Learning and Planning
  • ✅ Atari Example: Planning
  • ✅ Atari Example (without emulator): Reinforcement Learning
  • ✅ Characteristics of Reinforcement Learning

Examples of Reinforcement Learning

  • ✅ Examples of Reinforcement Learning
  • ❌ Example: Make a humanoid robot walk
  • ❌ Example: Fine-tuning LLMs using human/AI feedback
  • ❌ Example: Optimising operating system routines
  • ❌ Example: Compete in International Mathematical Olympiad
  • ❌ Evolution of RL in the Era of Experience

Rewards

  • ✅ Rewards
  • ✅ Example of Rewards
  • ✅ Sequential Decision Making
  • ✅ Agent and Environment
  • ✅ History and State
  • ✅ Environment State
  • ✅ Agent State
  • ✅ Markov State
  • ✅ Rat Example

Components of an RL Agent

  • ✅ Components of an RL Agent
  • ✅ Policy
  • ✅ Value Function
  • ✅ Model
  • ✅ Maze Example: Policy
  • ✅ Maze Example: Value Function
  • ✅ Maze Example: Model
  • ✅ Categorising RL Agents
  • ✅ Taxonomy
  • ✅ Exploration and Exploitation
  • ✅ Examples
  • ✅ Prediction and Control

7.2 Introduction to MDPs

  • ✅ Markov Property
  • ✅ State Transition Matrix
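
For reference, the two definitions above in the lecture's standard notation: a state \(S_t\) is Markov iff the future is independent of the past given the present, and the state transition matrix \(\mathcal{P}\) collects the one-step transition probabilities (stated here as a textbook reminder rather than quoted from the slides):

\[
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t],
\qquad
\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s].
\]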

Markov Process

  • ✅ Example: Student Markov Chain
  • ✅ Example: Student Markov Chain Episodes & Transition Matrix

Markov Reward Processes

  • ✅ Example: Student MRP
  • ✅ Return
  • ✅ Why discount?
  • ✅ Value Function
  • ✅ Example: Student MRP Returns
  • ✅ Example: State-Value Function for Student MRP
  • ✅ Bellman Equation for MRPs
  • ✅ Example: Bellman Equation for Student MRP
  • ❌ Bellman Equation in Matrix Form
  • ✅ Solving the Bellman Equation
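
As a worked reference for the ✅ Bellman equation items above, the return, the state-value function, and the Bellman equation for an MRP in the standard notation (\(\mathcal{R}_s\) is the expected immediate reward, \(\mathcal{P}_{ss'}\) the transition probability, \(\gamma\) the discount factor):

\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots,
\qquad
v(s) = \mathbb{E}[G_t \mid S_t = s],
\]
\[
v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s').
\]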

Markov Decision Processes

  • ✅ Markov Decision Process
  • ✅ Example: Student MDP
  • ✅ Policies & Recovering a Markov Reward Process from an MDP
  • ✅ Value Function
  • ✅ Example: State-Value Function for Student MDP
  • ✅ Bellman Expectation Equation
  • ✅ Bellman Expectation Equation for \(V^{\pi}\) (look ahead)
  • ✅ Bellman Expectation Equation for \(Q^{\pi}\) (look ahead)
  • ✅ Bellman Expectation Equation for \(v_{\pi}\) (2)
  • ✅ Bellman Expectation Equation for \(q_{\pi}\) (2)
  • ✅ Example: Bellman Expectation Equation in Student MDP
  • ❌ Bellman Expectation Equation (Matrix Form)
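
A compact reference for the Bellman Expectation Equation items above, with \(\pi(a \mid s)\) the policy and \(\mathcal{R}_s^a\), \(\mathcal{P}_{ss'}^a\) the action-conditioned reward and transition model:

\[
v_{\pi}(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_{\pi}(s') \Big),
\qquad
q_{\pi}(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \sum_{a'} \pi(a' \mid s')\, q_{\pi}(s', a').
\]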

Optimal Value Function

  • ✅ Optimal Value Function (Finding the best behaviour)
  • ✅ Example: Optimal Value Function for Student MDP
  • ✅ Example: Optimal Action-Value Function for Student MDP
  • ✅ Optimal Policy
  • ✅ Finding an Optimal Policy
  • ✅ Example: Optimal Policy for Student MDP
  • ✅ Bellman Optimality Equation for \(v_\ast\)
  • ✅ Bellman Optimality Equation for \(Q^*\) (look ahead)
  • ✅ Bellman Optimality Equation for \(V^*\) (2)
  • ✅ Bellman Optimality Equation for \(Q^*\) (2)
  • ✅ Example: Bellman Optimality Equation in Student MDP
  • ✅ Solving the Bellman Optimality Equation
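
For the items above, the Bellman optimality equations in the same notation. They are non-linear because of the \(\max\), so unlike the expectation equations they have no closed-form solution and are solved iteratively (e.g. value iteration, policy iteration, Q-learning, Sarsa):

\[
v_{*}(s) = \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_{*}(s') \Big),
\qquad
q_{*}(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} q_{*}(s', a').
\]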

8 Integrating Learning and Planning (MCTS)

Model-Based and Model-Free RL

  • ✅ Model-Based Reinforcement Learning
  • ✅ Model-Free RL
  • ✅ Model-Based RL
  • ✅ Advantages of Model-Based RL

Learning a Model

  • ✅ What is a Model?
  • ✅ Model Learning
  • ✅ Examples of Models
  • ✅ Table Lookup Model
  • ✅ AB Example (Revisited) - Building a Model
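
A minimal Python sketch of the table-lookup model listed above: real transitions are recorded, and the empirical distribution is then used to generate simulated experience. The class and method names are illustrative, not taken from the lectures.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Table-lookup model: remember every observed (reward, next state) outcome
    for each state-action pair and sample from them uniformly."""

    def __init__(self):
        self.transitions = defaultdict(list)   # (s, a) -> list of (r, s') outcomes seen

    def update(self, s, a, r, s_next):
        """Record one real transition S, A -> R, S'."""
        self.transitions[(s, a)].append((r, s_next))

    def sample(self, s, a):
        """Sample a simulated (reward, next state) from the empirical model."""
        return random.choice(self.transitions[(s, a)])
```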

Planning with a Model

  • ✅ Planning with a Model
  • ✅ Sample-Based Planning
  • ✅ Back to the AB Example
  • ✅ Planning with an Inaccurate Model
  • ✅ Real and Simulated Experience
  • ✅ Integrating Learning and Planning
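
And a sketch of sample-based planning with that model: simulated experience is drawn from the learned model and a model-free update (here a Q-learning-style update, one of the options the lectures mention) is applied to it as if it were real. Function and parameter names are again illustrative.

```python
import random

def sample_based_planning(model, Q, actions, n_updates=1000, alpha=0.1, gamma=1.0):
    """Apply a model-free update to experience sampled from the learned model.

    model:   a TableLookupModel as sketched above
    Q:       dict mapping (state, action) -> value, updated in place
    actions: list of all actions, used for the max over successor actions
    """
    seen = list(model.transitions.keys())       # (s, a) pairs with recorded data
    for _ in range(n_updates):
        s, a = random.choice(seen)              # replay a previously observed state-action pair
        r, s_next = model.sample(s, a)          # simulated transition
        best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
```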

Example: The Game of Go

  • ❌ Example: The Game of Go
  • ❌ Rules of Go
  • ❌ Position Evaluation in Go
  • ❌ Monte-Carlo Evaluation in Go
  • ❌ Applying Monte-Carlo Tree Search
  • ❌ Advantages of MC Tree Search

9 Model-Free Prediction and Control (MC and TD Learning, Sarsa and Q-Learning)

  • ✅ Model-Free Reinforcement Learning

Monte-Carlo Learning

  • ✅ Monte-Carlo Reinforcement Learning
  • ✅ Monte-Carlo Policy Evaluation
  • ✅ First-Visit Monte-Carlo Policy Evaluation
  • ✅ Every-Visit Monte-Carlo Policy Evaluation
  • ✅ First-Visit versus Every-Visit Monte-Carlo Policy Evaluation?
  • ❌ Blackjack Example
  • ❌ Blackjack Value Function after Monte-Carlo Learning
  • ❌ Incremental Mean (Refresher)
  • ❌ Incremental Monte-Carlo Updates (Same idea)
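
A minimal Python sketch of first-visit Monte-Carlo policy evaluation from the ✅ items above. The episode format, a list of \((S_t, R_{t+1})\) pairs, is an assumption made for this example.

```python
from collections import defaultdict

def first_visit_mc_evaluation(episodes, gamma=1.0):
    """Estimate V(s) as the mean return following the first visit to s.

    episodes: list of episodes generated by the policy being evaluated,
              each a list of (state, reward) pairs where reward is R_{t+1}.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t at every time step, working backwards.
        returns_at = [0.0] * len(episode)
        g = 0.0
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            g = r + gamma * g
            returns_at[t] = g
        # Credit only the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns_sum[s] += returns_at[t]
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```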

Temporal-Difference Learning

  • ✅ Temporal-Difference (TD) Learning
  • ✅ MC versus TD
  • ✅ Driving Home Example: MC versus TD
  • ✅ Advantages & Disadvantages of MC versus TD
  • ✅ Bias/Variance Trade-Off
  • ✅ Random Walk Example
  • ✅ Random Walk: MC versus TD
  • ✅ Batch MC and TD
  • ✅ AB Example
  • ✅ Certainty Equivalence
  • ✅ Advantages and Disadvantages of MC versus TD (3)
  • ✅ Monte-Carlo Backup
  • ✅ Temporal-Difference Backup
  • ✅ Tree Search/Dynamic Programming Backup
  • ✅ Bootstrapping and Sampling
  • ❌ Unified View of Reinforcement Learning
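
The two updates contrasted in the MC versus TD items above, in the standard notation (\(\alpha\) is the step size, \(G_t\) the full return; the bracketed TD term is the TD error \(\delta_t\)):

\[
\text{MC:}\quad V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big),
\qquad
\text{TD(0):}\quad V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big).
\]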

TD(\(\lambda\))

  • ✅ \(n\)-Step Prediction
  • ✅ \(n\)-Step Return
  • ✅ Large Random Walk Example
  • ✅ Averaging \(n\)-Step Returns
  • ✅ \(\lambda\)-Return
  • ✅ TD(\(\lambda\)) Weighting Function
  • ✅ Forward View of TD(\(\lambda\))
  • ✅ Forward View of TD(\(\lambda\)) on Large Random Walk
  • ✅ Backward View of TD(\(\lambda\))
  • ✅ Eligibility Traces
  • ✅ Backward View TD(\(\lambda\))
  • ✅ TD(\(\lambda\)) and TD(\(0\))
  • ✅ TD(\(\lambda\)) and MC
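
A compact reference for the forward and backward views above: the \(n\)-step return, the \(\lambda\)-return that averages them, and the eligibility-trace update of backward-view TD(\(\lambda\)):

\[
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\]
\[
E_0(s) = 0, \quad
E_t(s) = \gamma \lambda\, E_{t-1}(s) + \mathbf{1}(S_t = s),
\qquad
V(s) \leftarrow V(s) + \alpha\, \delta_t\, E_t(s) \ \ \text{for all } s,
\quad
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).
\]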

Temporal-Difference Search for MCTS

  • ❌ Temporal-Difference Search
  • ❌ MC versus TD search
  • ❌ TD search
  • ❌ Results of TD search in Go

Model-Free Reinforcement Learning

  • ✅ Uses of Model-Free Control
  • ✅ On and Off-Policy Learning

On-Policy Monte-Carlo Control

  • ✅ Generalised Policy Iteration
  • ✅ Principle of Optimality
  • ✅ Generalised Policy Iteration with Monte-Carlo Evaluation
  • ✅ Model-Free Policy Iteration Using Action-Value Function
  • ✅ Generalised Policy Iteration with Action-Value Function
  • ✅ Example of Greedy Action Selection (Bandit problem)
  • ✅ \(\varepsilon\)-Greedy Exploration
  • ✅ \(\varepsilon\)-Greedy Policy Improvement
  • ✅ Monte-Carlo Policy Iteration
  • ✅ Monte-Carlo Control
  • ❌ Greedy in the Limit with Infinite Exploration (GLIE)
  • ❌ GLIE Monte-Carlo Control
  • ❌ Back to the Blackjack Example
  • ❌ Monte-Carlo Control in Blackjack
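
For the \(\varepsilon\)-greedy items above, the standard \(\varepsilon\)-greedy policy over \(m\) actions, which keeps exploring every action with non-zero probability while still acting greedily most of the time:

\[
\pi(a \mid s) =
\begin{cases}
\varepsilon/m + 1 - \varepsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\
\varepsilon/m & \text{otherwise.}
\end{cases}
\]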

On-Policy Temporal-Difference Learning

  • ✅ MC versus TD Control (Gain efficiency by Bootstrapping)
  • ✅ Updating Action-Value Functions with Sarsa
  • ✅ On-Policy Control with Sarsa
  • ✅ Sarsa Algorithm for On-Policy Control
  • ✅ Convergence of Sarsa (Stochastic optimisation theory)
  • ✅ Windy Gridworld Example
  • ✅ Sarsa on the Windy Gridworld
  • ✅ \(n\)-Step Sarsa (Bias-variance trade-off)
  • ✅ Forward View Sarsa(\(\lambda\))
  • ✅ Backward View Sarsa(\(\lambda\))
  • ✅ Sarsa(\(\lambda\)) Algorithm
  • ✅ Sarsa(\(\lambda\)) Gridworld Example
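
A minimal Python sketch of one episode of tabular Sarsa, matching the ✅ Sarsa items above. The environment interface (reset()/step()/actions()) and the \(\varepsilon\)-greedy helper are assumptions made for this example, not an API from the course.

```python
import random

def sarsa_episode(env, Q, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One episode of tabular Sarsa (on-policy TD control).

    Assumed interface: env.reset() -> state; env.step(a) -> (next_state, reward, done);
    env.actions(state) -> list of legal actions. Q maps (state, action) -> value.
    """
    def epsilon_greedy(state):
        actions = env.actions(state)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        if done:
            a_next, target = None, r                 # no successor value at a terminal state
        else:
            a_next = epsilon_greedy(s_next)          # the action actually taken next (on-policy)
            target = r + gamma * Q.get((s_next, a_next), 0.0)
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (target - q_sa)
        s, a = s_next, a_next
```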

Off-Policy Learning

  • ✅ Off-Policy Learning
  • ✅ Importance Sampling for Off-Policy Monte-Carlo
  • ✅ Importance Sampling for Off-Policy TD
  • ✅ Q-Learning
  • ✅ Off-Policy Control with Q-Learning
  • ✅ Q-Learning Control Algorithm
  • ✅ Q-Learning Algorithm for Off-Policy Control
  • ✅ Cliff Walking Example (Sarsa versus Q-Learning)
  • ✅ Relationship between Tree Backup and TD
  • ✅ Relationship between DP and TD
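
The Q-learning update for the off-policy items above: the behaviour policy (e.g. \(\varepsilon\)-greedy) chooses the actions that are taken, but the update bootstraps from the greedy (target-policy) action, which is what makes it off-policy:

\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Big(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\Big).
\]

Compare Sarsa's target \(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})\), which uses the action actually taken; this is why, in the Cliff Walking example, Q-learning learns the optimal path along the cliff edge while Sarsa learns the longer, safer path.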

Convergence & Contraction Mapping Theorem (Advanced Topic)

  • ❌ Convergence?
  • ❌ Value Function Space
  • ❌ Value Function \(\infty\)-Norm
  • ❌ Bellman Expectation Backup is a Contraction
  • ❌ Contraction Mapping Theorem
  • ❌ Convergence of Policy Evaluation and Policy Iteration
  • ❌ Bellman Optimality Backup is a Contraction
  • ❌ Convergence of Value Iteration

Relationship Between Forward and Backward TD (Advanced Topic)

  • ❌ TD(\(\lambda\)) and TD(\(0\))
  • ❌ TD(\(\lambda\)) and MC
  • ❌ MC and TD(\(1\))
  • ❌ Telescoping in TD(\(1\))
  • ❌ TD(\(\lambda\)) and TD(\(1\))
  • ❌ Telescoping in TD(\(\lambda\))
  • ❌ Forwards and Backwards TD(\(\lambda\))
  • ❌ Offline Equivalence of Forward and Backward TD
  • ❌ Online Equivalence of Forward and Backward TD
  • ❌ Summary of Forward and Backward TD(\(\lambda\))

10 Value Function Approximation (Deep RL)

Value Function Approximation

  • ✅ Large-Scale Reinforcement Learning
  • ✅ Value Function Approximation
  • ✅ Types of Value Function Approximation
  • ✅ Which Function Approximator?

Incremental Methods - Theory

  • ✅ Gradient Descent
  • ✅ Value Function Approximation by Stochastic Gradient Descent
  • ✅ Feature Vectors
  • ✅ Linear Value Function Approximation
  • ✅ Table Lookup Features

Incremental Methods - Algorithms

  • ✅ Incremental Prediction Algorithms
  • ✅ Monte-Carlo with Value Function Approximation
  • ✅ TD Learning with Value Function Approximation
  • ✅ TD(\(\lambda\)) with Value Function Approximation
  • ✅ Control with Value Function Approximation
  • ✅ Action-Value Function Approximation
  • ✅ Linear Action-Value Function Approximation
  • ✅ Incremental Control Algorithms
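
For the incremental prediction and control items above, the linear approximator and its (semi-)gradient TD(0) update; the TD target is replaced by \(G_t\) for MC and \(G_t^{\lambda}\) for TD(\(\lambda\)), and control is identical with \(\hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^{\top}\mathbf{w}\) in place of \(\hat{v}\):

\[
\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^{\top} \mathbf{w},
\qquad
\nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w}) = \mathbf{x}(s),
\qquad
\Delta\mathbf{w} = \alpha\big(R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t).
\]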

Examples (Incremental)

  • ❌ Linear Sarsa with Coarse Coding in Mountain Car
  • ❌ Linear Sarsa with Radial Basis Functions in Mountain Car
  • ❌ Study of \(\lambda\): Should We Bootstrap?
  • ❌ Baird’s Counterexample (TD does not always converge)
  • ❌ Parameter Divergence in Baird’s Counterexample
  • ❌ Convergence of Prediction Algorithms
  • ❌ Gradient Temporal-Difference Learning
  • ❌ Convergence of Control Algorithms

Batch Methods

  • ✅ Batch Reinforcement Learning
  • ✅ Least Squares Prediction
  • ✅ Stochastic Gradient Descent with Experience Replay
  • ✅ Experience Replay in Deep Q-Networks (DQN)
  • ✅ DQN in Atari (model-free learning from images)
  • ✅ DQN Results in Atari Games (above and below human-level performance)
  • ✅ How much does DQN help?
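
A minimal Python sketch of the experience-replay buffer at the heart of the DQN items above: transitions are stored and later sampled in random mini-batches, which de-correlates the data used for gradient updates. The capacity and batch size are illustrative values, not the ones used in the DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions for experience replay."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        """Store one transition of real experience."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Draw a random mini-batch, breaking correlations between consecutive steps."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```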

Linear Least Squares Methods

  • ❌ Linear Least Squares Prediction
  • ❌ Linear Least Squares Prediction Algorithms
  • ❌ Convergence of Linear Least Squares Prediction Algorithms
  • ❌ Least Squares Policy Iteration
  • ❌ Least Squares Action-Value Function Approximation
  • ❌ Least Squares Control
  • ❌ Least Squares Q-Learning
  • ❌ Least Squares Policy Iteration Algorithm
  • ❌ Convergence of Control Algorithms
  • ❌ Chain Walk Example (More complicated random walk)
  • ❌ LSPI in Chain Walk: Action-Value Function
  • ❌ LSPI in Chain Walk: Policy

11 Policy Gradient (Actor Critic)

  • ✅ Policy-Based Reinforcement Learning
  • ✅ Value-Based and Policy-Based RL
  • ✅ Advantages of Policy-Based RL
  • ✅ Example: Rock-Paper-Scissors
  • ✅ Example: Aliased Gridworld
  • ✅ Policy Objective Functions
  • ✅ Policy Optimisation

Finite Difference Policy Gradient

  • ✅ Policy Gradient
  • ✅ Computing Gradients By Finite Differences
  • ✅ Retro Example: Training AIBO to Walk by Finite Difference Policy Gradient
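
The finite-difference gradient estimate behind the items above: perturb each of the \(n\) policy parameters in turn and measure the change in the objective \(J(\theta)\), so \(n\) evaluations are needed per gradient estimate, which is simple and works for non-differentiable policies but is noisy and slow:

\[
\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon\, u_k) - J(\theta)}{\epsilon},
\qquad
u_k \text{ the unit vector along the } k\text{-th parameter.}
\]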

Monte-Carlo Policy Gradient

  • ✅ Score Function
  • ✅ Softmax (Primer)
  • ✅ Softmax Policy
  • ✅ Gaussian Policy
  • ✅ One-Step MDPs
  • ✅ Policy Gradient Theorem
  • ✅ Monte-Carlo Policy Gradient (REINFORCE)
  • ✅ Puck World Example (REINFORCE)
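
The policy gradient theorem and the REINFORCE update for the items above; REINFORCE uses the sampled return \(G_t\) as an unbiased estimate of \(Q^{\pi_\theta}(S_t, A_t)\):

\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_{\theta} \log \pi_{\theta}(s, a)\; Q^{\pi_\theta}(s, a)\big],
\qquad
\Delta\theta_t = \alpha\, \nabla_{\theta} \log \pi_{\theta}(S_t, A_t)\, G_t.
\]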

Actor-Critic Policy Gradient

  • ✅ Reducing Variance Using a Critic
  • ✅ Estimating the Action-Value Function
  • ✅ Action-Value Actor-Critic
  • ✅ Bias in Actor-Critic Algorithms
  • ❌ Compatible Function Approximation Theorem
  • ❌ Proof of Compatible Function Approximation Theorem
  • ✅ Reducing Variance Using a Baseline
  • ✅ Advantage Function
  • ❌ Estimating the Advantage Function (Option 1)
  • ✅ Estimating the Advantage Function (Easier Option 2)
  • ❌ Critics at Different Time-Scales
  • ❌ Actors at Different Time-Scales (Same Idea)
  • ❌ Policy Gradient with Eligibility Traces
  • ❌ Summary of Policy Gradient Algorithms
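
A compact reference for the ✅ actor-critic items above: the critic estimates a value function, the advantage subtracts a baseline to reduce variance, and (the easier option 2 above) the TD error \(\delta\) computed from the critic's estimate \(V_{\mathbf{v}}\) is an unbiased sample of the advantage, giving the actor update:

\[
A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s),
\qquad
\delta = r + \gamma V_{\mathbf{v}}(s') - V_{\mathbf{v}}(s),
\qquad
\Delta\theta = \alpha\, \nabla_{\theta} \log \pi_{\theta}(s, a)\, \delta.
\]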

Natural Actor-Critic (Advanced Topic)

  • ❌ Alternative Policy Gradient Directions
  • ❌ Compatible Function Approximation (CFA)
  • ❌ Compatibility Conditions for Critic
  • ❌ Compatibility Conditions - A Beautiful Result
  • ❌ Natural Policy Gradient - Fisher Information Matrix
  • ❌ Natural Actor-Critic - Compatible Function Approximation
  • ❌ Natural Policy Gradient - Parameterisation Invariant
  • ❌ Natural Actor Critic in Snake Domain

Proximal Policy Optimisation (PPO) (Advanced Topic)

  • ❌ Proximal Policy Optimisation (PPO)
  • ❌ Policy Objective Functions (Revisited)
  • ❌ Surrogate Objective
  • ❌ Clipped Surrogate (Core Idea)
  • ❌ Complete PPO loss
  • ❌ Generalised Advantage Estimation (GAE)
  • ❌ PPO Algorithm
  • ❌ Why PPO works (intuition)
  • ❌ PPO Fine-tuning ChatGPT with Human Feedback (Revisited)
  • ❌ PPO in Agentic AI, Robotics & World Models
  • ❌ Recent Policy Optimisation Techniques

12 Foundation, World and Transformer Models (Advanced Topic)

AlphaGo, AlphaZero & MuZero

  • ❌ AlphaGo - Core Idea
  • ❌ AlphaGo - Neural networks & losses
  • ❌ AlphaGo Planning integration
  • ❌ AlphaZero - Unified Self-Play RL
  • ❌ AlphaZero - Neural network and objective
  • ❌ AlphaZero - Learning & Planning Loop
  • ❌ MuZero - Learning to Plan Without Rules
  • ❌ MuZero - Neural networks (3)
  • ❌ MuZero - TD-Style Learning (Bootstrapped Returns)
  • ❌ MuZero - Training and Integration
  • ❌ MuZero - Results and Significance
  • ❌ AlphaGo, AlphaZero & MuZero Comparison

Foundation, World and Transformer Models

  • ❌ Three Layer Agent Stack
  • ❌ Modern Trends
  • ❌ Requirements of RL for Foundation and World Models
  • ❌ Differentiable Planning: Soft Actor-Critic (SAC)
  • ❌ Differentiable Planning: Dreamer V3
  • ❌ Recurrent State Space Model (RSSM): Dreamer V3
  • ❌ Comparison of MuZero with Dreamer V3
  • ❌ Take-away summary
  • ❌ Distributed RL: IMPALA & V-trace
  • ❌ Representative efficient actor-critic methods
  • ❌ Efficiency and Performance Comparison
  • ❌ RL for Frontier & World Models
  • ❌ Self-Consistency for LLM Reasoning (on Chains of Thought)
  • ❌ Vision Transformers (ViTs)
  • ❌ Vision Transformers - Tokens and Transformer Encoding
  • ❌ RT-X Transformers in Robotics
  • ❌ RL Fine-Tuning & Query Optimisation