COMP90054 Examinable Material (Semester 2, 2025)
The following topics are either examinable (✅) or not examinable (❌).
7.1 Introduction to Reinforcement Learning
- ✅ Learning and Planning
- ✅ Atari Example: Planning
- ✅ Atari Example (without emulator): Reinforcement Learning
- ✅ Characteristics of Reinforcement Learning
Examples of Reinforcement Learning
- ✅ Examples of Reinforcement Learning
- ❌ Example: Make a humanoid robot walk
- ❌ Example: Fine-tuning LLMs using human/AI feedback
- ❌ Example: Optimising operating system routines
- ❌ Example: Compete in the International Mathematical Olympiad
- ❌ Evolution of RL in the Era of Experience
Rewards
- ✅ Rewards
- ✅ Example of Rewards
- ✅ Sequential Decision Making
- ✅ Agent and Environment
- ✅ History and State
- ✅ Environment State
- ✅ Agent State
- ✅ Markov State
- ✅ Rat Example
Components of an RL Agent
- ✅ Components of an RL Agent
- ✅ Policy
- ✅ Value Function
- ✅ Model
- ✅ Maze Example: Policy
- ✅ Maze Example: Value Function
- ✅ Maze Example: Model
- ✅ Categorizing RL agents
- ✅ Taxonomy
- ✅ Exploration and Exploitation
- ✅ Examples
- ✅ Prediction and Control
7.2 Introduction to MDPs
- ✅ Markov Property
- ✅ State Transition Matrix
Markov Process
- ✅ Example: Student Markov Chain
- ✅ Example: Student Markov Chain Episodes & Transition Matrix
Markov Reward Processes
- ✅ Example: Student MRP
- ✅ Return
- ✅ Why discount?
- ✅ Value Function
- ✅ Example: Student MRP Returns
- ✅ Example: State-Value Function for Student MRP
- ✅ Bellman Equation for MRPs
- ✅ Example: Bellman Equation for Student MRP
- ❌ Bellman Equation in Matrix Form
- ✅ Solving the Bellman Equation
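For revision, the key relationship behind the Bellman-equation items above can be sketched as follows (in the standard MRP notation, where \(\mathcal{R}_s\) is the expected immediate reward and \(\mathcal{P}_{ss'}\) the transition probability; these symbols are assumed rather than defined in this list):
\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots,
\qquad
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s').
\]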
Markov Decision Processes
- ✅ Markov Decision Process
- ✅ Example: Student MDP
- ✅ Policies & Recovering a Markov Reward Process from an MDP
- ✅ Value Function
- ✅ Example: State-Value Function for Student MDP
- ✅ Bellman Expectation Equation
- ✅ Bellman Expectation Equation for \(v_{\pi}\) (look-ahead)
- ✅ Bellman Expectation Equation for \(q_{\pi}\) (look-ahead)
- ✅ Bellman Expectation Equation for \(v_{\pi}\) (2)
- ✅ Bellman Expectation Equation for \(q_{\pi}\) (2)
- ✅ Example: Bellman Expectation Equation in Student MDP
- ❌ Bellman Expectation Equation (Matrix Form)
Optimal Value Function
- ✅ Optimal Value Function (Finding the best behaviour)
- ✅ Example: Optimal Value Function for Student MDP
- ✅ Example: Optimal Action-Value Function for Student MDP
- ✅ Optimal Policy
- ✅ Finding an Optimal Policy
- ✅ Example: Optimal Policy for Student MDP
- ✅ Bellman Optimality Equation for \(v_*\)
- ✅ Bellman Optimality Equation for \(q_*\) (look-ahead)
- ✅ Bellman Optimality Equation for \(v_*\) (2)
- ✅ Bellman Optimality Equation for \(q_*\) (2)
- ✅ Example: Bellman Optimality Equation in Student MDP
- ✅ Solving the Bellman Optimality Equation
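As a revision sketch of the optimality items above (same assumed MDP notation, with \(\mathcal{R}_s^a\) and \(\mathcal{P}_{ss'}^a\) conditioned on the action):
\[
v_*(s) = \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_*(s') \Big),
\qquad
q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} q_*(s',a').
\]
Unlike the Bellman expectation equation, these are non-linear because of the \(\max\), which is why the block above ends with iterative solution methods.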
8 Integrating Learning and Planning (MCTS)
Model-Based and Model-Free RL
- ✅ Model-Based Reinforcement Learning
- ✅ Model-Free RL
- ✅ Model-Based RL
- ✅ Advantages of Model-Based RL
Learning a Model
- ✅ What is a Model?
- ✅ Model Learning
- ✅ Examples of Models
- ✅ Table Lookup Model
- ✅ AB Example (Revisited) - Building a Model
Planning with a Model
- ✅ Planning with a Model
- ✅ Sample-Based Planning
- ✅ Back to the AB Example
- ✅ Planning with an Inaccurate Model
- ✅ Real and Simulated Experience
- ✅ Integrating Learning and Planning
Simulation-Based Search
- ✅ Simulation-Based Search
- ✅ Simple Monte-Carlo Search
- ✅ Simple Monte-Carlo Tree Search (Evaluation)
- ✅ Simple Monte-Carlo Tree Search (MCTS)
Example: The Game of Go
- ❌ Example: The Game of Go
- ❌ Rules of Go
- ❌ Position Evaluation in Go
- ❌ Monte-Carlo Evaluation in Go
- ❌ Applying Monte-Carlo Tree Search
- ❌ Advantages of MC Tree Search
9 Model-Free Prediction and Control (MC and TD Learning, Sarsa and Q-Learning)
- ✅ Model-Free Reinforcement Learning
Monte-Carlo Learning
- ✅ Monte-Carlo Reinforcement Learning
- ✅ Monte-Carlo Policy Evaluation
- ✅ First-Visit Monte-Carlo Policy Evaluation
- ✅ Every-Visit Monte-Carlo Policy Evaluation
- ✅ First-Visit versus Every-Visit Monte-Carlo Policy Evaluation?
- ❌ Blackjack Example
- ❌ Blackjack Value Function after Monte-Carlo Learning
- ❌ Incremental Mean (Refresher)
- ❌ Incremental Monte-Carlo Updates (Same idea)
Temporal-Difference Learning
- ✅ Temporal-Difference (TD) Learning
- ✅ MC versus TD
- ✅ Driving Home Example: MC versus TD
- ✅ Advantages & Disadvantages of MC versus TD
- ✅ Bias/Variance Trade-Off
- ✅ Random Walk Example
- ✅ Random Walk: MC versus TD
- ✅ Batch MC and TD
- ✅ AB Example
- ✅ Certainty Equivalence
- ✅ Advantages and Disadvantages of MC versus TD (3)
- ✅ Monte-Carlo Backup
- ✅ Temporal-Difference Backup
- ✅ Tree Search/Dynamic Programming Backup
- ✅ Bootstrapping and Sampling
- ❌ Unified View of Reinforcement Learning
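The MC-versus-TD items above compare two update rules for the same prediction problem (sketch, with step size \(\alpha\)):
\[
\text{MC: } V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big),
\qquad
\text{TD(0): } V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big).
\]
MC waits for the full return \(G_t\) (unbiased, higher variance); TD bootstraps from \(V(S_{t+1})\) (biased, lower variance), which is the bias/variance trade-off listed above.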
TD(\(\lambda\))
- ✅ \(n\)-Step Prediction
- ✅ \(n\)-Step Return
- ✅ Large Random Walk Example
- ✅ Averaging \(n\)-Step Returns
- ✅ \(\lambda\)-return
- ✅ TD(\(\lambda\)) Weighting Function
- ✅ Forward View of TD(\(\lambda\))
- ✅ Forward View of TD(\(\lambda\)) on Large Random Walk
- ✅ Backward View of TD(\(\lambda\))
- ✅ Eligibility Traces
- ✅ Backward View TD(\(\lambda\))
- ✅ TD(\(\lambda\)) and TD(\(0\))
- ✅ TD(\(\lambda\)) and MC
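The TD(\(\lambda\)) items above rest on the \(n\)-step and \(\lambda\)-returns (forward view) and eligibility traces (backward view); a compact sketch:
\[
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\]
\[
E_t(s) = \gamma\lambda\, E_{t-1}(s) + \mathbf{1}(S_t = s),
\qquad
V(s) \leftarrow V(s) + \alpha\, \delta_t\, E_t(s) \text{ for all } s,
\]
where \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is the TD error. Setting \(\lambda = 0\) recovers TD(0), and \(\lambda = 1\) corresponds (offline) to MC, matching the last two items above.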
Temporal-Difference Search for MCTS
- ❌ Temporal-Difference Search
- ❌ MC versus TD search
- ❌ TD search
- ❌ Results of TD search in Go
Model-Free Reinforcement Learning
- ✅ Uses of Model-Free Control
- ✅ On and Off-Policy Learning
On-Policy Monte-Carlo Control
- ✅ Generalised Policy Iteration
- ✅ Principle of Optimality
- ✅ Generalised Policy Iteration with Monte-Carlo Evaluation
- ✅ Model-Free Policy Iteration Using Action-Value Function
- ✅ Generalised Policy Iteration with Action-Value Function
- ✅ Example of Greedy Action Selection (Bandit problem)
- ✅ \(\varepsilon\)-Greedy Exploration
- ✅ \(\varepsilon\)-Greedy Policy Improvement
- ✅ Monte-Carlo Policy Iteration
- ✅ Monte-Carlo Control
- ❌ Greedy in the Limit with Infinite Exploration (GLIE)
- ❌ GLIE Monte-Carlo Control
- ❌ Back to the Blackjack Example
- ❌ Monte-Carlo Control in Blackjack
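The \(\varepsilon\)-greedy items above can be summarised as (sketch, for \(m\) actions):
\[
\pi(a \mid s) =
\begin{cases}
\varepsilon/m + 1 - \varepsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s,a'), \\
\varepsilon/m & \text{otherwise,}
\end{cases}
\]
i.e. every action is tried with non-zero probability (exploration) while the greedy action is preferred (exploitation); the "\(\varepsilon\)-Greedy Policy Improvement" item states that acting \(\varepsilon\)-greedily with respect to \(q_{\pi}\) is guaranteed not to make the policy worse.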
On-Policy Temporal-Difference Learning
- ✅ MC versus TD Control (Gain efficiency by Bootstrapping)
- ✅ Updating Action-Value Functions with Sarsa
- ✅ On-Policy Control with Sarsa
- ✅ Sarsa Algorithm for On-Policy Control
- ✅ Convergence of Sarsa (Stochastic optimisation theory)
- ✅ Windy Gridworld Example
- ✅ Sarsa on the Windy Gridworld
- ✅ \(n\)-Step Sarsa (Bias-variance trade-off)
- ✅ Forward View Sarsa(\(\lambda\))
- ✅ Backward View Sarsa(\(\lambda\))
- ✅ Sarsa(\(\lambda\)) Algorithm
- ✅ Sarsa(\(\lambda\)) Gridworld Example
Off-Policy Learning
- ✅ Off-Policy Learning
- ✅ Importance Sampling for Off-Policy Monte-Carlo
- ✅ Importance Sampling for Off-Policy TD
- ✅ Q-Learning
- ✅ Off-Policy Control with Q-Learning
- ✅ Q-Learning Control Algorithm
- ✅ Q-Learning Algorithm for Off-Policy Control
- ✅ Cliff Walking Example (Sarsa versus Q-Learning)
- ✅ Relationship between Tree Backup and TD
- ✅ Relationship between DP and TD
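The contrast behind the Sarsa and Q-Learning items above (e.g. the Cliff Walking example) is in the bootstrap target (sketch):
\[
\text{Sarsa (on-policy): } Q(S,A) \leftarrow Q(S,A) + \alpha\big(R + \gamma\, Q(S',A') - Q(S,A)\big),
\]
\[
\text{Q-Learning (off-policy): } Q(S,A) \leftarrow Q(S,A) + \alpha\big(R + \gamma \max_{a'} Q(S',a') - Q(S,A)\big),
\]
where Sarsa uses the action \(A'\) actually taken by the behaviour policy, while Q-Learning evaluates the greedy target policy regardless of how the data was generated.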
Convergence & Contraction Mapping Theorem (Advanced Topic)
- ❌ Convergence?
- ❌ Value Function Space
- ❌ Value Function \(\infty\)-Norm
- ❌ Bellman Expectation Backup is a Contraction
- ❌ Contraction Mapping Theorem
- ❌ Convergence of Policy Evaluation and Policy Iteration
- ❌ Bellman Optimality Backup is a Contraction
- ❌ Convergence of Value Iteration
Relationship Between Forward and Backward TD (Advanced Topic)
- ❌ TD(\(\lambda\)) and TD(\(0\))
- ❌ TD(\(\lambda\)) and MC
- ❌ MC and TD(\(1\))
- ❌ Telescoping in TD(\(1\))
- ❌ TD(\(\lambda\)) and TD(\(1\))
- ❌ Telescoping in TD(\(\lambda\))
- ❌ Forwards and Backwards TD(\(\lambda\))
- ❌ Offline Equivalence of Forward and Backward TD
- ❌ Online Equivalence of Forward and Backward TD
- ❌ Summary of Forward and Backward TD(\(\lambda\))
10 Value Function Approximation (Deep RL)
Value Function Approximation
- ✅ Large-Scale Reinforcement Learning
- ✅ Value Function Approximation
- ✅ Types of Value Function Approximation
- ✅ Which Function Approximator?
Incremental Methods - Theory
- ✅ Gradient Descent
- ✅ Value Function Approximation by Stochastic Gradient Descent
- ✅ Feature Vectors
- ✅ Linear Value Function Approximation
- ✅ Table Lookup Features
Incremental Methods - Algorithms
- ✅ Incremental Prediction Algorithms
- ✅ Monte-Carlo with Value Function Approximation
- ✅ TD Learning with Value Function Approximation
- ✅ TD(\(\lambda\)) with Value Function Approximation
- ✅ Control with Value Function Approximation
- ✅ Action-Value Function Approximation
- ✅ Linear Action-Value Function Approximation
- ✅ Incremental Control Algorithms
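The incremental prediction and control items above all instantiate the same stochastic-gradient update (sketch, with parameters \(\mathbf{w}\) and feature vector \(\mathbf{x}(S)\) as assumed notation):
\[
\Delta\mathbf{w} = \alpha\big(\text{target} - \hat{v}(S_t,\mathbf{w})\big)\, \nabla_{\mathbf{w}} \hat{v}(S_t,\mathbf{w}),
\]
with target \(G_t\) for MC, \(R_{t+1} + \gamma\, \hat{v}(S_{t+1},\mathbf{w})\) for TD(0), and \(G_t^{\lambda}\) for TD(\(\lambda\)); in the linear case \(\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)^{\top}\mathbf{w}\), so the gradient is simply \(\mathbf{x}(S)\). The control items replace \(\hat{v}\) with \(\hat{q}(S,A,\mathbf{w})\).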
Examples (Incremental)
- ❌ Linear Sarsa with Coarse Coding in Mountain Car
- ❌ Linear Sarsa with Radial Basis Functions in Mountain Car
- ❌ Study of \(\lambda\): Should We Bootstrap?
- ❌ Baird's Counterexample: TD does not always converge
- ❌ Parameter Divergence in Baird’s Counterexample
- ❌ Convergence of Prediction Algorithms
- ❌ Gradient Temporal-Difference Learning
- ❌ Convergence of Control Algorithms
Batch Methods
- ✅ Batch Reinforcement Learning
- ✅ Least Squares Prediction
- ✅ Stochastic Gradient Descent with Experience Replay
- ✅ Experience Replay in Deep Q-Networks (DQN)
- ✅ DQN in Atari (model-free learning from images)
- ✅ DQN Results in Atari Games (above and below human level)
- ✅ How much does DQN help?
Linear Least Squares Methods
- ❌ Linear Least Squares Prediction
- ❌ Linear Least Squares Prediction Algorithms
- ❌ Convergence of Linear Least Squares Prediction Algorithms
- ❌ Least Squares Policy Iteration
- ❌ Least Squares Action-Value Function Approximation
- ❌ Least Squares Control
- ❌ Least Squares Q-Learning
- ❌ Least Squares Policy Iteration Algorithm
- ❌ Convergence of Control Algorithms
- ❌ Chain Walk Example (More complicated random walk)
- ❌ LSPI in Chain Walk: Action-Value Function
- ❌ LSPI in Chain Walk: Policy
11 Policy Gradient (Actor Critic)
- ✅ Policy-Based Reinforcement Learning
- ✅ Value-Based and Policy-Based RL
- ✅ Advantages of Policy-Based RL
- ✅ Example: Rock-Paper-Scissors
- ✅ Example: Aliased Gridworld
- ✅ Policy Objective Functions
- ✅ Policy Optimisation
Finite Difference Policy Gradient
- ✅ Policy Gradient
- ✅ Computing Gradients By Finite Differences
- ✅ Retro Example: Training AIBO to Walk by Finite Difference Policy Gradient
Monte-Carlo Policy Gradient
- ✅ Score Function
- ✅ Softmax (Primer)
- ✅ Softmax Policy
- ✅ Gaussian Policy
- ✅ One-Step MDPs
- ✅ Policy Gradient Theorem
- ✅ Monte-Carlo Policy Gradient (REINFORCE)
- ✅ Puck World Example (REINFORCE)
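The examinable core of this block is the policy gradient theorem and the REINFORCE update it licenses (sketch, in the \(\pi_\theta(s,a)\) notation of the score-function items above):
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a) \right],
\qquad
\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(S_t,A_t)\, G_t,
\]
where REINFORCE uses the sampled return \(G_t\) as an unbiased estimate of \(Q^{\pi_\theta}(S_t,A_t)\).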
Actor-Critic Policy Gradient
- ✅ Reducing Variance Using a Critic
- ✅ Estimating the Action-Value Function
- ✅ Action-Value Actor-Critic
- ✅ Bias in Actor-Critic Algorithms
- ❌ Compatible Function Approximation Theorem
- ❌ Proof of Compatible Function Approximation Theorem
- ✅ Reducing Variance Using a Baseline
- ✅ Advantage Function
- ❌ Estimating the Advantage Function (Option 1)
- ✅ Estimating the Advantage Function (Easier Option 2)
- ❌ Critics at Different Time-Scales
- ❌ Actors at Different Time-Scales (Same Idea)
- ❌ Policy Gradient with Eligibility Traces
- ❌ Summary of Policy Gradient Algorithms
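For the examinable actor-critic items, the advantage-based form (the "Easier Option 2" above) can be sketched as:
\[
A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s),
\qquad
\delta = R + \gamma\, V_{\mathbf{v}}(S') - V_{\mathbf{v}}(S),
\qquad
\Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(S,A)\, \delta,
\]
where the TD error \(\delta\), computed from a learned critic \(V_{\mathbf{v}}\), serves as a sample estimate of the advantage, combining the variance reduction of a baseline with a single set of critic parameters.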
Natural Actor-Critic (Advanced Topic)
- ❌ Alternative Policy Gradient Directions
- ❌ Compatible Function Approximation (CFA)
- ❌ Compatibility Conditions for Critic
- ❌ Compatibility Conditions—A Beautiful Result
- ❌ Natural Policy Gradient—Fisher Information Matrix
- ❌ Natural Actor-Critic—Compatible Function Approximation
- ❌ Natural Policy Gradient—Parameterisation Invariant
- ❌ Natural Actor Critic in Snake Domain
Proximal Policy Optimisation (PPO) (Advanced Topic)
- ❌ Proximal Policy Optimisation (PPO)
- ❌ Policy Objective Functions (Revisited)
- ❌ Surrogate Objective
- ❌ Clipped Surrogate (Core Idea)
- ❌ Complete PPO loss
- ❌ Generalised Advantage Estimation (GAE)
- ❌ PPO Algorithm
- ❌ Why PPO works (intuition)
- ❌ PPO Fine-tuning of ChatGPT with human/AI feedback (Revisited)
- ❌ PPO in Agentic AI, Robotics & World Models
- ❌ Recent Policy Optimisation Techniques
12 Foundation, World and Transformer Models (Advanced Topic)
AlphaGo, AlphaZero & MuZero
- ❌ AlphaGo - Core Idea
- ❌ AlphaGo - Neural networks & losses
- ❌ AlphaGo - Planning integration
- ❌ AlphaZero - Unified Self-Play RL
- ❌ AlphaZero - Neural network and objective
- ❌ AlphaZero - Learning & Planning Loop
- ❌ MuZero - Learning to Plan Without Rules
- ❌ MuZero - Neural networks (3)
- ❌ MuZero - TD-Style Learning (Bootstrapped Returns)
- ❌ MuZero - Training and Integration
- ❌ MuZero - Results and Significance
- ❌ AlphaGo, AlphaZero & MuZero Comparison
Foundation, World and Transformer Models
- ❌ Three Layer Agent Stack
- ❌ Modern Trends
- ❌ Requirements of RL for Foundation and World Models
- ❌ Differentiable Planning: Soft Actor-Critic (SAC)
- ❌ Differentiable Planning: Dreamer V3
- ❌ Recurrent State Space Model (RSSM): Dreamer V3
- ❌ Comparison of MuZero with Dreamer V3
- ❌ Take-away summary
- ❌ Distributed RL: IMPALA & V-trace
- ❌ Representative efficient actor-critic methods
- ❌ Efficiency and Performance Comparison
- ❌ RL for Frontier & World Models
- ❌ Self-Consistency for LLM Reasoning (on Chains of Thought)
- ❌ Vision Transformers (ViTs)
- ❌ Vision Transformers - Tokens and Transformer Encoding
- ❌ RT-X Transformers in Robotics
- ❌ RL Fine-Tuning & Query Optimisation