7.1 Introduction to Reinforcement Learning
Topics
Threads
First thread: prediction & planning:
- Monte Carlo (MC) prediction (episodic evaluation).
- Temporal Difference (TD) learning (bootstrapping).
- Monte Carlo Tree Search (MCTS) (planning with a model).
- Exploration vs exploitation (multi-armed bandits; upper confidence bounds (UCB) and their tree variant (UCT))
Second thread: model-free control:
- MC control (prediction + policy improvement loop).
- TD control → Q-learning, SARSA.
Third thread: approximation:
- Value function approximation (linear → deep).
- Policy approximation (policy gradients).
- Actor–critic (integration of both).
Readings & Textbook
COMP90054 Reinforcement Learning page - includes readings for each module and slides from live lectures
- Available as html book at: https://comp90054.github.io/reinforcement-learning/
Reinforcement Learning: An Introduction, Second Edition, Sutton and Barto, MIT Press, 2020
- Available free at: http://www.incompleteideas.net/book/RLbook2020.pdf
Learning and Planning
Two fundamental problems in sequential decision making
Planning (first half of COMP90054):
A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy through search, deliberation, reasoning, and introspection
Reinforcement Learning (this half of COMP90054):
The environment is initially unknown
The agent interacts with the environment
The agent improves its policy
Atari Example

Atari Example: Planning
Rules of the game are known
Can query emulator (simulator)
- i.e. a perfect model inside the agent’s brain
If I take action \(a\) from state \(s\):
what would the next state \(s'\) be?
what would the score be?
Plan ahead to find the optimal policy, e.g. heuristic tree search, novelty-based search, etc.
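As a rough sketch of what this kind of planning looks like in code (the `emulator` interface and deterministic scoring here are illustrative assumptions, not part of the lecture):

```python
# Minimal sketch of planning with a perfect emulator (hypothetical interface).
# Assumes emulator.step(state, action) deterministically returns (next_state, score);
# the agent never interacts with the real environment during planning.

def plan(emulator, state, actions, depth):
    """Exhaustive depth-limited lookahead; returns (best value, best first action)."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        next_state, score = emulator.step(state, a)      # query the internal model
        future_value, _ = plan(emulator, next_state, actions, depth - 1)
        if score + future_value > best_value:
            best_value, best_action = score + future_value, a
    return best_value, best_action
```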

Atari Example (without emulator): Reinforcement Learning

Rules of the game are unknown
Pick joystick actions, only see pixels & scores
Learn directly from interactive game-play
Characteristics of Reinforcement Learning
What makes reinforcement learning different from automated planning?
- The outcomes of actions are non-deterministic
- It therefore uses a probabilistic representation of action outcomes
What makes reinforcement learning different from other machine learning paradigms?
- Sequence matters, i.e. it involves non-i.i.d. (independent and identically distributed) data
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
Examples of Reinforcement Learning
- Making a humanoid robot walk
- Fine tuning LLMs using human/AI feedback
- Optimising operating system routines
- Controlling a power station
- Managing an investment portfolio
- Competing in the International Mathematical Olympiad
- Multi-agent pathfinding by robots in a warehouse (COMP90054 Assignment A2)
Example: Make a humanoid robot walk
Atlas demonstrates policies trained with RL from human motion capture and animation (Boston Dynamics, 2025)
Example: Fine tuning LLMs using human/AI feedback
Proximal Policy Optimisation (PPO) was used for ChatGPT 3.5 and for agentic AI such as ChatGPT ‘Operator’ and Claude’s ‘Computer Use’. GPT-4 now uses Direct Preference Optimisation (DPO).
Group Relative Policy Optimisation (GRPO), which is more stable than PPO, is used in DeepSeek’s R1.
Example: Optimising operating system routines
Daniel J. Mankowitz et al., Faster sorting algorithms discovered using deep reinforcement learning, Nature, Vol. 618, pp. 257-263, 2023
DeepMind’s AlphaDev, a deep reinforcement learning agent, has discovered faster sorting algorithms, outperforming previously known human benchmarks.
- The sort routine is called up to a trillion times a day worldwide
- These newly discovered algorithms have already been integrated into the LLVM standard C++ sort library.
Example: Competing in the International Mathematical Olympiad
https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
- AlphaProof achieved a silver medal using a new reinforcement-learning-based system for formal mathematical reasoning (see link above).
Evolution of RL in the Era of Experience

Figure: David Silver & Rich Sutton, The Era of Experience, 2025
Rewards
- A reward \(R_t\) is a scalar feedback signal.
- It indicates how well the agent is doing at step \(t\)
- The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis):
All goals can be described by the maximisation of expected cumulative reward.
Example of Rewards
- Make a humanoid robot walk
- -ve reward for falling
- +ve reward for forward motion
- Optimise sort routine in an operating system
- -ve reward for execution time
- +ve reward for throughput
- Control plasma in a stellarator in a fusion power station
- +ve reward for containment of plasma
- -ve reward for plasma crashing
- Control inventory in a warehouse
- -ve reward for stock-out penalty (lost sales)
- -ve reward for holding costs (inventory)
- +ve reward for sales revenue
Sequential Decision Making
Goal: select actions to maximise total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Examples:
A financial investment (may take months to mature)
Re-stocking a warehouse (might prevent stock-outs in days or weeks)
Agent

Agent and Environment

At each step \(t\) the agent:
Executes action \(A_t\)
Receives observation \(O_t\)
Receives scalar reward \(R_t\)
The environment:
Receives action \(A_t\)
Emits observation \(O_{t+1}\)
Emits scalar reward \(R_{t+1}\)
\(t\) increments at env. step
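A minimal sketch of this loop, assuming a hypothetical `env` object (with `reset()` and `step(action)`) and an `agent` object (with `act(observation, reward)`):

```python
# Sketch of the agent-environment loop: the agent emits A_t, the environment
# responds with O_{t+1} and R_{t+1}, and time increments on the environment step.

def run_episode(agent, env, max_steps=1000):
    observation, reward = env.reset(), 0.0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation, reward)          # choose A_t from O_t, R_t
        observation, reward, done = env.step(action)     # receive O_{t+1}, R_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward
```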
History and State
The history is the sequence of observations, actions, and rewards
\[ H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t \] - i.e. the stream of a robot’s actions, observations and rewards up to time \(t\)
What happens next depends on the history:
- The agent selects actions, and
- the environment selects observations/rewards.
State is the information used to determine what happens next.
Formally, a state is a function of the history: \(S_t = f(H_t )\)
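One simple choice of \(f\) (used, for example, by frame-stacking Atari agents, and relevant to the rat example below) is to keep only the last \(k\) observations; the helper below is an illustrative sketch:

```python
from collections import deque

# Illustrative agent state S_t = f(H_t): keep only the last k observations.
def make_state_fn(k):
    recent = deque(maxlen=k)            # older history is discarded automatically
    def update(observation):
        recent.append(observation)
        return tuple(recent)            # the agent state after this observation
    return update

f = make_state_fn(3)
f("light"); f("bell")
print(f("lever"))                       # ('light', 'bell', 'lever')
```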
Environment State

The environment state \(S^e_t\) is the environment’s private representation
- i.e. the data the environment uses to pick the next observation/reward
The environment state is not usually visible to the agent
- Even if \(S^e_t\) is visible, it may contain irrelevant info
Agent State

The agent state \(S^a_t\) is the agent’s internal representation
i.e. information the agent uses to pick the next action
i.e. information used by reinforcement learning algorithms
It can be any function of history: \(S^a_t = f(H_t)\)
Markov State
A Markov state (a.k.a. Information state) contains all useful information from the history.
Definition: A state \(S_t\) is Markov if and only if \[ \mathbb{P} [S_{t+1} | S_t ] = \mathbb{P} [S_{t+1}\ |\ S_1, \ldots, S_t ] \] “The future is independent of the past given the present” \[ H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty} \]
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
The environment state \(S^e_t\) is Markov
The history \(H_t\) is Markov
Rat Example

What if agent state = last \(3\) items in sequence?
What if agent state = counts for lights, bells and levers?
What if agent state = complete sequence?
Components of an RL Agent
An RL agent may include one or more of these components:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
Policy
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: \(a = \pi(s)\)
Stochastic policy: \(\pi(a|s) = \mathbb{P}[A_t = a|S_t = s]\)
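Both kinds of policy are easy to represent directly; the two-state, two-action example below is purely illustrative:

```python
import random

# Deterministic policy a = pi(s): a simple state -> action lookup table.
deterministic_pi = {"s1": "left", "s2": "right"}

# Stochastic policy pi(a|s) = P[A_t = a | S_t = s]: a distribution over actions per state.
stochastic_pi = {
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.2, "right": 0.8},
}

def sample_action(pi, s):
    """Sample A_t ~ pi(.|s) from a stochastic policy."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_pi["s1"], sample_action(stochastic_pi, "s1"))
```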
Value Function
A value function is a prediction of future reward
Used to evaluate the goodness/badness of states, and
therefore to select between actions, e.g.
\[ v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots | S_t = s] \]
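To unpack the expectation: each episode starting in \(s\) yields a sampled discounted return, and averaging those returns gives a simple (Monte Carlo) estimate of \(v_{\pi}(s)\). The reward sequences below are made up for illustration:

```python
# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):     # backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# v_pi(s) estimated by averaging returns over episodes that started in s.
episodes_from_s = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]   # hypothetical reward sequences
v_estimate = sum(discounted_return(e) for e in episodes_from_s) / len(episodes_from_s)
print(round(v_estimate, 3))
```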
Model
A model predicts what the environment will do next
\(\mathcal{P}\) predicts the probability of the next state
\(\mathcal{R}\) predicts the expectation of the next reward, e.g.
\[ \mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a] \]
\[ \mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a] \]
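In the tabular case such a model is just two lookup tables; the toy two-state example below is illustrative only:

```python
# Tabular model: P[(s, a)] is a distribution over next states s',
# and R[(s, a)] is the expected immediate reward.
P = {
    ("s1", "go"): {"s1": 0.2, "s2": 0.8},
    ("s2", "go"): {"s2": 1.0},
}
R = {
    ("s1", "go"): -1.0,
    ("s2", "go"): 0.0,
}

# E.g. P^go_{s1 s2} = 0.8 and R^go_{s1} = -1.0
print(P[("s1", "go")]["s2"], R[("s1", "go")])
```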
Maze Example

- Rewards: -1 per time-step
- Actions: N, E, S, W
- States: Agent’s location
Maze Example: Policy

- Arrows represent policy \(\pi(s)\) for each state \(s\)
Maze Example: Value Function

- Numbers represent value \(v_{\pi}(s)\) of each state \(s\)
Maze Example: Model

The agent may have an internal model of the environment
Dynamics: how actions change the state
Rewards: how much reward from each state
The model may be imperfect
Grid layout represents transition model \(\mathcal{P}^a_{ss'}\)
Numbers represent immediate reward \(\mathcal{R}^a_s\) from each state \(s\) (same for all \(a\))
Categorizing RL agents
Value Based:
- No Policy (implicit)
- Value Function

Policy Based:
- Policy
- No Value Function

Actor Critic:
- Policy
- Value Function

Model Free:
- Policy and/or Value Function
- No Model

Model Based:
- Policy and/or Value Function
- Model
Taxonomy

Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy…
…from its experiences of the environment…
…without losing too much reward along the way.
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
Examples
- Restaurant Selection
  - Exploitation: Go to your favourite restaurant
  - Exploration: Try a new restaurant
- Online Banner Advertisements
  - Exploitation: Show the most successful advert
  - Exploration: Show a different advert
- Gold Exploration
  - Exploitation: Drill core samples at the best known location
  - Exploration: Drill at a new location
- Game Playing
  - Exploitation: Play the move you believe is best
  - Exploration: Play an experimental move
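Taking the restaurant example, one common and simple way to trade off exploration and exploitation is \(\epsilon\)-greedy action selection; the value estimates below are made up, and this is a sketch of a standard technique rather than anything specific to this module's later algorithms:

```python
import random

# Epsilon-greedy: exploit the best-known action most of the time,
# but explore a uniformly random action with probability epsilon.
def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping each action to the current estimate of its value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))       # explore
    return max(q_values, key=q_values.get)         # exploit

q = {"favourite_restaurant": 8.0, "new_restaurant": 5.0}
print(epsilon_greedy(q, epsilon=0.2))
```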
Prediction and Control
Prediction: evaluate the future
- Given a policy, evaluate how good it is
Control: optimise the future
- find the best policy
We need to solve the prediction problem in order to solve the control problem
- i.e. we need to be able to evaluate our policies in order to work out which one is the best
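As a rough illustration of how prediction supports control: the sketch below evaluates each candidate policy by averaging sampled returns (prediction), then keeps the best one (a naive form of control). The policies and return distributions are hypothetical placeholders:

```python
import random

# Prediction: estimate the value of a fixed policy by averaging sampled returns.
def evaluate(policy, sample_return, n_episodes=100):
    return sum(sample_return(policy) for _ in range(n_episodes)) / n_episodes

# Control (naive): evaluate every candidate policy and keep the best one.
def best_policy(policies, sample_return):
    return max(policies, key=lambda pi: evaluate(pi, sample_return))

# Hypothetical stand-in for "run one episode under pi and return its total reward".
def sample_return(pi):
    return random.gauss({"cautious": 1.0, "greedy": 2.0}[pi], 1.0)

print(best_policy(["cautious", "greedy"], sample_return))
```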