A machine learning paradigm in which agents learn to make sequential decisions in an uncertain environment, guided by rewards and penalties.
Two fundamental problems in sequential decision making:
Planning (first half of COMP90054):
A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy through search, deliberation, reasoning, and introspection
Reinforcement Learning (this half of COMP90054):
The environment is initially unknown
The agent interacts with the environment
The agent improves its policy
Rules of the game are known
Can query emulator (simulator)
If I take action \(a\) from state \(s\):
what would the next state \(s'\) be?
what would the score be?
Plan ahead to find an optimal policy, e.g. heuristic tree search, novelty-based search, etc.


Rules of the game are unknown
Pick joystick actions, only see pixels & scores
Learn directly from interactive game-play
See appendix for more details
| | ☀️ Sunny | 🌧️ Raining |
|---|---|---|
| Bring ☂️ | 😒 | 🙂 |
| Don’t bring ☂️ | 😀 | 😭 |
Exp(☂️) = P(☀️)·U(😒) + P(🌧️)·U(🙂)
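A minimal worked sketch of this expected-utility comparison, with made-up probabilities and utility values (none of these numbers come from the slides):

```python
# Hypothetical weather probabilities and outcome utilities -- illustrative only.
P = {"sunny": 0.7, "raining": 0.3}
U = {
    ("bring", "sunny"): 2,     # 😒 carried the umbrella for nothing
    ("bring", "raining"): 5,   # 🙂 stayed dry
    ("leave", "sunny"): 8,     # 😀 unencumbered in the sun
    ("leave", "raining"): -10, # 😭 soaked
}

def expected_utility(action):
    """Exp(action) = sum over weather w of P(w) * U(action, w)."""
    return sum(P[w] * U[(action, w)] for w in P)

for action in ("bring", "leave"):
    print(action, expected_utility(action))
```

With these particular numbers, bringing the umbrella has expected utility 0.7·2 + 0.3·5 = 2.9, versus 0.7·8 + 0.3·(−10) = 2.6 for leaving it, so the cautious choice wins by a small margin.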
Goals are characterised as the maximisation of expected cumulative reward.
Goal: select actions to maximise total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Examples:
A financial investment (may take months to mature)
A game-playing agent for a game scored at the end of many rounds of play
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy…
…from its experiences of the environment…
…without losing too much reward along the way.
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
Restaurant Selection
Exploitation Go to your favourite restaurant
Exploration Try a new restaurant
Online Banner Advertisements
Exploitation Show the most successful advert
Exploration Show a different advert
Gold exploration
Exploitation Drill core samples at the best known location
Exploration Drill at a new location
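A standard way to balance the two is ε-greedy selection: exploit the best-known option most of the time, but explore a random one with small probability ε. A minimal sketch with invented value estimates (the restaurant names and numbers are placeholders):

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Return a key of `values`: a random one with probability epsilon
    (exploration), otherwise the one with the highest estimate (exploitation)."""
    if random.random() < epsilon:
        return random.choice(list(values))
    return max(values, key=values.get)

# Hypothetical estimated enjoyment of each restaurant.
estimates = {"favourite": 8.2, "new_thai": 6.0, "new_cafe": 5.5}
print(epsilon_greedy(estimates, epsilon=0.2))
```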
Set of states \(S\)
Initial state \(s_0\)
Actions \(A(s)\)
Transition function \(s' = f(a, s)\)
Goals \(S_G \subseteq S\)
Action costs \(c(a, s)\)
Set of states \(S\)
Initial state \(s_0\)
Actions \(A(s)\)
Transition probabilities \(P_a(s' \mid s)\)
Reward function \(r(s, a, s')\): the (positive or negative) reward received when transitioning from state \(s\) to state \(s'\) via action \(a\)
Discount factor \(0 \leq \gamma \leq 1\)
Definition
A Markov Decision Process is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\)
\(\mathcal{S}\) is a finite set of states
\(\mathcal{A}\) is a finite set of actions
\(\mathcal{P}\) is a state transition probability matrix, \(P^{a}_{s,s'} = \mathbb{P} [S_{t+1} = s' | S_t = s, A_t = a]\)
\(\mathcal{R}\) is a reward function, \(\mathcal{R}^{a}_{s, s'} = \mathbb{E}[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s' ]\)
\(\gamma\) is a discount factor \(\gamma \in [0,1]\).
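As a concrete illustration, the tuple can be written down directly as data. The two-state MDP below is invented for this sketch (it is not the Grid World example used elsewhere):

```python
# A toy MDP <S, A, P, R, gamma> -- invented example.
S = ["s0", "s1"]
A = ["stay", "move"]

# P[a][s][s2]: probability of reaching s2 after taking a in s.
P = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s][s2]: expected reward for the transition (s, a, s2).
R = {
    "stay": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 0.0}},
    "move": {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
}

gamma = 0.9
```

Each slice `P[a][s]` sums to 1, matching the requirement that \(\mathcal{P}\) is a transition probability matrix.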

An episode, history, or trajectory through an MDP is the sequence of states and actions that occur as an agent traverses an MDP.
Example:

state = (1, 1), (try to) go up,
state = (2, 1), go right,
state = (3, 1), go up,
state = (3, 2), (try to) go up
state = (4, 2) TERMINAL
Definition: Return
The return \(G_t\) is the total discounted reward from time-step \(t\). \[ G_t \;=\; R_{t+1} \;+\; \gamma R_{t+2} \;+\; \dots \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]
The discount \(\gamma \in [0, 1]\) is the present value of future rewards
The value of receiving reward \(R\) after \(k+1\) time-steps is \(\gamma^k R\).
This values immediate reward above delayed reward
\(\gamma\) close to \(0\) leads to “myopic” evaluation
\(\gamma\) close to \(1\) leads to “far-sighted” evaluation
Why are Markov reward/decision processes often discounted?
Avoids infinite returns for infinite (or arbitrarily long) trajectories
Reflects uncertainty about the future when anticipating outcomes
Expresses if and how we value efficiency/speed when ranking solutions
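The return definition translates directly into code; the reward sequence below is made up purely to show the effect of the discount factor:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 0, 0, 10]                       # hypothetical R_{t+1}, R_{t+2}, ...
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, gamma=0.1))  # myopic: the late reward barely counts
```

With γ = 0.9 the delayed reward of 10 still dominates the return; with γ = 0.1 it contributes only 0.01, so the evaluation is effectively myopic.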

At each step \(t\) the agent:
Executes action \(A_t\)
Receives observation \(O_t\)
Receives scalar reward \(R_t\)
The environment:
Receives action \(A_t\)
Emits observation \(O_{t+1}\)
Emits scalar reward \(R_{t+1}\)
\(t\) increments at env. step
The history is the sequence of observations, actions, and rewards
\[ H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t \]
i.e. the stream of a robot’s actions, observations and rewards up to time \(t\)
What happens next depends on the history.
State is the information used to determine what happens next.
Formally, a state is a function of the history: \(S_t = f(H_t )\)

The environment state \(S^e_t\) is the environment’s private representation
The environment state is not usually visible to the agent

The agent state \(S^a_t\) is the agent’s internal representation
i.e. information the agent uses to pick the next action
i.e. information used by reinforcement learning algorithms
It can be any function of history: \(S^a_t = f(H_t)\)
A Markov state (a.k.a. Information state) contains all useful information from the history.
Definition: A state \(S_t\) is Markov if and only if \[ \mathbb{P} [S_{t+1} | S_t ] = \mathbb{P} [S_{t+1}\ |\ S_1, \ldots, S_t ] \] “The future is independent of the past given the present” \[ H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty} \]
What if agent state = last \(n\) items in sequence?
What if agent state = counts for lights, bells and levers?
A “solution” to an MDP, i.e. an RL agent, may include one or more of these components, depending on the method used:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: \(a = \pi(s)\)
Stochastic policy: \(\pi(a|s) = \mathbb{P}[A_t = a|S_t = s]\)
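For a small discrete state space, both kinds of policy can be stored as simple tables; the states, actions, and probabilities below are placeholders:

```python
import random

# Deterministic policy: a = pi(s), a plain lookup table.
pi_det = {"s0": "move", "s1": "stay"}

# Stochastic policy: pi(a|s), a distribution over actions for each state.
pi_stoch = {
    "s0": {"move": 0.9, "stay": 0.1},
    "s1": {"move": 0.3, "stay": 0.7},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi_det["s0"], sample_action(pi_stoch, "s0"))
```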

Value function is a prediction of future reward
Used to evaluate the goodness/badness of states, and
therefore to select between actions, e.g.
\[ V_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots | S_t = s] \]
Key component of policy and state evaluation in most RL methods
\[ V(s) = \overbrace{\max_{a \in A(s)}}^{\text{best action from $s$}} \overbrace{\underbrace{\sum_{s' \in S}}_{\text{for every state}} P_a(s' \mid s) [\underbrace{r(s,a,s')}_{\text{immediate reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{V(s')}_{\text{value of } s'}]}^{\text{expected reward of executing action $a$ in state $s$}} \]
Idea: the value of a state (or action) can be expressed as an estimate of the short-term reward associated with the action anticipated in that state, plus a discounted estimate of the value of the anticipated successor state (itself computed the same way, recursively).
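This recursion can be applied repeatedly as an update, which gives value iteration. The sketch below reuses the toy `S`, `A`, `P`, `R`, `gamma` dictionaries from the MDP example above, so it is illustrative rather than a reference implementation:

```python
def value_iteration(S, A, P, R, gamma, iterations=100):
    """Repeatedly apply the Bellman backup
    V(s) = max_a sum_s' P_a(s'|s) * [r(s,a,s') + gamma * V(s')]."""
    V = {s: 0.0 for s in S}
    for _ in range(iterations):
        V = {
            s: max(
                sum(P[a][s][s2] * (R[a][s][s2] + gamma * V[s2]) for s2 in S)
                for a in A
            )
            for s in S
        }
    return V

print(value_iteration(S, A, P, R, gamma))  # values of the toy MDP defined earlier
```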
A model predicts what the environment will do next
\(\mathcal{P}\) predicts the probability of the next state
\(\mathcal{R}\) predicts the expectation of the next reward, e.g.
\[ \mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a] \]
\[ \mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a] \]
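When \(\mathcal{P}\) and \(\mathcal{R}\) are not given, a model-based agent can estimate them from observed transitions. A minimal count-based sketch (the recorded transitions are hypothetical):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s2: count}
reward_sum = defaultdict(float)                 # (s, a) -> total observed reward
visits = defaultdict(int)                       # (s, a) -> number of visits

def update_model(s, a, r, s2):
    """Record one observed transition (s, a, r, s2)."""
    counts[(s, a)][s2] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def P_hat(s, a, s2):
    """Estimate of P[S_{t+1} = s2 | S_t = s, A_t = a]."""
    return counts[(s, a)][s2] / visits[(s, a)]

def R_hat(s, a):
    """Estimate of E[R_{t+1} | S_t = s, A_t = a]."""
    return reward_sum[(s, a)] / visits[(s, a)]

update_model("s0", "move", 1.0, "s1")
update_model("s0", "move", 0.0, "s0")
print(P_hat("s0", "move", "s1"), R_hat("s0", "move"))  # 0.5 0.5
```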
Value Based: No Policy (Implicit); Value Function
Policy Based: Policy; No Value Function
Actor Critic: Policy; Value Function
Model Free: Policy and/or Value Function; No Model
Model Based: Policy and/or Value Function; Model