Last Module (5):
Integrating learning and planning
Use planning to construct a value function or policy
This Module (6):
Model-free prediction
Prediction: Estimate the value function of an unknown MDP
MC methods learn directly from episodes of experience
MC learns from complete episodes
MC uses the simplest possible idea of looking at sample returns: value = mean return
\[ S_1, A_1, R_2, \ldots, S_T \sim \pi \]
\[ G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T \]
\[ v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] \]
Monte-Carlo policy evaluation uses empirical mean return instead of expected return
Computes the empirical mean of the return from time \(t\) onwards, using as many sampled episodes as possible
The return will be different for every time step and every episode
To evaluate state \(s\) (first-visit MC):
The first time-step \(t\) that \(s\) is visited in an episode, increment counter \(N(s) \leftarrow N(s) + 1\) and total return \(S(s) \leftarrow S(s) + G_t\)
Value is estimated by the mean return, \(V(s) = S(s)/N(s)\)
By Law of Large Numbers, \(V(s) \to v_\pi(s)\) as \(N(s) \to \infty\)
The Central Limit Theorem tells us how quickly the estimate approaches this value
The variance (mean squared error) of the estimator shrinks as \(\frac{1}{N}\)
i.e. the rate is independent of the size of the state space, \(|\mathcal{S}|\)
speed depends on how many episodes/visits reach \(s\) (coverage probabilities).
To evaluate state \(s\) (every-visit MC):
Every time-step \(t\) that \(s\) is visited in an episode, increment counter \(N(s) \leftarrow N(s) + 1\) and total return \(S(s) \leftarrow S(s) + G_t\)
\[ V(s) = \frac{S(s)}{N(s)} \]
Again, \(V(s) \to v_\pi(s)\) as \(N(s) \to \infty\).
Every-Visit Advantages:
Especially good when episodes are short or when states are rarely visited — no sample gets “wasted.”
i.e. uses more of the data collected, often faster convergence in practice.
First-visit Advantages:
Useful when episodes are long and states repeated many times.
i.e. avoids dependence between multiple visits to the same state in one episode.
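As a concrete illustration, a minimal tabular sketch covering both variants, assuming each episode is given as a list of (state, reward) pairs where the reward is received on leaving that state (the function name and episode encoding are illustrative):

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Tabular MC policy evaluation. Each episode is a list of
    (state, reward) pairs; reward is received on leaving that state."""
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # total returns S(s)
    for episode in episodes:
        # Compute G_t for every step by scanning the episode backwards.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if first_visit and state in seen:
                continue  # first-visit: count only the first occurrence
            seen.add(state)
            N[state] += 1
            S[state] += G_t
    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```

Flipping `first_visit` is the only difference between the two estimators; both converge to \(v_\pi(s)\) as \(N(s) \to \infty\).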
States (\(\sim 200\)):
Current sum of cards (\(12-21\))
Dealer’s showing card (Ace\(-10\))
Whether you have a usable ace, i.e. an ace that can be counted as \(11\) without busting (going over \(21\)) (yes/no)
Actions:
Stand: stop receiving cards (and terminate)
Hit: take another card (no replacement)
Transitions:
Automatically hit if sum of cards \(< 12\) (smaller sums never require a decision)
Rewards:
For stand: \(+1\) if your sum \(>\) dealer’s; \(0\) if equal; \(−1\) if less
For hit: \(−1\) if your sum \(> 21\) (and terminate); \(0\) otherwise
Policy: stand if sum of cards ≥ 20, otherwise hit
Learning value function directly from experience
Usable ace value is noisier because the state is rarer
Key point: the value function is learned purely from sampled episodes of experience; the transition dynamics of the Blackjack MDP are never needed
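For concreteness, a minimal episode simulator consistent with the rules above (function names are illustrative; the dealer's fixed hit-until-17 rule is a standard Blackjack assumption not stated above, and ace handling is simplified):

```python
import random

def draw():
    """Draw from an infinite deck; face cards count as 10, ace as 1."""
    return min(random.randint(1, 13), 10)

def sample_episode():
    """One episode under the fixed policy above (hit below 20,
    stand on 20 or 21). Returns visited states and terminal reward."""
    dealer_card = draw()
    player_sum, usable_ace, states = 0, False, []
    while True:
        if player_sum >= 12:  # sums below 12 are auto-hit, not states
            states.append((player_sum, dealer_card, usable_ace))
        if player_sum >= 20:
            break  # stand
        card = draw()
        player_sum += card
        if card == 1 and player_sum + 10 <= 21:
            player_sum += 10  # count this ace as 11: a usable ace
            usable_ace = True
        if player_sum > 21:
            if usable_ace:    # fall back to counting the ace as 1
                player_sum -= 10
                usable_ace = False
            else:
                return states, -1  # bust: reward -1 and terminate
    dealer_sum = dealer_card
    while dealer_sum < 17:    # dealer hits until 17 (standard rule)
        dealer_sum += draw()
    if dealer_sum > 21 or player_sum > dealer_sum:
        return states, +1
    return states, 0 if player_sum == dealer_sum else -1
```

With \(\gamma = 1\) and no intermediate rewards, the return for every visited state equals the terminal reward, so MC evaluation of this policy reduces to averaging game outcomes per state.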
The means \(\mu_1, \mu_2, \dots\) of a sequence \(x_1, x_2, \dots\) can be computed incrementally,
\[\begin{align*} \mu_k & = \frac{1}{k} \sum_{j=1}^k x_j\\[2pt] & = \frac{1}{k}\left(x_k + \sum_{j=1}^{k-1} x_j\right)\\[2pt] & = \frac{1}{k}(x_k + (k - 1) \mu_{k-1})\\[2pt] & = \mu_{k-1} + \frac{1}{k}({\color{red}{x_k - \mu_{k-1}}}) \end{align*}\]
\({\color{red}{\mu_{k-1}}}\) is the previous mean: it predicts what we think the value will be
\({{\color{red}{x_k}}}\) is the new value
Incrementally corrects the mean by step size \(\frac{1}{k}\) in the direction of the error \({\color{red}{x_k - \mu_{k-1}}}\)
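The derivation translates into a one-line update; a quick sketch:

```python
def incremental_mean(xs):
    """Running mean: mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # correct the old mean in the error direction
    return mu

assert incremental_mean([1, 2, 3, 4]) == 2.5
```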
Update \(V(s)\) incrementally after each episode \(S_1, A_1, R_2, \dots, S_T\):
\[ \begin{aligned} N(S_t) &\leftarrow N(S_t) + 1 \\ V(S_t) &\leftarrow V(S_t) + \frac{1}{N(S_t)}\,({\color{red}{G_t - V(S_t)}}) \end{aligned} \]
In non-stationary problems, it can be useful to track a running mean by forgetting old episodes using a constant step size \(\alpha\):
\[ V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t) ) \]
A constant step size turns the MC estimate into an exponentially weighted moving average of past returns
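Both step-size choices fit the same in-place update; a sketch assuming episodes are lists of (state, reward) pairs as before (`mc_update` is an illustrative name):

```python
from collections import defaultdict

def mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Update V in place after one episode of (state, reward) pairs.
    alpha=None uses the exact running mean (step size 1/N(s));
    a constant alpha gives the forgetting, non-stationary variant."""
    G = 0.0
    for state, reward in reversed(episode):  # every-visit, backwards
        G = reward + gamma * G               # return from this step on
        N[state] += 1
        step = 1.0 / N[state] if alpha is None else alpha
        V[state] += step * (G - V[state])    # V <- V + step * (G - V)

V, N = defaultdict(float), defaultdict(int)  # then call once per episode
```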
In general, we prefer such non-stationary estimators here because the policy we are evaluating is continuously improving
In summary, in Monte-Carlo learning we estimate value functions by averaging complete sampled returns
TD methods learn directly from episodes of experience
TD learns from incomplete episodes, by bootstrapping
It substitutes, or bootstraps, the remainder of the trajectory with an estimate of what will happen, instead of waiting for the full return
i.e. TD updates one guess with a subsequent guess
Goal: learn \(v_\pi\) online from experience under policy \(\pi\)
Incremental every-visit Monte-Carlo:
Update value \(V(S_t)\) toward actual return \({\color{red}{G_t}}\)
\[ V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}{G_t}}- V(S_t) \right) \]
Simplest temporal-difference learning algorithm: TD(0):
Update value \(V(S_t)\) toward estimated return \({\color{red}{R_{t+1} + \gamma V (S_{t+1})}}\) (like Bellman equation)
\[ V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}{R_{t+1} + \gamma V(S_{t+1})}} - V(S_t) \right) \]
\(R_{t+1} + \gamma V(S_{t+1})\) is called the TD target (the value we move towards)
\(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is called the TD error
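A minimal TD(0) sketch; the `env.reset()`/`env.step()`/`policy(state)` interfaces are assumptions for illustration, not part of the material above:

```python
def td0_episode(V, env, policy, alpha=0.1, gamma=1.0):
    """One episode of TD(0), updating V online at every step."""
    state, done = env.reset(), False
    while not done:
        next_state, reward, done = env.step(policy(state))
        # Bootstrap from the current estimate; terminal value is 0.
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])  # V <- V + alpha * delta
        state = next_state
```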
| State | Elapsed Time (min) | Predicted Time to Go | Predicted Total Time |
|---|---|---|---|
| leaving office | 0 | 30 | 30 |
| reach car, raining | 5 | 35 | 40 |
| exit highway | 20 | 15 | 35 |
| behind truck | 30 | 10 | 40 |
| home street | 40 | 3 | 43 |
| arrive home | 43 | 0 | 43 |
Changes recommended by Monte Carlo methods \((\alpha=1)\):
Changes recommended by TD methods \((\alpha=1)\):
Red arrows represent the updates recommended by MC and TD respectively
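Working the table through for \(\alpha = 1\) reproduces the two sets of arrows numerically; here the value is the predicted time-to-go, and each step's elapsed time plays the role of the reward:

```python
# (state, elapsed minutes, predicted total) rows from the table above
rows = [("leaving office", 0, 30), ("reach car, raining", 5, 40),
        ("exit highway", 20, 35), ("behind truck", 30, 40),
        ("home street", 40, 43), ("arrive home", 43, 43)]

actual_total = rows[-1][2]  # the journey actually took 43 minutes
for (s, t, pred), (_, t2, pred2) in zip(rows, rows[1:]):
    v = pred - t                          # predicted time-to-go now
    mc_error = (actual_total - t) - v     # wait for the real outcome
    # TD target: one step of real elapsed time + next prediction-to-go
    td_error = (t2 - t) + (pred2 - t2) - v
    print(f"{s:20s} MC: {mc_error:+d}  TD: {td_error:+d}")
```

MC shifts every prediction toward the final outcome of 43 minutes, while TD shifts each prediction toward the very next prediction.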
TD can learn before knowing the final outcome
TD can learn without the final outcome
Return \(G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T\) is an unbiased estimate of \(v_\pi(S_t)\).
True TD target \(R_{t+1} + {\color{blue}{\gamma\, v_\pi(S_{t+1})}}\) is an unbiased estimate of \(v_\pi(S_t)\).
TD target \(R_{t+1} + \gamma\, V(S_{t+1})\) is a biased estimate of \(v_\pi(S_t)\).
TD target has much lower variance than the return, since
Return depends on many random actions, transitions, rewards.
TD target depends on one random action, transition, reward.
MC has high variance, zero bias
Good convergence properties (even with function approximation)
Not very sensitive to initial value
Very simple to understand and use
TD has low variance, some bias
Usually more efficient than MC
TD(0) converges to \(v_{\pi}(s)\) (but not always with function approximation)
More sensitive to initial value
Random Walk example with a uniform random policy: move left with probability \(0.5\), right with probability \(0.5\)
MC and TD both converge in the limit
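A self-contained sketch of this example, assuming the standard setup: five non-terminal states with the start in the centre, reward \(+1\) only on exiting right, \(\gamma = 1\), and true values \(1/6, \dots, 5/6\):

```python
import random

# Five non-terminal states, start in the centre; each step moves
# left or right with probability 0.5. Exiting right gives reward +1.
N_STATES, START = 5, 2
TRUE_V = [(i + 1) / 6 for i in range(N_STATES)]  # known true values

def episode():
    """Generate one episode as a list of (state, reward, next_state)."""
    s, steps = START, []
    while 0 <= s < N_STATES:
        s2 = s + random.choice([-1, 1])
        steps.append((s, 1.0 if s2 == N_STATES else 0.0, s2))
        s = s2
    return steps

def td0(n_episodes, alpha=0.1):
    """TD(0) with gamma = 1; terminal states have value 0."""
    V = [0.5] * N_STATES
    for _ in range(n_episodes):
        for s, r, s2 in episode():
            target = r + (V[s2] if 0 <= s2 < N_STATES else 0.0)
            V[s] += alpha * (target - V[s])
    return V

V = td0(1000)
rms = (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / N_STATES) ** 0.5
print(V, rms)  # V approaches TRUE_V; RMS error shrinks with experience
```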
What about a batch solution for finite experience, i.e. \(K\) finite episodes?
\[ s^1_1, a^1_1, r^1_2, \cdots s^1_{T_1} \]
\[ \vdots \]
\[ s^K_1, a^K_1, r^K_2, \cdots s^K_{T_K} \]
In batch mode you repeatedly sample episode \(k \in [1, K]\) and apply MC or TD(0) to episode \(k\)
Two states \(A, B\); no discounting; 8 episodes:
\(A, 0, B, 0\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 1\)
\(B, 0\)
Clearly, \(V(B) = \frac{6}{8} = 0.75\), but what about \(V(A)\)? MC averages the observed returns from \(A\): the single episode starting at \(A\) returned \(0\), so \(V(A) = 0\). TD(0) bootstraps: \(A\) always transitions to \(B\) with reward \(0\), so \(V(A) = V(B) = 0.75\).
MC converges to solution with minimum mean-squared error
\[ \sum_{k=1}^K \sum_{t=1}^{T_k} \left( G_t^k - V(s_t^k) \right)^2 \]
TD(\(0\)) converges to the solution of the maximum-likelihood Markov model that best explains the data
\[ \hat{\mathcal{P}}_{s,s'}^a \;=\; \frac{1}{N(s,a)} \sum_{k=1}^K \sum_{t=1}^{T_k} \mathbf{1}\!\left( s_t^k, a_t^k, s_{t+1}^k = s, a, s' \right) \]
\[ \hat{\mathcal{R}}_s^a \;=\; \frac{1}{N(s,a)} \sum_{k=1}^K \sum_{t=1}^{T_k} \mathbf{1}\!\left( s_t^k, a_t^k = s, a \right) r_t^k \]
TD exploits the Markov property
Usually more efficient in Markov environments
MC does not exploit the Markov property
Usually more effective in non-Markov environments
Note that partial observability and non-stationarity are reasons an environment can be non-Markov
\[ V(S_t) \;\leftarrow\; V(S_t) + \alpha \big( G_t - V(S_t) \big) \]
Starting at one state, sample one complete trajectory to update the value function
\[ V(S_t) \;\leftarrow\; V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big) \]
In TD, the backup is over just one step
\[ V(S_t) \;\leftarrow\; \mathbb{E}_\pi \!\left[ R_{t+1} + \gamma V(S_{t+1}) \right] \]
If we know the dynamics of the environment, we can do search and a complete backup over the tree
Bootstrapping: update involves an estimate
MC does not bootstrap
Tree Search (with heuristic search) or dynamic programming bootstraps
TD bootstraps
Sampling: update samples an expectation
MC samples
Tree Search does not sample
TD samples
In its simplest form, dynamic programming only looks ahead one level (a full-width, one-step backup).
In practice, dynamic programming is used during tree search, similar to classical planning.
Let TD target look \(n\) steps into the future
Consider the following \(n\)-step returns for \(n=1, 2, \infty\): \[\begin{align*} \color{red}{n} & \color{red}{= 1}\ \text{(TD(0))} & G_t^{(1)} & = R_{t+1} + \gamma V(S_{t+1})\\[2pt] \color{red}{n} & \color{red}{= 2} & G_t^{(2)} & = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})\\[2pt] \color{red}{\vdots} & & \vdots & \\[2pt] \color{red}{n} & \color{red}{= \infty}\ \text{(MC)} & G_t^{(\infty)} & = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \end{align*}\]
Define the \(n\)-step Return (real reward \(+\) estimated reward, \({\color{blue}{\gamma^n V(S_{t+n})}}\)):
\[ G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + {\color{blue}{\gamma^n V(S_{t+n})}} \]
\(n\)-step temporal-difference learning (updated in direction of error)
\[ V(S_t) \;\leftarrow\; V(S_t) + \alpha \big({\color{red}{G_t^{(n)} - V(S_t)}} \big) \]
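A small helper sketch for the \(n\)-step return, assuming `states[k]` holds \(S_k\) (including the terminal state) and `rewards[k]` holds \(R_{k+1}\) for one complete episode:

```python
def n_step_return(rewards, V, states, t, n, gamma=1.0):
    """G_t^(n): n real rewards, then bootstrap with gamma^n V(S_{t+n}).
    rewards[k] holds R_{k+1}; states[k] holds S_k (incl. terminal S_T).
    If t + n reaches the end of the episode this is the full MC return."""
    T = len(rewards)                     # episode terminates at time T
    G = sum(gamma**k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:                        # not yet terminal: bootstrap
        G += gamma**n * V[states[t + n]]
    return G
```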
MC uses the full sequence of real rewards, with no bootstrapping
So which \(n\) is the best?
Note that RMS errors also vary according to whether learning is on-line or off-line updates (not shown here)
We can form mixtures of different \(n\):
e.g. average of \(2\)-step and \(4\)-step returns:
\[ \frac{1}{2} G_t^{(2)} + \frac{1}{2} G_t^{(4)} \]
We can average \(n\)-step returns over different \(n\)
Combines information from two different time-steps
Can we efficiently combine information from all time-steps to be more robust?
The \(\lambda\)-return \(G_t^{\lambda}\) combines all \(n\)-step returns \(G_t^{(n)}\)
\[ G_t^{\lambda} \;=\; {\color{red}{(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}}} G_t^{(n)} \]
Forward-view TD(\(\lambda\))
\[ V(S_t) \;\leftarrow\; V(S_t) + \alpha \big( G_t^{\lambda} - V(S_t) \big) \]
\[ G_t^{\lambda} \;=\; (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)} \]
Note that the geometric weighting is memoryless, which will let us compute TD(\(\lambda\)) at the same cost as TD(0)
However, the formulation presented so far is a forward view
Update value function towards the \(\lambda\)-return
Forward-view looks into the future to compute \(G^{\lambda}_t\)
Like MC, can only be computed from complete episodes
We will see shortly how an iterative algorithm achieves the forward view without having to wait until the future
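Reusing the `n_step_return` helper sketched above, a forward-view computation of \(G_t^{\lambda}\) for a finished episode; note how the tail weight \(\lambda^{T-t-1}\) falls on the full Monte-Carlo return:

```python
def lambda_return(rewards, V, states, t, lam, gamma=1.0):
    """Forward-view G_t^lambda = (1-lam) sum_n lam^(n-1) G_t^(n),
    for a complete episode of length T = len(rewards)."""
    T = len(rewards)
    G_lam, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):            # all n-step returns before T
        G_lam += weight * n_step_return(rewards, V, states, t, n, gamma)
        weight *= lam
    # Every n >= T - t collapses to the MC return, which therefore
    # receives the residual weight lam^(T - t - 1).
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, V, states,
                                                t, T - t, gamma)
    return G_lam
```

Setting `lam=0` recovers the one-step TD target and `lam=1` recovers the Monte-Carlo return, matching the \(n=1\) and \(n=\infty\) cases above.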
Using TD(\(\lambda\)) and choosing a \(\lambda\) value (left-hand side) is more robust than choosing a single \(n\)-step value (right-hand side)
Forward view provides theory
Backward view provides mechanism
Update online, every step, from incomplete sequences
Credit assignment problem: did bell or light cause shock?
Frequency heuristic: assign credit to most frequent states
Recency heuristic: assign credit to most recent states
Eligibility traces combine both heuristics
\[\begin{align*} E_0(s) & = 0\\[2pt] E_t(s) & = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s) \end{align*}\]
Keep an eligibility trace for every state \(s\)
Update value \(V(s)\) for every state \(s\)
In proportion to TD-error \(\delta_t\) and eligibility trace \(E_t(s)\)
\[ \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \]
\[ V(s) \;\leftarrow\; V(s) + \alpha \,\delta_t E_t(s) \]
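A backward-view sketch with accumulating traces, using the same assumed `env`/`policy` interfaces as the TD(0) sketch above:

```python
from collections import defaultdict

def td_lambda_episode(V, env, policy, lam, alpha=0.1, gamma=1.0):
    """One episode of backward-view TD(lambda), accumulating traces."""
    E = defaultdict(float)         # eligibility traces, zero at start
    state, done = env.reset(), False
    while not done:
        next_state, reward, done = env.step(policy(state))
        delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
        for s in E:                # decay every trace by gamma * lambda
            E[s] *= gamma * lam
        E[state] += 1.0            # then bump the current state's trace
        for s, e in E.items():     # update all states in proportion
            V[s] += alpha * delta * e
        state = next_state
```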
When \(\lambda = 0\), only the current state is updated.
\[E_t(s) = \mathbf{1}(S_t = s)\] \[V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s)\]
This is exactly equivalent to the TD(\(0\)) update.
\[V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t\]
When \(\lambda = 1\), credit is deferred until the end of the episode (TD(1) is roughly equivalent to every-visit MC).
Theorem
The sum of offline updates is identical for forward-view and backward-view TD(\(\lambda\)):
\[ \sum_{t=1}^{T} \alpha\,\delta_t\,E_t(s) \;=\; \sum_{t=1}^{T} \alpha \bigl( G_t^{\lambda} - V(S_t) \bigr)\,\mathbf{1}(S_t = s). \]
Simulation-based search
\(\ldots\) using TD instead of MC (bootstrapping)
MC tree search applies MC control to sub-MDP from now
TD search applies Sarsa to sub-MDP from now
For model-free reinforcement learning, bootstrapping is helpful
TD learning reduces variance but increases bias
TD learning is usually more efficient than MC
TD(\(\lambda\)) can be much more efficient than MC
For simulation-based search, bootstrapping is also helpful
TD search reduces variance but increases bias
TD search is usually more efficient than MC search
TD(\(\lambda\)) search can be much more efficient than MC search
Simulate episodes from the current (real) state \(s_t\)
Estimate action-value function \(Q(s,a)\)
\[ \Delta Q(S,A) = \alpha \big( R + \gamma Q(S',A') - Q(S,A) \big) \]
May also use function approximation for \(Q\), if needed
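As a one-step helper, the update above might look like this for a dict-backed tabular \(Q\) keyed by (state, action) pairs (an illustrative sketch):

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=1.0):
    """One Sarsa update on simulated experience, as used by TD search."""
    td_target = r + gamma * Q[(s2, a2)]           # bootstrap from (S', A')
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])  # move Q toward target
```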
Black dashed line is MCTS
Blue line is Dyna-Q (not covered in this module)
Learning from simulation is an effective method in search