With MC learning, we generate episodes using a bandit strategy and update a \(Q\)-table based on the observed returns after each episode.
Problem 1: inefficient due to high variance in the return \(G\) from episode to episode, so \(Q\) converges slowly.
Problem 2: updates only happen after a complete episode, so information is not usable in “real time” during an episode.
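Throughout, the “bandit strategy” is an exploration-aware action selector over the current \(Q\)-table. As a minimal sketch (the dictionary layout of \(Q\) and the value of `epsilon` are illustrative assumptions, not from the lecture), an \(\epsilon\)-greedy version might look like:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Bandit-style action selection: explore with probability epsilon,
    otherwise exploit the current Q-estimates."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```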
Temporal difference methods replace the actual future return \(G\) from an episode in the update rule with an “informed estimate” of the total expected reward, built from the immediate reward and the current value estimate:
\[ Q(s,a) \leftarrow (1- \alpha)\underbrace{Q(s,a)}_\text{old value} + \overbrace{\alpha}^{\text{learning rate}} \cdot [\underbrace{\overbrace{r}^{\text{reward}} + \overbrace{\gamma}^{\text{discount factor}} \cdot V(s')}_{\text{TD target}}] \]
After selecting action \(a\) with bandit strategy based on \(Q\), observe immediate reward \(r\) and resulting state \(s'\), and update \(Q\) as
\[ Q(s,a) \leftarrow (1- \alpha)\underbrace{Q(s,a)}_\text{old value} + \overbrace{\alpha}^{\text{learning rate}} \cdot [\overbrace{r}^{\text{reward}} + \overbrace{\gamma}^{\text{discount factor}} \cdot \underbrace{\max_{a' \in A(s')}Q(s', a')}_{V(s') \text{ estimate}}] \]
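A minimal sketch of one such Q-learning step on a tabular \(Q\) stored as a dictionary keyed by \((s, a)\); the function name and default hyperparameters are assumptions for illustration:

```python
def q_learning_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.99):
    """One-step Q-learning: bootstrap from max_{a'} Q(s', a') as the V(s') estimate."""
    v_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_next), default=0.0)
    td_target = r + gamma * v_next                     # r + gamma * V(s')
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * td_target
    return Q
```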
Question: How does this change (Q-learning) address the variance problem identified for MC learning?
After selecting action \(a\), observe the immediate reward \(r\) and resulting state \(s'\), and select the next action \(a' \in A(s')\) using the bandit strategy. Update \(Q\) using the value \(Q(s', a')\) of the action actually selected, rather than \(\max_{a' \in A(s')} Q(s', a')\):
\[ Q(s,a) \leftarrow (1- \alpha)\underbrace{Q(s,a)}_\text{old value} + \overbrace{\alpha}^{\text{learning rate}} \cdot [\overbrace{r}^{\text{reward}} + \overbrace{\gamma}^{\text{discount factor}} \cdot \underbrace{Q(s', a')}_{V(s') \text{ estimate}}] \]
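The same step written as SARSA for comparison: the bootstrap term is the value of the action \(a'\) that the bandit strategy actually selected. Again, names and defaults are illustrative assumptions:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step SARSA: bootstrap from Q(s', a') for the action a' actually selected."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)   # r + gamma * Q(s', a')
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * td_target
    return Q
```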
SARSA is on-policy (whereas Q-learning is off-policy), because its update rule depends on the actions actually taken during learning
On-policy methods reinforce \(Q\) according to actual behaviour during training
Off-policy methods separate the goals of learning the optimal values for \(Q\) from considering how \(Q\) is used to generate specific episodes

(Figure: Q-learning vs. SARSA)
Question: How do we choose between using on- or off-policy methods?
Let TD target look \(n\) steps into the future
Consider the following \(n\)-step returns for \(n=1, 2, \infty\): \[\begin{align*} \color{red}{n} & \color{red}{= 1}\ \text{(TD(0))} & G_t^{(1)} & = r_{t+1} + \gamma V(s_{t+1})\\[2pt] \color{red}{n} & \color{red}{= 2} & G_t^{(2)} & = r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})\\[2pt] \color{red}{\vdots} & & \vdots & \\[2pt] \color{red}{n} & \color{red}{= \infty}\ \text{(MC)} & G_t^{(\infty)} & = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T \end{align*}\]
Define the \(n\)-step Return (real reward \(+\) estimated reward, \({\color{blue}{\gamma^n V(s_{t+n})}}\)):
\[ G^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + {\color{blue}{\gamma^n V(s_{t+n})}} \]
\(n\)-step temporal-difference learning (updated in direction of error)
\[ Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha \big({\color{red}{G_t^{(n)} - Q(s_t, a_t)}} \big) \]
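A sketch of computing \(G_t^{(n)}\) from \(n\) stored rewards plus a bootstrap value, followed by the \(n\)-step update; the list-based trajectory format and function names are assumptions for the example:

```python
def n_step_return(rewards, v_bootstrap, gamma=0.99):
    """n-step return: sum of n real (discounted) rewards plus a bootstrapped tail.

    rewards      -- [r_{t+1}, ..., r_{t+n}], the n rewards actually observed
    v_bootstrap  -- V(s_{t+n}), e.g. max_a Q(s_{t+n}, a)
    """
    G = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return G + (gamma ** len(rewards)) * v_bootstrap


def n_step_td_update(Q, s_t, a_t, G_n, alpha=0.1):
    """Move Q(s_t, a_t) in the direction of the error G_t^{(n)} - Q(s_t, a_t)."""
    old = Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = old + alpha * (G_n - old)
    return Q
```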
The \(\lambda\)-return \(G_t^{\lambda}\) combines all \(n\)-step returns \(G_t^{(n)}\)
\[ G_t^{\lambda} \;=\; {\color{red}{(1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}}} G_t^{(n)} \]
Forward-view TD(\(\lambda\))
\[ Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha \big( G_t^{\lambda} - Q(s_t, a_t) \big) \]
Update value function towards the \(\lambda\)-return
Forward-view looks into the future to compute \(G^{\lambda}_t\)
Like MC, can only be computed from complete episodes
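To make the forward view concrete, here is a sketch that computes \(G_t^{\lambda}\) offline from a complete episode. Since the episode is finite, all weight beyond the final step is assigned to the full (MC) return; the list layout of rewards and value estimates is an assumption for the example:

```python
def lambda_return(rewards, values, t, lam=0.9, gamma=0.99):
    """Forward-view lambda-return G_t^lambda from a complete episode.

    rewards[k] -- r_{k+1}; the episode ends after T = len(rewards) steps
    values[k]  -- current estimate V(s_k), e.g. max_a Q(s_k, a)
    Truncated n-step returns (n < T - t) are weighted by (1 - lam) * lam**(n - 1);
    the full Monte Carlo return absorbs the remaining weight lam**(T - t - 1).
    """
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):
        G_n = sum((gamma ** k) * rewards[t + k] for k in range(n))
        G_n += (gamma ** n) * values[t + n]              # bootstrap with V(s_{t+n})
        G_lambda += (1 - lam) * (lam ** (n - 1)) * G_n
    G_mc = sum((gamma ** k) * rewards[t + k] for k in range(T - t))  # full return
    return G_lambda + (lam ** (T - t - 1)) * G_mc
```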
Forward view provides theory
Backward view provides mechanism
Update online, every step, from incomplete sequences
Credit assignment problem: did bell or light cause shock?
Frequency heuristic: assign credit to most frequent states
Recency heuristic: assign credit to most recent states
Eligibility traces combine both heuristics
\[\begin{align*} E_0(s, a) & = 0\\[2pt] E_t(s, a) & = \gamma \lambda E_{t-1}(s, a) + \mathbf{1}(s_t = s,\ a_t = a) \end{align*}\]
Keep an eligibility trace for every state-action pair \(E(s, a)\)
For every pair \((s, a)\), update \(Q(s, a)\) in proportion to TD-error \(\delta_t\) and eligibility trace \(E_t(s, a)\)
\[ \delta_t = (r_{t+1} + \gamma V(s_{t+1})) - Q(s_t, a_t) \]
\[ Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \,\delta_t E_t(s, a) \]
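A sketch of one online backward-view step: compute \(\delta_t\), decay and bump the traces, then update every traced pair in proportion to its eligibility. Using \(Q(s', a')\) as the \(V(s')\) estimate (SARSA(\(\lambda\))-style) and the dictionary layout are assumptions for the example:

```python
def backward_td_lambda_step(Q, E, s, a, r, s_next, a_next,
                            alpha=0.1, gamma=0.99, lam=0.9):
    """One online backward-view TD(lambda) step over tabular Q and traces E."""
    # TD error: delta_t = (r + gamma * V(s')) - Q(s, a), with Q(s', a') as the V(s') estimate
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    # E_t(s, a) = gamma * lambda * E_{t-1}(s, a) + 1 for the pair just visited
    for key in list(E):
        E[key] *= gamma * lam
    E[(s, a)] = E.get((s, a), 0.0) + 1.0
    # Update every state-action pair in proportion to its eligibility
    for key, e in E.items():
        Q[key] = Q.get(key, 0.0) + alpha * delta * e
    return Q, E
```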
When \(\lambda = 0\), only the current state-action pair is updated.
\[E_t(s, a) = \mathbf{1}(s_t = s,\ a_t = a)\] \[Q(s, a) \leftarrow Q(s, a) + \alpha\,\delta_t\,E_t(s, a)\]
This is exactly equivalent to the TD(\(0\)) update.
\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\delta_t\]
When \(\lambda = 1\), credit is deferred until the end of the episode.
Theorem
The sum of offline updates is identical for forward-view and backward-view TD(\(\lambda\)):
\[ \sum_{t=1}^{T} \alpha\,\delta_t\,E_t(s, a) \;=\; \sum_{t=1}^{T} \alpha \bigl( G_t^{\lambda} - Q(s_t, a_t) \bigr)\,\mathbf{1}(s_t = s, a_t = a). \]