Appendix (Advanced Topics)


Convergence & Contraction Mapping Theorem (Advanced Topic)

Convergence?

How do we know that value iteration converges to \(v_*\)?

Or that iterative policy evaluation converges to \(v_{\pi}\)?

And therefore that policy iteration converges to \(v_*\)?

Is the solution unique?

How fast do these algorithms converge?

These questions are resolved by the contraction mapping theorem

Value Function Space

Consider the vector space \(\mathcal{V}\) over value functions

  • There are \(|\mathcal{S}|\) dimensions

  • Each point in this space fully specifies a value function \(v(s)\)

What does a Bellman backup do to points in this space?

  • We will show that it brings value functions closer,

  • \(\ldots\) and therefore the backups must converge on a unique solution

Value Function \(\infty\)-Norm

We will measure distance between state-value functions \(u\) and \(v\) by the \(\infty\)-norm

i.e. the largest difference between state values,

\[ \|u - v\|_{\infty} = \max_{s \in \mathcal{S}} |u(s) - v(s)| \]
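For instance, with hypothetical value estimates over three states, this distance is just the largest componentwise gap (a minimal sketch):

    import numpy as np

    # Hypothetical value estimates for a 3-state MDP
    u = np.array([1.0, 2.0, 0.5])
    v = np.array([0.8, 2.5, 0.5])

    # Infinity-norm distance: the largest difference between state values (0.5 here)
    dist = np.max(np.abs(u - v))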

Bellman Expectation Backup is a Contraction

Define the Bellman expectation backup operator \(T^{\pi}\)

\[ T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v \]

This operator is a \(\gamma\)-contraction, i.e. it makes value functions closer by at least \(\gamma\):

\[ \begin{align*} \|T^{\pi}(u) - T^{\pi}(v)\|_{\infty} &= \|(\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} u) - (\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v)\|_{\infty} \\[4pt] &= \|\gamma \mathcal{P}^{\pi} (u - v)\|_{\infty} \\[4pt] &\le \|\gamma \mathcal{P}^{\pi}\|_{\infty} \, \|u - v\|_{\infty} \\[4pt] &\le \gamma \, \|u - v\|_{\infty} \end{align*} \]

where the last step uses the fact that \(\mathcal{P}^{\pi}\) is a stochastic matrix (its rows sum to one), so its induced \(\infty\)-norm is 1.
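As a quick numerical sanity check, here is a minimal Python sketch of the \(\gamma\)-contraction property; the MDP quantities are randomly generated placeholders, not from any particular example:

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma = 5, 0.9

    # Placeholder MDP under a fixed policy: row-stochastic P_pi, reward vector R_pi
    P_pi = rng.random((n, n))
    P_pi /= P_pi.sum(axis=1, keepdims=True)
    R_pi = rng.random(n)

    def T_pi(v):
        # Bellman expectation backup: T^pi(v) = R^pi + gamma * P^pi v
        return R_pi + gamma * P_pi @ v

    u, v = rng.random(n), rng.random(n)
    before = np.max(np.abs(u - v))                 # ||u - v||_inf
    after = np.max(np.abs(T_pi(u) - T_pi(v)))      # ||T(u) - T(v)||_inf
    assert after <= gamma * before + 1e-12         # gamma-contraction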

Contraction Mapping Theorem

Theorem (Contraction Mapping Theorem)

For any complete metric space \(\mathcal{V}\) that is closed under an operator \(T(v)\), where \(T\) is a \(\gamma\)-contraction,

  • Iterating \(T\) converges to a unique fixed point

  • At a linear convergence rate of \(\gamma\) (see the error bound below)
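For example, applied to \(T^{\pi}\) with fixed point \(v_{\pi}\) and iterates \(v_{k} = T^{\pi}(v_{k-1})\), repeated use of the contraction property gives the geometric error bound that "linear convergence at rate \(\gamma\)" refers to:

\[ \|v_{k} - v_{\pi}\|_{\infty} = \|T^{\pi}(v_{k-1}) - T^{\pi}(v_{\pi})\|_{\infty} \le \gamma \|v_{k-1} - v_{\pi}\|_{\infty} \le \dots \le \gamma^{k} \|v_{0} - v_{\pi}\|_{\infty} \to 0 \]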

Convergence of Policy Evaluation and Policy Iteration

The Bellman expectation operator \(T^{\pi}\) has a unique fixed point

\(v_{\pi}\) is a fixed point of \(T^{\pi}\) (by the Bellman expectation equation)

By the contraction mapping theorem

Iterative policy evaluation converges on \(v_{\pi}\)

Policy iteration converges on \(v_{*}\)
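A minimal sketch of iterative policy evaluation in Python, again on a randomly generated placeholder MDP, illustrating convergence to the fixed point \(v_{\pi}\):

    import numpy as np

    rng = np.random.default_rng(1)
    n, gamma = 4, 0.9

    # Placeholder MDP under a fixed policy
    P_pi = rng.random((n, n))
    P_pi /= P_pi.sum(axis=1, keepdims=True)
    R_pi = rng.random(n)

    # Exact fixed point: v_pi = (I - gamma * P_pi)^{-1} R_pi
    v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

    # Iterative policy evaluation: repeatedly apply the Bellman expectation backup
    v = np.zeros(n)
    for _ in range(500):
        v = R_pi + gamma * P_pi @ v

    assert np.max(np.abs(v - v_pi)) < 1e-10        # converged to v_pi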

Bellman Optimality Backup is a Contraction

Define the Bellman optimality backup operator \(T^{*}\),

\[ T^{*}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right) \]

This operator is a \(\gamma\)-contraction,
i.e. it makes value functions closer by at least \(\gamma\)
(similar to previous proof)

\[ \|T^{*}(u) - T^{*}(v)\|_{\infty} \le \gamma \|u - v\|_{\infty} \]
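The only extra ingredient relative to the previous proof is that the max operator is non-expansive: for any functions \(f\) and \(g\) over actions,

\[ \Bigl| \max_{a \in \mathcal{A}} f(a) - \max_{a \in \mathcal{A}} g(a) \Bigr| \;\le\; \max_{a \in \mathcal{A}} |f(a) - g(a)| \]

Applying this state-by-state to the bracketed terms in \(T^{*}(u)\) and \(T^{*}(v)\), and then bounding \(\|\gamma \mathcal{P}^{a}(u - v)\|_{\infty} \le \gamma \|u - v\|_{\infty}\) as before, gives the inequality above.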

Convergence of Value Iteration

The Bellman optimality operator \(T^*\) has a unique fixed point

\(v_*\) is a fixed point of \(T^*\) (by the Bellman optimality equation)

By the contraction mapping theorem

Value iteration converges on \(v_*\)
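A corresponding sketch of value iteration on a randomly generated placeholder MDP; the loop repeatedly applies the Bellman optimality backup until it reaches (numerically) its unique fixed point:

    import numpy as np

    rng = np.random.default_rng(2)
    n_s, n_a, gamma = 4, 3, 0.9

    # Placeholder MDP: P[a] is a row-stochastic transition matrix, R[a] a reward vector
    P = rng.random((n_a, n_s, n_s))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_a, n_s))

    # Value iteration: repeatedly apply the Bellman optimality backup T*
    v = np.zeros(n_s)
    for _ in range(500):
        v = np.max(R + gamma * (P @ v), axis=0)    # max over actions, per state

    # v is (numerically) the unique fixed point of T*
    assert np.max(np.abs(v - np.max(R + gamma * (P @ v), axis=0))) < 1e-10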

Relationship Between Forward and Backward TD (Advanced Topic)

TD(\(\lambda\)) and TD(\(0\))

When \(\lambda = 0\), only the current state is updated

\[ \begin{align*} E_t(s) & = \mathbf{1}(S_t = s)\\[0pt] V(s) & \leftarrow V(s) + \alpha \delta_t E_t(s) \end{align*} \]

This is exactly equivalent to the TD(\(0\)) update

\[ V(S_t) \leftarrow V(S_t) + \alpha \delta_t \]
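For concreteness, a minimal tabular sketch of this update on a single transition; all numbers are hypothetical:

    import numpy as np

    # Tabular TD(0) update on one observed transition
    alpha, gamma = 0.1, 0.9
    V = np.zeros(5)                          # value estimates for 5 states

    s, r, s_next = 0, 1.0, 2                 # S_t, R_{t+1}, S_{t+1}
    delta = r + gamma * V[s_next] - V[s]     # TD error delta_t
    V[s] += alpha * delta                    # only the current state is updated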

TD(\(\lambda\)) and MC

When \(\lambda = 1\), credit is deferred until the end of the episode

  • Consider episodic environments with offline updates

  • Over the course of an episode, total update for TD(\(1\)) is the same as total update for MC

Theorem

The sum of offline updates is identical for forward-view and backward-view TD(\(\lambda\))

\[ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \sum_{t=1}^{T} \alpha \left( G_t^{\lambda} - V(S_t) \right) \mathbf{1}(S_t = s) \]
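The theorem can be checked numerically. The sketch below builds a small hypothetical episode with fixed value estimates (the offline setting), computes the forward-view \(\lambda\)-return updates and the backward-view eligibility-trace updates, and confirms the per-state totals coincide; all names and numbers are illustrative:

    import numpy as np

    # Toy episode: states S_0..S_{T-1}, rewards R_1..R_T, then termination
    gamma, lam, alpha = 0.9, 0.7, 0.1
    states  = [0, 1, 0, 2]                   # state 0 is visited twice
    rewards = [1.0, 0.0, 2.0, 1.0]
    V = np.array([0.5, -0.2, 0.3])           # fixed values (offline: not updated mid-episode)
    T = len(states)

    def V_of(t):
        # value of S_t, with V(terminal) = 0
        return V[states[t]] if t < T else 0.0

    def lambda_return(t):
        # Episodic lambda-return G_t^lambda built from n-step returns
        G_lam = 0.0
        for n in range(1, T - t):            # n-step returns that bootstrap from V
            G_n = sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * V_of(t + n)
            G_lam += (1 - lam) * lam**(n - 1) * G_n
        G_full = sum(gamma**k * rewards[t + k] for k in range(T - t))   # full MC return
        return G_lam + lam**(T - t - 1) * G_full

    # Forward view: sum of lambda-return updates, per state
    forward = np.zeros(len(V))
    for t in range(T):
        forward[states[t]] += alpha * (lambda_return(t) - V[states[t]])

    # Backward view: sum of TD-error updates weighted by accumulating eligibility traces
    backward = np.zeros(len(V))
    E = np.zeros(len(V))
    for t in range(T):
        delta = rewards[t] + gamma * V_of(t + 1) - V[states[t]]
        E *= gamma * lam
        E[states[t]] += 1.0
        backward += alpha * delta * E

    assert np.allclose(forward, backward)    # identical total (offline) updates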

MC and TD(\(1\))

Consider an episode where \(s\) is visited once at time-step \(k\)

  • TD(1) eligibility trace discounts time since visit:

\[ \begin{align*} E_t(s) & = \gamma E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ \gamma^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]

  • TD(1) updates accumulate error online:

\[ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha (G_k - V(S_k)) \]

  • By the end of the episode, it accumulates total error:

\[ \delta_k + \gamma \delta_{k+1} + \gamma^2 \delta_{k+2} + \dots + \gamma^{T-1-k} \delta_{T-1} \]

Telescoping in TD(\(1\))

When \(\lambda = 1\), the sum of TD errors telescopes into the Monte Carlo (MC) error (using \(V(S_T) = 0\) at the terminal state):

\[ \begin{align*} & \delta_t + \gamma \delta_{t+1} + \gamma^{2}\delta_{t+2} + \dots + \gamma^{T-1-t}\delta_{T-1}\\[0pt] & =\; R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\\[0pt] & +\; \gamma R_{t+2} + \gamma^{2} V(S_{t+2}) - \gamma V(S_{t+1})\\[0pt] & +\; \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3}) - \gamma^{2} V(S_{t+2})\\[0pt] & \quad \quad \quad \vdots\\[0pt] & +\; \gamma^{T-1-t} R_T + \gamma^{T-t} V(S_T) - \gamma^{T-1-t} V(S_{T-1})\\[0pt] & =\; R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots + \gamma^{T-1-t} R_T - V(S_t)\\[0pt] & =\; G_t - V(S_t) \end{align*} \]

TD(\(\lambda\)) and TD(\(1\))

TD(\(1\)) is roughly equivalent to every-visit Monte-Carlo

Error is accumulated online, step-by-step

If the value function is only updated offline, at the end of the episode

Then total update is exactly the same as MC

Telescoping in TD(\(\lambda\))

For general \(\lambda\), TD errors also telescope to the \(\lambda\)-error, \(G_t^{\lambda} - V(S_t)\)

\[ \begin{align*} G_t^{\lambda} - V(S_t) &= -V(S_t) + (1-\lambda)\lambda^{0}\left(R_{t+1} + \gamma V(S_{t+1})\right) \\[0pt] &\quad + (1-\lambda)\lambda^{1}\left(R_{t+1} + \gamma R_{t+2} + \gamma^{2} V(S_{t+2})\right) \\[0pt] &\quad + (1-\lambda)\lambda^{2}\left(R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3})\right) + \dots \\[0pt] &= -V(S_t) + (\gamma\lambda)^{0}\left(R_{t+1} + \gamma V(S_{t+1}) - \gamma\lambda V(S_{t+1})\right) \\[0pt] &\quad + (\gamma\lambda)^{1}\left(R_{t+2} + \gamma V(S_{t+2}) - \gamma\lambda V(S_{t+2})\right) \\[0pt] &\quad + (\gamma\lambda)^{2}\left(R_{t+3} + \gamma V(S_{t+3}) - \gamma\lambda V(S_{t+3})\right) + \dots \\[0pt] &= (\gamma\lambda)^{0}\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right) \\[0pt] &\quad + (\gamma\lambda)^{1}\left(R_{t+2} + \gamma V(S_{t+2}) - V(S_{t+1})\right) \\[0pt] &\quad + (\gamma\lambda)^{2}\left(R_{t+3} + \gamma V(S_{t+3}) - V(S_{t+2})\right) + \dots \\[0pt] &= \delta_t + \gamma\lambda\,\delta_{t+1} + (\gamma\lambda)^{2}\delta_{t+2} + \dots \end{align*} \]

Forwards and Backwards TD(\(\lambda\))

Consider an episode where \(s\) is visited once at time-step \(k\)

TD(\(\lambda\)) eligibility trace discounts time since visit:

\[ \begin{align*} E_t(s) & = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ (\gamma \lambda)^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]

Backward TD(\(\lambda\)) updates accumulate error online:

\[ \sum_{t=1}^{T} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T} (\gamma \lambda)^{t-k} \delta_t = \alpha \left(G_k^{\lambda} - V(S_k)\right) \]

  • By the end of the episode, it accumulates total error for the \(\lambda\)-return

  • For multiple visits to \(s\), \(E_t(s)\) accumulates many errors
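A minimal tabular sketch of the backward view as an online algorithm, with accumulating eligibility traces; the episode encoding and constants are hypothetical:

    import numpy as np

    def td_lambda_episode(episode, V, alpha=0.1, gamma=0.9, lam=0.8):
        # Backward-view TD(lambda) with accumulating eligibility traces.
        # episode: list of (state, reward, next_state) transitions,
        # with next_state = None at termination; states index into V.
        E = np.zeros_like(V)                   # eligibility traces E_t(s)
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else V[s_next]
            delta = r + gamma * v_next - V[s]  # TD error delta_t
            E *= gamma * lam                   # decay every trace
            E[s] += 1.0                        # accumulate at the visited state
            V += alpha * delta * E             # online update of all states
        return V

    # Usage on a toy 3-state episode (numbers hypothetical)
    V = td_lambda_episode([(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, None)], np.zeros(3))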

Offline Equivalence of Forward and Backward TD

Offline updates

  • Updates are accumulated within the episode,

  • but applied in batch at the end of the episode

  • Forward-view and backward-view TD(\(\lambda\)) are then equivalent (by the theorem above)

Online Equivalence of Forward and Backward TD

Online updates

  • TD(\(\lambda\)) updates are applied online at each step within episode

  • Forward and backward-view TD(\(\lambda\)) are slightly different

  • Exact online TD(\(\lambda\)) achieves perfect equivalence

  • By using a slightly different form of eligibility trace


Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton; True Online Temporal-Difference Learning. Journal of Machine Learning Research (JMLR), 17(145):1−40, 2016.

Summary of Forward and Backward TD(\(\lambda\))

\[ \begin{array}{|c|c|c|c|} \hline {\color{red}{\text{Offline updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & = & = & = \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] \hline {\color{red}{\text{Online updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & = & \neq & \neq \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] & = & = & = \\[0pt] \text{Exact Online} & \text{TD(0)} & \text{Exact Online TD}(\lambda) & \text{Exact Online TD(1)} \\ \hline \end{array} \]

Here \(=\) indicates equivalence in total update at the end of the episode.