How do we know that value iteration converges to \(v_*\)?
Or that iterative policy evaluation converges to \(v_{\pi}\)?
And therefore that policy iteration converges to \(v_*\)?
Is the solution unique?
How fast do these algorithms converge?
These questions are resolved by the contraction mapping theorem
Consider the vector space \(\mathcal{V}\) of value functions
There are \(|\mathcal{S}|\) dimensions
Each point in this space fully specifies a value function \(v(s)\)
What does a Bellman backup do to points in this space?
We will show that it brings value functions closer,
and therefore that the backups must converge on a unique solution
We will measure distance between state-value functions \(u\) and \(v\) by the \(\infty\)-norm
i.e. the largest difference between state values,
\[ \|u - v\|_{\infty} = \max_{s \in \mathcal{S}} |u(s) - v(s)| \]
Define the Bellman expectation backup operator \(T^{\pi}\)
\[ T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v \]
This operator is a \(\gamma\)-contraction, i.e. it makes value functions closer by at least \(\gamma\):
\[ \begin{align*} \|T^{\pi}(u) - T^{\pi}(v)\|_{\infty} &= \|(\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} u) - (\mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v)\|_{\infty} \\[4pt] &= \|\gamma \mathcal{P}^{\pi} (u - v)\|_{\infty} \\[4pt] &\le \|\gamma \mathcal{P}^{\pi}\|_{\infty} \, \|u - v\|_{\infty} \\[4pt] &\le \gamma \, \|u - v\|_{\infty} \end{align*} \]
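This can be sanity-checked numerically. The sketch below (assuming NumPy; the MDP, its size, and the seed are made up) applies \(T^{\pi}\) to two arbitrary value functions and verifies that their \(\infty\)-norm distance shrinks by at least a factor of \(\gamma\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9            # assumed sizes, not from the text

# Random policy-conditioned MDP: row-stochastic P^pi and reward vector R^pi
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n_states)

def T_pi(v):
    """Bellman expectation backup: T^pi(v) = R^pi + gamma * P^pi v."""
    return R + gamma * P @ v

u, v = rng.random(n_states), rng.random(n_states)
lhs = np.max(np.abs(T_pi(u) - T_pi(v)))   # ||T(u) - T(v)||_inf
rhs = gamma * np.max(np.abs(u - v))       # gamma * ||u - v||_inf
assert lhs <= rhs + 1e-12                 # the gamma-contraction bound holds
```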
Theorem (Contraction Mapping Theorem)
For any complete metric space \(\mathcal{V}\) that is closed under an operator \(T(v)\), where \(T\) is a \(\gamma\)-contraction,
repeated application of \(T\) converges to a unique fixed point
at a linear convergence rate of \(\gamma\)
The Bellman expectation operator \(T^{\pi}\) has a unique fixed point
\(v_{\pi}\) is a fixed point of \(T^{\pi}\) (by the Bellman expectation equation)
By the contraction mapping theorem
Iterative policy evaluation converges on \(v_{\pi}\)
Policy iteration converges on \(v_{*}\)
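To make the rate explicit (a short derivation from the facts above): since \(v_{\pi} = T^{\pi}(v_{\pi})\), applying the contraction \(k\) times to any starting value function \(v\) gives
\[ \|(T^{\pi})^{k}(v) - v_{\pi}\|_{\infty} = \|(T^{\pi})^{k}(v) - (T^{\pi})^{k}(v_{\pi})\|_{\infty} \le \gamma^{k}\,\|v - v_{\pi}\|_{\infty} \longrightarrow 0 \]
so each sweep of iterative policy evaluation shrinks the worst-case error by at least a factor of \(\gamma\).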
Define the Bellman optimality backup operator \(T^{*}\),
\[ T^{*}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right) \]
This operator is a \(\gamma\)-contraction,
i.e. it makes value functions closer by at least \(\gamma\)
(similar to previous proof)
\[ \|T^{*}(u) - T^{*}(v)\|_{\infty} \le \gamma \|u - v\|_{\infty} \]
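The omitted step is the elementary inequality \(\left|\max_a f(a) - \max_a g(a)\right| \le \max_a |f(a) - g(a)|\); applied state-wise, it gives a sketch of the "similar" argument:
\[ \begin{align*} |T^{*}(u)(s) - T^{*}(v)(s)| &\le \max_{a \in \mathcal{A}} \bigl|(\mathcal{R}^{a} + \gamma \mathcal{P}^{a} u)(s) - (\mathcal{R}^{a} + \gamma \mathcal{P}^{a} v)(s)\bigr| \\[4pt] &= \gamma \max_{a \in \mathcal{A}} \bigl|\bigl(\mathcal{P}^{a}(u - v)\bigr)(s)\bigr| \\[4pt] &\le \gamma \|u - v\|_{\infty} \end{align*} \]
and taking the maximum over \(s\) yields the bound stated above.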
The Bellman optimality operator \(T^*\) has a unique fixed point
\(v_*\) is a fixed point of \(T^*\) (by Bellman optimality equation)
By the contraction mapping theorem
Value iteration converges on \(v_*\)
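For illustration, here is a minimal value-iteration loop that just applies \(T^{*}\) repeatedly until the values stop changing (again assuming NumPy; the per-action matrices \(\mathcal{P}^{a}\) and rewards \(\mathcal{R}^{a}\) are made up).

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9   # assumed sizes, not from the text

# One transition matrix and reward vector per action (made-up MDP)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def T_star(v):
    """Bellman optimality backup: max_a (R^a + gamma * P^a v), element-wise over states."""
    return np.max(R + gamma * P @ v, axis=0)

v = np.zeros(n_states)
for _ in range(1000):                    # value iteration: repeated T* backups
    v_next = T_star(v)
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next

# At convergence, v satisfies the Bellman optimality equation (up to tolerance)
assert np.max(np.abs(T_star(v) - v)) < 1e-8
```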
Turning to TD(\(\lambda\)) with eligibility traces: when \(\lambda = 0\), only the current state is updated
\[ \begin{align*} E_t(s) & = \mathbf{1}(S_t = s)\\[0pt] V(s) & \leftarrow V(s) + \alpha \delta_t E_t(s) \end{align*} \]
This is exactly equivalent to the TD(\(0\)) update
\[ V(S_t) \leftarrow V(S_t) + \alpha \delta_t \]
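Here is a minimal sketch of one backward-view update step in the tabular case (assuming NumPy; the state indices, step size, and example transition are made up). With \(\lambda = 0\) the decay wipes the old trace, leaving exactly the indicator above, so only \(V(S_t)\) changes.

```python
import numpy as np

n_states, alpha, gamma, lam = 5, 0.1, 0.9, 0.0     # lam = 0 -> exactly TD(0)
V = np.zeros(n_states)
E = np.zeros(n_states)                              # eligibility trace E_t(s)

def td_lambda_step(V, E, s, r, s_next, terminal):
    """One backward-view TD(lambda) step: decay traces, mark the visited state, update every state."""
    E *= gamma * lam                                # with lam = 0 this wipes all old traces
    E[s] += 1.0                                     # E_t(s) = gamma*lam*E_{t-1}(s) + 1(S_t = s)
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]   # TD error delta_t
    V += alpha * delta * E                          # lam = 0 -> only V[s] is updated (TD(0))

# hypothetical transition: state 2 -> state 3 with reward 1.0
td_lambda_step(V, E, s=2, r=1.0, s_next=3, terminal=False)
```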
When \(\lambda = 1\), credit is deferred until the end of the episode
Consider episodic environments with offline updates
Over the course of an episode, total update for TD(\(1\)) is the same as total update for MC
Theorem
The sum of offline updates is identical for forward-view and backward-view TD(\(\lambda\))
\[ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \sum_{t=1}^{T-1} \alpha \left( G_t^{\lambda} - V(S_t) \right) \mathbf{1}(S_t = s) \]
First consider \(\lambda = 1\) and an episode where \(s\) is visited once, at time-step \(k\)
The TD(\(1\)) eligibility trace discounts time since the visit:
\[ \begin{align*} E_t(s) & = \gamma E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ \gamma^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]
TD(\(1\)) updates accumulate error online, so the total offline update for \(s\) is
\[ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t = \alpha (G_k - V(S_k)) \]
i.e. by the end of the episode it has accumulated the discounted TD errors
\[ \delta_k + \gamma \delta_{k+1} + \gamma^2 \delta_{k+2} + \dots + \gamma^{T-1-k} \delta_{T-1} \]
When \(\lambda = 1\), the sum of TD errors telescopes into the Monte Carlo (MC) error (using \(V(S_T) = 0\) at the terminal state):
\[ \begin{align*} & \delta_t + \gamma \delta_{t+1} + \gamma^{2}\delta_{t+2} + \dots + \gamma^{T-1-t}\delta_{T-1}\\[0pt] & =\; R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\\[0pt] & +\; \gamma R_{t+2} + \gamma^{2} V(S_{t+2}) - \gamma V(S_{t+1})\\[0pt] & +\; \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3}) - \gamma^{2} V(S_{t+2})\\[0pt] & \quad \quad \quad \vdots\\[0pt] & +\; \gamma^{T-1-t} R_T + \gamma^{T-t} V(S_T) - \gamma^{T-1-t} V(S_{T-1})\\[0pt] & =\; R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots + \gamma^{T-1-t} R_T - V(S_t)\\[0pt] & =\; G_t - V(S_t) \end{align*} \]
TD(\(1\)) is roughly equivalent to every-visit Monte-Carlo
Error is accumulated online, step-by-step
If the value function is only updated offline, at the end of the episode,
then the total update is exactly the same as MC
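This can be checked on a toy episode (a sketch; the trajectory, rewards, and value table below are made up, and the last state is treated as terminal with value 0): accumulating \(\alpha \delta_t E_t(s)\) offline with \(\lambda = 1\) reproduces the MC update \(\alpha (G_k - V(S_k))\) for the visited state.

```python
import numpy as np

gamma, alpha = 0.9, 0.1
V = np.array([0.3, -0.2, 0.5, 0.0])          # made-up values; state 3 is terminal (V = 0)
states  = [0, 1, 2]                           # S_0, S_1, S_2, then terminal state 3
rewards = [1.0, 0.0, 2.0]                     # R_1, R_2, R_3
k, s = 0, 0                                   # state s = 0 is visited once, at time k = 0

# TD errors delta_t = R_{t+1} + gamma V(S_{t+1}) - V(S_t), with V(terminal) = 0
next_vals = [V[1], V[2], 0.0]
deltas = [rewards[t] + gamma * next_vals[t] - V[states[t]] for t in range(3)]

# Offline TD(1): accumulated update for s is alpha * sum_t gamma^{t-k} delta_t
td1_update = alpha * sum(gamma ** (t - k) * deltas[t] for t in range(k, 3))

# Monte-Carlo: alpha * (G_k - V(S_k)), with G_k the discounted return from time k
G_k = sum(gamma ** (t - k) * rewards[t] for t in range(k, 3))
mc_update = alpha * (G_k - V[s])

assert abs(td1_update - mc_update) < 1e-12
```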
For general \(\lambda\), TD errors also telescope to the \(\lambda\)-error, \(G_t^{\lambda} - V(S_t)\)
\[ \begin{align*} G_t^{\lambda} - V(S_t) &= -V(S_t) + (1-\lambda)\lambda^{0}\bigl(R_{t+1} + \gamma V(S_{t+1})\bigr)\\[0pt] &\qquad\; + (1-\lambda)\lambda^{1}\bigl(R_{t+1} + \gamma R_{t+2} + \gamma^{2} V(S_{t+2})\bigr)\\[0pt] &\qquad\; + (1-\lambda)\lambda^{2}\bigl(R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \gamma^{3} V(S_{t+3})\bigr) + \dots\\[0pt] &= -V(S_t) + (\gamma\lambda)^{0}\bigl(R_{t+1} + \gamma V(S_{t+1}) - \gamma\lambda V(S_{t+1})\bigr)\\[0pt] &\qquad\; + (\gamma\lambda)^{1}\bigl(R_{t+2} + \gamma V(S_{t+2}) - \gamma\lambda V(S_{t+2})\bigr)\\[0pt] &\qquad\; + (\gamma\lambda)^{2}\bigl(R_{t+3} + \gamma V(S_{t+3}) - \gamma\lambda V(S_{t+3})\bigr) + \dots\\[0pt] &= (\gamma\lambda)^{0}\bigl(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr)\\[0pt] &\qquad\; + (\gamma\lambda)^{1}\bigl(R_{t+2} + \gamma V(S_{t+2}) - V(S_{t+1})\bigr)\\[0pt] &\qquad\; + (\gamma\lambda)^{2}\bigl(R_{t+3} + \gamma V(S_{t+3}) - V(S_{t+2})\bigr) + \dots\\[0pt] &= \delta_t + \gamma\lambda\,\delta_{t+1} + (\gamma\lambda)^{2}\delta_{t+2} + \dots \end{align*} \]
Consider an episode where \(s\) is visited once at time-step \(k\)
TD(\(\lambda\)) eligibility trace discounts time since visit:
\[ \begin{align*} E_t(s) & = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)\\[0pt] & = \begin{cases} 0, & \text{if } t < k \\ (\gamma \lambda)^{t-k}, & \text{if } t \ge k \end{cases} \end{align*} \]
Backward TD(\(\lambda\)) updates accumulate error online:
\[ \sum_{t=1}^{T-1} \alpha \delta_t E_t(s) = \alpha \sum_{t=k}^{T-1} (\gamma \lambda)^{t-k} \delta_t = \alpha \left(G_k^{\lambda} - V(S_k)\right) \]
By the end of the episode, it accumulates total error for the \(\lambda\)-return
For multiple visits to \(s\), \(E_t(s)\) accumulates many errors
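The same kind of numeric check works for the full theorem, with general \(\lambda\) and a state visited more than once (again a sketch on a made-up episode, assuming NumPy): the offline sum of backward-view updates \(\sum_t \alpha \delta_t E_t(s)\) matches the forward-view sum \(\sum_t \alpha (G_t^{\lambda} - V(S_t))\mathbf{1}(S_t = s)\) for every state.

```python
import numpy as np

gamma, lam, alpha = 0.9, 0.7, 0.1
V = np.array([0.3, -0.2, 0.5, 0.1, 0.0])      # made-up values; state 4 is terminal (V = 0)
states  = [0, 2, 0, 1]                         # state 0 is visited twice
rewards = [1.0, -0.5, 2.0, 0.5]                # R_1 .. R_T
T = len(states)
succ = states[1:] + [4]                        # S_{t+1} for each step (4 = terminal)

def n_step_return(t, n):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    r = sum(gamma ** i * rewards[t + i] for i in range(n))
    bootstrap = 0.0 if t + n >= T else gamma ** n * V[states[t + n]]
    return r + bootstrap

def lambda_return(t):
    """Forward-view lambda-return, truncated at the end of the episode."""
    g = (1 - lam) * sum(lam ** (n - 1) * n_step_return(t, n) for n in range(1, T - t))
    return g + lam ** (T - t - 1) * n_step_return(t, T - t)   # final term is the full return

# TD errors delta_t, using V(terminal) = 0
deltas = [rewards[t] + gamma * V[succ[t]] - V[states[t]] for t in range(T)]

# Backward view: offline sum of alpha * delta_t * E_t(s), accumulating traces
E = np.zeros(5)
backward = np.zeros(5)
for t in range(T):
    E *= gamma * lam
    E[states[t]] += 1.0
    backward += alpha * deltas[t] * E

# Forward view: offline sum of alpha * (G_t^lambda - V(S_t)) * 1(S_t = s)
forward = np.zeros(5)
for t in range(T):
    forward[states[t]] += alpha * (lambda_return(t) - V[states[t]])

assert np.allclose(backward, forward)
```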
Offline updates
Updates are accumulated within episode
but applied in batch at the end of episode
Online updates
TD(\(\lambda\)) updates are applied online at each step within episode
Forward and backward-view TD(\(\lambda\)) are slightly different
Exact online TD(\(\lambda\)) achieves perfect equivalence
By using a slightly different form of eligibility trace
Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, and Richard S. Sutton. True Online Temporal-Difference Learning. Journal of Machine Learning Research (JMLR), 17(145):1–40, 2016.
\[ \begin{array}{|c|c|c|c|} \hline {\color{red}{\text{Offline updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & || & || & || \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] \hline {\color{red}{\text{Online updates}}} & {\color{red}{\lambda = 0}} & {\color{red}{\lambda \in (0,1)}} & {\color{red}{\lambda = 1}} \\ \hline \text{Backward view} & \text{TD(0)} & \text{TD}(\lambda) & \text{TD(1)} \\[0pt] & || & \neq & \neq \\[0pt] \text{Forward view} & \text{TD(0)} & \text{Forward TD}(\lambda) & \text{MC} \\[0pt] & || & || & || \\[0pt] \text{Exact Online} & \text{TD(0)} & \text{Exact Online TD}(\lambda) & \text{Exact Online TD(1)} \\ \hline \end{array} \]
Here \(=\) (shown vertically as \(||\) in the table) indicates equivalence in total update at the end of the episode.