Reinforcement learning can be used to solve large problems, e.g.
Backgammon: \(10^{20}\) states
Computer Go: \(10^{170}\) states
The number of distinct 1-minute, 60 fps, 4K, 24-bit videos: \(\approx 10^{10,216,000,000,000}\)
Humanoid Robot: continuous state space, uncountable number of states
How can we scale up the model-free methods for prediction and control using generalisation?
So far we have represented the value function by a lookup table
Every state \(s\) has an entry \(V(s)\),
or every state-action pair \((s,a)\) has an entry \(Q(s,a)\), which is sufficient to do control. However \(\ldots\)
Problem with large MDPs
Too many states and/or actions to store in memory
Too slow to learn the value of each state individually
Solution for large MDPs
Estimate value function with function approximation
\[\begin{align*} \hat{v}(s, {\color{red}{\mathbf{w}}})\; & {\color{blue}{\approx}}\; v_{\pi}(s)\\[0pt] \text{or}\;\;\;\hat{q}(s, a, {\color{red}{\mathbf{w}}})\; & {\color{blue}{\approx}}\; q_{\pi}(s, a) \end{align*}\]
Generalises from seen states to unseen states
Update parameter \({\color{red}{\mathbf{w}}}\) (a vector) using MC or TD learning
Left: value function approximation estimates the expected return from an input state. Middle: action-in value function approximation (state and action as input). Right: action-out value function approximation (state as input, one output per action, e.g. Atari)
There are many possible function approximators, e.g.
Linear combinations of features
Neural networks
Decision trees
Nearest neighbours
Fourier / wavelet bases
\(\cdots\)
We restrict attention to differentiable function approximators. Revisiting the candidates:
Linear combinations of features
Neural networks (sub-symbolic, e.g. convolutional NNs (CNNs), recurrent NNs (RNNs), transformers, etc.)
Decision trees (symbolic, e.g. if-then-else rules; not differentiable)
Nearest neighbours (e.g. clustering in ML; not differentiable)
Fourier / wavelet bases (e.g. transform coding in video compression; not differentiable)
\(\cdots\)
Furthermore, we require a training method that is suitable for
non-stationary
non-i.i.d. data (not independent and identically distributed)
Let \(J(\mathbf{w})\) be a differentiable objective function of parameter vector \(\mathbf{w}\)
Define the gradient of \(J(\mathbf{w})\) to be
\[
{\color{red}{\nabla_{\textbf{w}} J(\mathbf{w})}} =
\begin{pmatrix}
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}_1} \\
\vdots \\
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}_n}
\end{pmatrix}
\]
To find a local minimum of \(J(\mathbf{w})\)
Adjust \(\mathbf{w}\) in direction of negative gradient
\(\;\;\;\;\;\;\Delta \mathbf{w} = -\tfrac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w})\)
i.e. move downhill (\(\nabla\) is the gradient operator for vector arguments and \(\alpha\) is a step-size parameter)
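As a minimal illustration, here is a NumPy sketch of gradient descent on a small quadratic objective; the objective, matrix `A`, vector `b`, and step size are illustrative assumptions, not part of the lecture:

```python
import numpy as np

# Illustrative quadratic objective J(w) = ||A w - b||^2 and its gradient.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def J(w):
    r = A @ w - b
    return float(r @ r)

def grad_J(w):
    return 2.0 * A.T @ (A @ w - b)

alpha = 0.05          # step-size parameter
w = np.zeros(2)       # initial parameter vector

for _ in range(200):
    w -= alpha * grad_J(w)   # move downhill: delta_w = -alpha * grad J(w)

print(w, J(w))
```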
Goal: find parameter vector \(\mathbf{w}\) minimising mean-squared error between the approximate value function \(\hat{v}(s, \mathbf{w})\) and the true value function \(v_\pi(s)\)
\[ J(\mathbf{w}) = \mathbb{E}_\pi \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right]\;\;\;\; {\color{blue}{\text{imagine an oracle knows}\ v_{\pi}(S)}} \]
Gradient descent finds a local minimum
\[\begin{align*} \Delta \mathbf{w} & = -\tfrac{1}{2}\alpha \nabla_\mathbf{w} J(\mathbf{w})\\[0pt] & = \alpha \, \mathbb{E}_\pi \Big[(v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S, \mathbf{w})\Big] \end{align*}\]
Stochastic gradient descent samples this gradient one state at a time
\[
\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S, \mathbf{w})
\]
Expected update equals the full gradient update.
However, this is essentially supervised learning, as we require an oracle
Represent state by a feature vector
\[ \mathbf{x}(S) = \begin{pmatrix} \mathbf{x}_1(S) \\ \vdots \\ \mathbf{x}_n(S) \end{pmatrix} \]
For example:
Distance of robot from landmarks
Trends in the stock market
Piece and pawn configurations in chess
Represent value function by a linear combination (weighted sum) of features:
\[
\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w} = \sum_{j=1}^n \mathbf{x}_j(S) \mathbf{w}_j
\]
Objective function is quadratic in \(\mathbf{w}\):
\[
J(\mathbf{w}) = \mathbb{E}_{\pi} \big[ (v_{\pi}(S) - \mathbf{x}(S)^\top \mathbf{w})^2 \big]
\]
The objective is therefore convex, so it is easy to optimise: we can be confident we will not get stuck in a local minimum
Stochastic gradient descent converges to the global optimum.
Update rule is simple
\[ \begin{align*} \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) &= \mathbf{x}(S)\\[0pt] \Delta \mathbf{w} &= \alpha \big(v_{\pi}(S) - \hat{v}(S, \mathbf{w})\big)\,\mathbf{x}(S) \end{align*} \]
Update = step-size × prediction error × feature value.
where the prediction error is measured with respect to a supervisor or an oracle
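A minimal sketch of this update in NumPy, assuming an oracle that supplies \(v_\pi(s)\); the feature map and the oracle below are placeholders for illustration only:

```python
import numpy as np

n_features, alpha = 4, 0.1
w = np.zeros(n_features)

def features(s):
    # Placeholder feature map x(s); any fixed mapping from states to R^n would do.
    x = np.zeros(n_features)
    x[s % n_features] = 1.0
    return x

def v_hat(s, w):
    return features(s) @ w          # linear value estimate x(s)^T w

def oracle_v_pi(s):
    return float(s)                 # pretend the true value is supplied by a supervisor

for _ in range(1000):
    s = np.random.randint(0, 10)    # sample a state
    x = features(s)
    w += alpha * (oracle_v_pi(s) - v_hat(s, w)) * x   # step-size x prediction error x feature value
```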
Table lookup is a special case of linear value function approximation
Using table lookup features
\[
\mathbf{x}^{\text{table}}(S) =
\begin{pmatrix}
\mathbf{1}(S = s_1) \\
\vdots \\
\mathbf{1}(S = s_n)
\end{pmatrix}
\]
Parameter vector \(\mathbf{w}\) gives value of each individual state \[ \hat{v}(S, \mathbf{w}) = \begin{pmatrix} \mathbf{1}(S = s_1) \\ \vdots \\ \mathbf{1}(S = s_n) \end{pmatrix} \cdot \begin{pmatrix} \mathbf{w}_1 \\ \vdots \\ \mathbf{w}_n \end{pmatrix} \]
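A quick sketch of this equivalence with a small hypothetical state space: one-hot table-lookup features reduce the linear approximator to a lookup table whose entries are the weights.

```python
import numpy as np

n_states = 5

def table_features(s):
    # x_table(S): indicator 1(S = s_i) for each state s_i
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.array([0.5, -1.0, 2.0, 0.0, 3.0])   # w_i plays the role of V(s_i)

for s in range(n_states):
    # v_hat(s, w) = x(s)^T w picks out exactly w[s]
    assert table_features(s) @ w == w[s]
```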
So far assumed true value function \(v_\pi(s)\) is given by a supervisor
But in RL there is no supervisor, only rewards
In practice, we substitute a target for \(v_\pi(s)\), computed directly from the rewards we experience rather than supplied by an oracle
Monte Carlo: target = return \(G_t\)
\[
\Delta \mathbf{w} = \alpha ({\color{red}{G_t}} - \hat{v}(S_t, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w})
\]
TD(0): target = \(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\)
\[
\Delta \mathbf{w} = \alpha ({\color{red}{R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})}} - \hat{v}(S_t, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w})
\]
TD(\(\lambda\)): target = \(\lambda\)-return \(G_t^\lambda\)
\[
\Delta \mathbf{w} = \alpha ({\color{red}{G_t^\lambda}} - \hat{v}(S_t, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w})
\]
Return \(G_t\) is an unbiased, noisy sample of true value \(v_\pi(S_t)\)
We can therefore apply supervised learning incrementally to “training data” derived from Monte-Carlo episodes (roll-outs) \[ \langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle \]
For example, linear Monte-Carlo policy evaluation \[\begin{align*} \Delta \mathbf{w} & = \alpha ({\color{red}{G_t}} - \hat{v}(S_t, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w})\\[0pt] & = \alpha (G_t - \hat{v}(S_t, \mathbf{w})) \mathbf{x}(S_t) \end{align*}\]
Monte-Carlo evaluation converges to a local optimum, even for non-linear function approximation
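A minimal sketch of linear Monte-Carlo policy evaluation; the toy random-walk environment and one-hot features are assumptions for illustration, not part of the lecture:

```python
import numpy as np

n_states, gamma, alpha = 6, 1.0, 0.05
w = np.zeros(n_states)

def features(s):
    x = np.zeros(n_states)
    x[s] = 1.0                                   # one-hot features for simplicity
    return x

def run_episode():
    # Toy random walk: start in the middle, step left/right, reward +1 at the right end.
    s, trajectory = n_states // 2, []
    while 0 < s < n_states - 1:
        s_next = s + np.random.choice([-1, 1])
        reward = 1.0 if s_next == n_states - 1 else 0.0
        trajectory.append((s, reward))
        s = s_next
    return trajectory

for _ in range(2000):
    trajectory = run_episode()
    G = 0.0
    # Work backwards to form returns G_t, then apply the MC update at each visited state.
    for s, r in reversed(trajectory):
        G = r + gamma * G
        w += alpha * (G - features(s) @ w) * features(s)
```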
TD-target \(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\) is a biased sample of true value \(v_\pi(S_t)\)
Can still apply supervised learning to “training data”:
\[
\langle S_1, R_2 + \gamma \hat{v}(S_2, \mathbf{w}) \rangle,
\langle S_2, R_3 + \gamma \hat{v}(S_3, \mathbf{w}) \rangle,
\ldots, \langle S_{T-1}, R_T \rangle
\]
For example, using linear TD\((0)\)
\[\begin{align*}
\Delta \mathbf{w} & = \alpha ({\color{red}{R + \gamma \hat{v}(S^{\prime}, \mathbf{w})}} - \hat{v}(S, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S, \mathbf{w})\\[0pt]
& = \alpha \delta \, \mathbf{x}(S)
\end{align*}\]
Use TD error \({\color{red}{R + \gamma \hat{v}(S^{\prime}, \mathbf{w})}} - \hat{v}(S, \mathbf{w})\) to update function approximator at each step
Linear TD\((0)\) converges close to the global optimum
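A minimal sketch of linear TD(0) on the same kind of toy chain; the environment and features are again illustrative assumptions:

```python
import numpy as np

n_states, gamma, alpha = 6, 1.0, 0.05
w = np.zeros(n_states)

def features(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

for _ in range(2000):
    s = n_states // 2
    while 0 < s < n_states - 1:
        s_next = s + np.random.choice([-1, 1])
        r = 1.0 if s_next == n_states - 1 else 0.0
        v_next = 0.0 if s_next in (0, n_states - 1) else features(s_next) @ w
        delta = r + gamma * v_next - features(s) @ w     # TD error
        w += alpha * delta * features(s)                 # delta_w = alpha * delta * x(S)
        s = s_next
```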
The \(\lambda\)-return \(G_t^\lambda\) is also a biased sample of true value \(v_\pi(s)\)
We can again apply supervised learning to “training data”:
\[
\langle S_1, G^{\lambda}_1 \rangle,
\langle S_2, G^{\lambda}_2 \rangle,
\ldots,
\langle S_{T-1}, G^{\lambda}_{T-1} \rangle
\]
Forward view linear TD\((\lambda)\):
\[\begin{align*}
\Delta \mathbf{w} & = \alpha ({\color{red}{G_t^\lambda}} - \hat{v}(S_t, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \\[0pt]
& = \alpha ({\color{red}{G_t^\lambda}} - \hat{v}(S_t, \mathbf{w})) \mathbf{x}(S_t)
\end{align*}\]
Backward view linear TD\((\lambda)\) using eligibility traces over features \(\textbf{x}\):
\[\begin{align*}
\delta_t & = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \\[0pt]
E_t & = \gamma \lambda E_{t-1} + \mathbf{x}(S_t)\\[0pt]
\Delta \mathbf{w} & = \alpha \delta_t E_t
\end{align*}\]
Forward and backward views are equivalent.
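A sketch of the backward-view update with an eligibility trace over features, using the same toy-chain assumptions as above:

```python
import numpy as np

n_states, gamma, lam, alpha = 6, 1.0, 0.9, 0.05
w = np.zeros(n_states)

def features(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

for _ in range(2000):
    s = n_states // 2
    E = np.zeros(n_states)                              # eligibility trace over features
    while 0 < s < n_states - 1:
        s_next = s + np.random.choice([-1, 1])
        r = 1.0 if s_next == n_states - 1 else 0.0
        v_next = 0.0 if s_next in (0, n_states - 1) else features(s_next) @ w
        delta = r + gamma * v_next - features(s) @ w    # delta_t
        E = gamma * lam * E + features(s)               # E_t = gamma * lambda * E_{t-1} + x(S_t)
        w += alpha * delta * E                          # delta_w = alpha * delta_t * E_t
        s = s_next
```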
Policy evaluation: approximate policy evaluation, \(\hat{q}(\cdot,\cdot,\mathbf{w}) \approx q_\pi\)
Policy improvement: \(\epsilon\)-greedy policy improvement
Approximate the action-value function:
\[
\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)
\]
Minimise mean-squared error between approximate action-value function, \(\hat{q}(S, A, \mathbf{w})\), and the true action-value function, \(q_\pi(S, A)\)
\[ J(\mathbf{w}) = \mathbb{E}_\pi \big[ (q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))^2 \big] \]
Stochastic gradient descent:
\[\begin{align*}
-\tfrac{1}{2}\nabla_\mathbf{w} J(\mathbf{w}) & = (q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})) \nabla_\mathbf{w} \hat{q}(S,A,\mathbf{w})\\[0pt]
\Delta \mathbf{w} & = \alpha (q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})) \nabla_\mathbf{w} \hat{q}(S, A, \mathbf{w})
\end{align*}\]
Represent state and action by feature vector:
\[
\mathbf{x}(S, A) =
\begin{pmatrix}
\mathbf{x}_1(S, A) \\
\vdots \\
\mathbf{x}_n(S, A)
\end{pmatrix}
\]
Linear action-value function:
\[
\hat{q}(S, A, \textbf{w}) = \mathbf{x}(S, A)^\top \textbf{w} = \sum_{j=1}^n \mathbf{x}_j(S, A) \textbf{w}_j
\]
Stochastic gradient descent update:
\[\begin{align*}
\nabla_\textbf{w} \hat{q}(S, A, \textbf{w}) & = \mathbf{x}(S, A)\\[0pt]
\Delta \textbf{w} & = \alpha (q_\pi(S,A) - \hat{q}(S,A,\textbf{w})) \, \textbf{x}(S, A)
\end{align*}\]
Like prediction, we must substitute a target for \(q_\pi(S,A)\)
For Monte Carlo, the target is the return \(G_t\)
\[ \Delta \textbf{w} = \alpha ({\color{red}{G_t}} - \hat{q}(S_t, A_t, \textbf{w})) \nabla_\textbf{w} \hat{q}(S_t, A_t, \textbf{w}) \]
For TD(0), the target is the TD target \(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})\)
\[ \Delta \mathbf{w} = \alpha ({\color{red}{R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})}} - \hat{q}(S_t, A_t, \mathbf{w})) \nabla_\mathbf{w} \hat{q}(S_t, A_t, \mathbf{w}) \]
For forward-view TD(\(\lambda\)), the target is the action-value \(\lambda\)-return \(q_t^\lambda\)
\[ \Delta \mathbf{w} = \alpha ({\color{red}{q_t^\lambda}} - \hat{q}(S_t, A_t, \mathbf{w})) \nabla_\mathbf{w} \hat{q}(S_t, A_t, \mathbf{w}) \]
For backward-view TD(\(\lambda\)), the equivalent update is
\[\begin{align*} \delta_t & = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})\\[0pt] E_t & = \gamma \lambda E_{t-1} + \nabla_\mathbf{w} \hat{q}(S_t, A_t, \mathbf{w})\\[0pt] \Delta \mathbf{w} & = \alpha \delta_t E_t \end{align*}\]
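A sketch of backward-view Sarsa(\(\lambda\)) with linear action-value features and an \(\epsilon\)-greedy policy; the tiny chain environment and the one-hot \((s,a)\) features below are assumptions for illustration:

```python
import numpy as np

n_states, n_actions = 6, 2
gamma, lam, alpha, eps = 1.0, 0.9, 0.1, 0.1
n_features = n_states * n_actions
w = np.zeros(n_features)

def features(s, a):
    x = np.zeros(n_features)
    x[s * n_actions + a] = 1.0            # one-hot over (state, action) pairs
    return x

def q_hat(s, a):
    return features(s, a) @ w

def epsilon_greedy(s):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    qs = np.array([q_hat(s, a) for a in range(n_actions)])
    return int(np.random.choice(np.flatnonzero(qs == qs.max())))   # break ties randomly

def step(s, a):
    s_next = max(s - 1, 0) if a == 0 else s + 1    # action 0 = left, 1 = right
    done = s_next == n_states - 1                  # rightmost state is the goal
    return s_next, (1.0 if done else 0.0), done

for _ in range(500):
    s, a = 0, epsilon_greedy(0)
    E = np.zeros(n_features)                       # eligibility trace over features
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)
        target = r if done else r + gamma * q_hat(s_next, a_next)
        delta = target - q_hat(s, a)               # Sarsa TD error
        E = gamma * lam * E + features(s, a)       # accumulate trace on x(S_t, A_t)
        w += alpha * delta * E
        s, a = s_next, a_next
```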
The reason we use \(q\) rather than \(v\) is so that we can do control: with \(\hat{q}\) we can pick actions directly, without a model of the MDP
The car isn't powerful enough to drive straight up the mountain: it must drive forwards, roll backwards, drive forwards again, and so on, until it reaches the goal
Features in vector \(\textbf{x}\) are position and velocity
The value function \(\hat{q}\) is updated at every step using Sarsa targets
We should always bootstrap (using eligibility traces), and there is a sweet spot for \({\color{blue}{\lambda}}\) (ignore the distinction between accumulating and replacing traces)
\[ \begin{array}{l l c c c} \hline \text{On/Off-Policy} & \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \text{On-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & TD(0) & \checkmark & \checkmark & \text{✗} \\ & TD(\lambda) & \checkmark & \checkmark & \text{✗} \\ \hline \text{Off-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & TD(0) & \checkmark & \text{✗} & \text{✗} \\ & TD(\lambda) & \checkmark & \text{✗} & \text{✗} \\ \hline \end{array} \]
TD does not follow the gradient of any objective function
This is why TD can diverge when off-policy or with non-linear function approximation.
Gradient TD follows the true gradient of the projected Bellman error by adding a correction term
\[ \begin{array}{l l c c c} \hline \text{On/Off-Policy} & \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \text{On-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & TD(0) & \checkmark & \checkmark & \text{✗} \\ & \textit{\color{red}{Gradient TD}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} \\ \hline \text{Off-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & TD(0) & \checkmark & \text{✗} & \text{✗} \\ & \textit{\color{red}{Gradient TD}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} \\ \hline \end{array} \]
A principled successor of the Gradient TD technique is Emphatic TD
\[ \begin{array}{l l c c c} \hline \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \textit{Monte-Carlo Control} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Sarsa} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Q-Learning} & \checkmark & \text{✗} & \text{✗} \\ \textit{\color{red}{Gradient Q-Learning}} & \checkmark & \checkmark & \text{✗} \\ \hline \end{array} \] (\(\checkmark\)) = chatters (oscillates) around near-optimal value function.
Gradient descent is simple and appealing, but it is not sample efficient: each experience is used for a single update and then discarded
Batch methods instead seek the best-fitting value function for the agent's accumulated experience (“training data”)
Given value function approximation \(\hat{v}(s, \mathbf{w}) \approx v_\pi(s)\), and
experience \(D\) consisting of \(\langle \mathit{state, value} \rangle\) pairs
\[ D = \{ \langle s_1, v^\pi_1 \rangle, \ldots, \langle s_T, v^\pi_T \rangle \} \]
Which parameters \(\mathbf{w}\) give the best fitting value function \(\hat{v}(s,\mathbf{w})\) for the entire data set?
Least squares algorithms find parameter vector \(\mathbf{w}\) by minimising sum-squared error between \(\hat{v}(s_t,\mathbf{w})\) and target values \(v^{\pi}_t\)
\[\begin{align*} LS(\mathbf{w}) & = \sum_{t=1}^T (v^\pi_t - \hat{v}(s_t, \mathbf{w}))^2\\[0pt] & = \mathbb{E}_D \big[ (v^\pi - \hat{v}(s,\mathbf{w}))^2 \big] \end{align*}\]
Given experience consisting of \(\langle \mathit{state, value} \rangle\) pairs that we store
\[ D = \{ \langle s_1, v^\pi_1 \rangle, \ldots, \langle s_T, v^\pi_T \rangle \} \]
Stochastic Gradient Descent with Experience Replay
Repeat:
1. Sample a \(\langle \mathit{state, value} \rangle\) pair from the stored experience
\[ \langle s, v^\pi\rangle \sim D \]
2. Apply a stochastic gradient descent update
\[ \Delta \mathbf{w} = \alpha (v^\pi - \hat{v}(s,\mathbf{w})) \nabla_\mathbf{w} \hat{v}(s,\mathbf{w}) \]
Converges to least squares solution: \(\mathbf{w}^\pi = \arg\min_\mathbf{w} LS(\mathbf{w})\)
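A minimal sketch of SGD with experience replay over a stored data set of \(\langle s, v^\pi\rangle\) pairs; the data set and feature map here are made-up placeholders:

```python
import numpy as np

n_features, alpha = 4, 0.05
w = np.zeros(n_features)

def features(s):
    x = np.zeros(n_features)
    x[s % n_features] = 1.0        # placeholder feature map
    return x

# Stored experience D = {<s_t, v_t^pi>}; the targets here are placeholders.
D = [(s, 0.5 * s) for s in range(20)]

for _ in range(5000):
    s, v_target = D[np.random.randint(len(D))]                    # sample <s, v^pi> ~ D
    w += alpha * (v_target - features(s) @ w) * features(s)       # SGD step on that sample
```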
DQN uses experience replay and fixed Q-targets (off-policy)
\[ L_i(\mathbf{w}_i) =\underbrace{\mathbb{E}_{(s,a,r,s')\sim \mathcal{D}_i}}_{\text{sampled from}\;D_i} \Big[ \big( \underbrace{r+\gamma \max_{a'} Q(s',a';\mathbf{w}^{-}_i)}_{\text{TD target }} - \underbrace{Q(s,a;\mathbf{w}_i)}_{\text{Q estimate}} \big)^2 \Big] \]
Experience replay decorrelates trajectories
Fixed Q-targets: Keep two different networks
use a “frozen” copy of the network to compute the TD targets
refresh the frozen copy from the current network after a fixed number of steps (e.g. every 1000 steps)
this helps stabilise learning with non-linear function approximation
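A sketch of the fixed-target idea using a linear Q-function: TD targets are computed from a frozen parameter copy \(\mathbf{w}^-\) that is only refreshed periodically. The replay buffer below is filled with random placeholder transitions, so the learned values are not meaningful; only the target-network mechanics are being illustrated.

```python
import numpy as np
import random

n_features, n_actions = 8, 3
gamma, alpha, sync_every = 0.99, 0.01, 1000

w = np.zeros((n_actions, n_features))      # online parameters
w_frozen = w.copy()                        # frozen parameters used for TD targets

def q_values(s_feat, weights):
    return weights @ s_feat                # Q(s, a) for every action a

# Hypothetical replay buffer of (features(s), a, r, features(s'), done) tuples.
replay = [(np.random.rand(n_features), np.random.randint(n_actions),
           np.random.randn(), np.random.rand(n_features), False)
          for _ in range(500)]

for t in range(1, 5001):
    batch = random.sample(replay, 32)      # decorrelate by sampling from the buffer
    for s_feat, a, r, s_next_feat, done in batch:
        # TD target from the frozen copy; Q estimate from the online parameters.
        target = r if done else r + gamma * np.max(q_values(s_next_feat, w_frozen))
        td_error = target - q_values(s_feat, w)[a]
        w[a] += alpha * td_error * s_feat
    if t % sync_every == 0:
        w_frozen = w.copy()                # refresh the frozen copy periodically
```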
End-to-end learning of values \({\color{red}{Q(s,a)}}\) from pixels \({\color{blue}{s}}\).
Input: stack of raw pixels (last 4 frames)
Output: \(Q(s,a)\) for 18 joystick/button positions
Reward: change in score for that step
Network architecture + hyper-parameters fixed across all games
| Game | Replay + Fixed-Q | Replay + Q-learning | No replay + Fixed-Q | No replay + Q-learning |
|---|---|---|---|---|
| Breakout | 316.81 | 240.73 | 10.16 | 3.17 |
| Enduro | 1006.30 | 831.25 | 141.89 | 29.10 |
| River Raid | 7446.62 | 4102.81 | 2867.66 | 1453.02 |
| Seaquest | 2894.40 | 822.55 | 1003.00 | 275.81 |
| Space Invaders | 1088.94 | 826.33 | 373.22 | 301.99 |
Experience replay finds the least squares solution, but it may take many iterations
Using the special case of linear value function approximation
\[ \hat{v}(s,\mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} \]
We can solve for the least squares solution directly in closed form
At the minimum of \(LS(\mathbf{w})\), the expected update must be zero:
\[\begin{align*} \mathbb{E}_{\mathcal{D}} \left[ \Delta \mathbf{w} \right] &= 0\;\;\; \text{(want update zero across data set)} \\[0pt] \alpha \sum_{t=1}^{T} \mathbf{x}(s_t) \bigl( v_t^{\pi} - \mathbf{x}(s_t)^{\top} \mathbf{w} \bigr) &= 0\;\;\; \text{(unwrapping expected updates)}\\[0pt] \sum_{t=1}^{T} \mathbf{x}(s_t) v_t^{\pi} &= \sum_{t=1}^{T} \mathbf{x}(s_t) \mathbf{x}(s_t)^{\top} \mathbf{w} \\[0pt] \mathbf{w} &= \left( \sum_{t=1}^{T} \mathbf{x}(s_t) \mathbf{x}(s_t)^{\top} \right)^{-1} \sum_{t=1}^{T} \mathbf{x}(s_t) v_t^{\pi} \end{align*}\]
Compute time: \(O(N^3)\) for \(N\) features (direct),
or incremental \(O(N^2)\) using Sherman–Morrison.
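A sketch of the direct closed-form solution in NumPy; the feature matrix and oracle targets are placeholder data for illustration:

```python
import numpy as np

# Placeholder batch: T states with feature vectors x(s_t) and oracle targets v_t^pi.
T, n = 100, 5
X = np.random.randn(T, n)                       # row t is x(s_t)^T
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # hypothetical "true" parameters
v = X @ true_w + 0.01 * np.random.randn(T)      # noisy targets v_t^pi

A = X.T @ X                 # sum_t x(s_t) x(s_t)^T
b = X.T @ v                 # sum_t x(s_t) v_t^pi
w = np.linalg.solve(A, b)   # direct solve, O(N^3) in the number of features
```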
We do not know true values \(v_t^\pi\)
\[\begin{align*} {\color{blue}{\text{LSMC}}} \;\;\;\;&\text{Least Squares Monte-Carlo uses return}\\[0pt] & v_t^\pi \approx {\color{red}{G_t}} \\[0pt] {\color{blue}{\text {LSTD}}} \;\;\;\; & \text{Least Squares Temporal-Difference uses TD target}\\[0pt] & v_t^\pi \approx {\color{red}{R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})}}\\[0pt] {\color{blue}{\text{LSTD}}}{\color{blue}{(\lambda)}}\;\;\;\; & \text{Least Squares TD}(\lambda)\text{ uses}\;\lambda\text{-return} \\[0pt] & v_t^\pi \approx {\color{red}{G_t^\lambda}} \end{align*}\]
In each case we can solve directly for the fixed point of MC / TD / TD(\(\lambda\))
\[\begin{align*} {\color{red}{\text{LSMC}}} \;\;\;\; 0 & = \sum_{t=1}^T \alpha \big(G_t - \hat{v}(S_t,\mathbf{w})\big)\, \textbf{x}(S_t),\qquad\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T \textbf{x}(S_t) \textbf{x}(S_t)^\top \right)^{-1} \left(\sum_{t=1}^T \textbf{x}(S_t) G_t \right)\\[0pt] {\color{red}{\text{LSTD}}} \;\;\;\; 0 & = \sum_{t=1}^T \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\, \textbf{x}(S_t)\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T \textbf{x}(S_t)\big(\textbf{x}(S_t) - \gamma \textbf{x}(S_{t+1})\big)^\top \right)^{-1} \left(\sum_{t=1}^T \textbf{x}(S_t) R_{t+1}\right)\\[0pt] {\color{red}{\text{LSTD}}}{\color{red}{(\lambda)}}\;\;\;\; 0 & = \sum_{t=1}^T \alpha\, \delta_t\, E_t,\qquad\\[0pt] \mathbf{w} & = \left(\sum_{t=1}^T E_t \big(\textbf{x}(S_t) - \gamma \textbf{x}(S_{t+1})\big)^\top \right)^{-1} \left(\sum_{t=1}^T E_t R_{t+1}\right) \end{align*}\]
\[ \begin{array}{l l c c c} \hline \text{On/Off-Policy} & \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \text{On-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & {\color{red}{\textit{LSMC}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ & \textit{TD} & \checkmark & \checkmark & \text{✗} \\ & {\color{red}{\textit{LSTD}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ \hline \text{Off-Policy} & \textit{MC} & \checkmark & \checkmark & \checkmark \\ & {\color{red}{\textit{LSMC}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ & \textit{TD} & \checkmark & \text{✗} & \text{✗} \\ & {\color{red}{\textit{LSTD}}} & {\color{red}{\checkmark}} & {\color{red}{\checkmark}} & {\color{red}{-}} \\ \hline \end{array} \]
Policy evaluation: policy evaluation by least squares Q-learning
Policy improvement: greedy policy improvement
Approximate the action-value function \(q_\pi(s, a)\)
\[ \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a)^\top \mathbf{w} \;\approx\; q_\pi(s, a) \]
Minimise least squares error between \(\hat{q}(s, a, \mathbf{w})\) and \(q_\pi(s, a)\)
from experience generated using policy \(\pi\)
consisting of \(\langle (state, action), value \rangle\) pairs
\[ \mathcal{D} = \Bigl\{ \langle (s_1, a_1), v_1^\pi \rangle,\; \langle (s_2, a_2), v_2^\pi \rangle,\; \ldots,\; \langle (s_T, a_T), v_T^\pi \rangle \Bigr\} \]
For policy evaluation, we want to efficiently use all experience
For control, we also want to improve the policy
The experience is generated from many policies
So to evaluate \(q_\pi(S,A)\) we must learn off-policy
We use the same idea as Q-learning:
Consider the following linear Q-learning update
\[\begin{align*} \delta &= R_{t+1} + \gamma \hat{q}(S_{t+1}, {\color{red}{\pi(S_{t+1})}}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \\[0.5em] \Delta \mathbf{w} &= \alpha \, \delta \, \mathbf{x}(S_t, A_t) \end{align*}\]
LSTDQ algorithm: solve for the parameters that make the total update zero
\[\begin{align*} 0 &= \sum_{t=1}^T \alpha \Big( R_{t+1} + \gamma \hat{q}(S_{t+1}, {\color{red}{\pi(S_{t+1})}}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \Big)\mathbf{x}(S_t, A_t) \\[1em] \mathbf{w} &= \Bigg( \sum_{t=1}^T \mathbf{x}(S_t, A_t)\big(\mathbf{x}(S_t, A_t) - \gamma \mathbf{x}(S_{t+1}, \pi(S_{t+1}))\big)^\top \Bigg)^{-1} \sum_{t=1}^T \mathbf{x}(S_t, A_t) R_{t+1} \end{align*}\]
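A sketch of the LSTDQ solve over a stored batch of transitions; the data structures, the `features` and `policy` callables, and the small ridge term are assumptions added for illustration:

```python
import numpy as np

def lstdq(transitions, features, policy, n_features, gamma=0.99, reg=1e-6):
    """Solve for w given a batch of (s, a, r, s') transitions and a policy pi.

    `features(s, a)` returns x(s, a); `policy(s)` returns pi(s).
    A small ridge term `reg` keeps the matrix invertible on limited data.
    """
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next in transitions:
        x = features(s, a)
        x_next = features(s_next, policy(s_next))       # x(S', pi(S'))
        A += np.outer(x, x - gamma * x_next)            # sum_t x (x - gamma x')^T
        b += r * x                                      # sum_t x R_{t+1}
    return np.linalg.solve(A, b)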
The following pseudo-code uses LSTDQ for policy evaluation
It repeatedly re-evaluates experience \(\mathcal{D}\) with different policies
Least Squares Policy Iteration Algorithm (LSTDQ variant)
\[ \begin{aligned} &\text{function LSPI-TD}(\mathcal{D}, \pi_0) \\ &\quad \pi^{\prime} \leftarrow \pi_0 \\ &\quad \text{repeat} \\ &\qquad \pi \leftarrow \pi^{\prime} \\ &\qquad Q \leftarrow \text{LSTDQ}(\pi,\mathcal{D}) \\ &\qquad \text{for all } s \in S \text{ do} \\ &\qquad\quad \pi^{\prime}(s) \leftarrow \arg\max_{a \in A} Q(s,a) \\ &\qquad \text{end for} \\ &\quad \text{until } (\pi \approx \pi^{\prime}) \\ &\quad \text{return } \pi \\ &\text{end function} \end{aligned} \]
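A Python sketch of the same loop, with the LSTDQ solve inlined and the policy represented implicitly as the greedy policy with respect to the current weights; the function signature, `features`, `actions`, and the ridge term are hypothetical conveniences, not part of the pseudo-code above:

```python
import numpy as np

def lspi(transitions, features, actions, n_features, gamma=0.99,
         n_iters=20, tol=1e-4):
    """LSPI sketch: repeat LSTDQ on the fixed batch, then improve greedily."""
    w = np.zeros(n_features)

    def greedy(s):
        # pi'(s) = argmax_a q_hat(s, a, w) under the current parameters
        return max(actions, key=lambda a: features(s, a) @ w)

    for _ in range(n_iters):
        A = 1e-6 * np.eye(n_features)                # small ridge term for invertibility
        b = np.zeros(n_features)
        for s, a, r, s_next in transitions:          # LSTDQ under the current greedy policy
            x = features(s, a)
            x_next = features(s_next, greedy(s_next))
            A += np.outer(x, x - gamma * x_next)
            b += r * x
        w_new = np.linalg.solve(A, b)
        if np.max(np.abs(w_new - w)) < tol:          # stop when pi is (approximately) stable
            return w_new
        w = w_new
    return w
```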
\[ \begin{array}{l l c c c} \hline \text{Algorithm} & \text{Table Lookup} & \text{Linear} & \text{Non-Linear} \\ \hline \textit{Monte-Carlo Control} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Sarsa} & \checkmark & (\checkmark) & \text{✗} \\ \textit{Q-Learning} & \checkmark & \text{✗} & \text{✗} \\ \textit{\color{red}{LSPI}} & \checkmark & (\checkmark) & - \\ \hline \end{array} \]
(\(\checkmark\)) = chatters around near-optimal value function.
Consider the 50-state version of this chain-walk problem (a larger replica of the diagram shown)
Reward \(+1\) in states \(10\) and \(41\), \(0\) elsewhere
Optimal policy: R (\(1-9\)), L (\(10-25\)), R (\(26-41\)), L (\(42-50\))
Features: \(10\) evenly spaced Gaussians (\(\sigma = 4\)) for each action
Experience: \(10,000\) steps from random walk policy
Plots show LSPI iterations on a 50-state chain with a radial basis function approximator.
The colours represent two different actions of going left (blue) or right (red)
The plots show the true value function (dashed) and the learned approximate value function
You can see it converges to the optimal policy after only \(7\) iterations.
Plots of policy improvement over LSPI iterations (using same 50-state chain example).