A “solution” to an MDP, i.e. an RL agent, may include one or more of these components, depending on the method used:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: \(a = \pi(s)\)
Stochastic policy: \(\pi(a|s) = \mathbb{P}[A_t = a|S_t = s]\)

A value function \(V(s)\) is a prediction of future reward attainable starting from state \(s\). \[ \begin{align} V(s) &= \max_{a \in A(s)} Q(s, a) \\ &= \max_a \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots | S_t = s, A_t = a] \end{align} \]
Idea: lets us evaluate a state as good/bad (according to long term reward potential), and choose actions likely to take us to “better” states.
A model predicts what the environment will do next
\(\mathcal{P}\) predicts the probability of the next state
\(\mathcal{R}\) predicts the expectation of the next reward, e.g.
\[ \mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a] \]
\[ \mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a] \]
More “expensive” than learning value function or policy, which compress this info into “what I actually care about” and “what I should do”.
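To make \(\mathcal{P}\) and \(\mathcal{R}\) concrete, here is a hypothetical two-state, two-action MDP stored as plain tables (the state/action names and numbers are invented for illustration); rewards are stored in the expected-reward form \(\mathcal{R}^a_s\) above:

```python
# Hypothetical 2-state, 2-action MDP stored as tables.
# P[(s, a)] maps each next state s' to its probability P_a(s' | s);
# R[(s, a)] is the expected immediate reward R^a_s.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"):   -1.0,
    ("s1", "stay"): 2.0,
    ("s1", "go"):   0.0,
}
```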
Value Based: value function; no explicit policy (the policy is implicit, e.g. act greedily w.r.t. the value function)
Policy Based: policy; no value function
Actor Critic: both policy and value function
Model Free: policy and/or value function; no model
Model Based: policy and/or value function; a model
Value Iteration uses dynamic programming to recursively calculate a value function from an MDP.
\[ V(s) = \overbrace{\max_{a \in A(s)}}^{\text{best action from $s$}} \overbrace{\underbrace{\sum_{s' \in S}}_{\text{for every state}} P_a(s' \mid s) [\underbrace{R(s,a,s')}_{\text{(est. of) immediate reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{V(s')}_{\text{value of } s'}]}^{\text{expected reward $Q(s, a)$ of executing action $a$ in state $s$}} \]
Note: value iteration is model-based because it explicitly uses probability and reward distributions in calculation.
\(V_0(s) = 0\) for all states \(s\)
\(V_{i+1}(s) = \max\limits_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s) [R(s, a, s') + \gamma V_i(s')]\)
\[ V_1(s) = \max\limits_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s) [R(s, a, s') + \gamma \underbrace{V_0(s')}_{0}] \]
[Figure: state values \(V_0\) and \(V_1\)]
\[ V_2(s) = \max\limits_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s) [R(s, a, s') + \gamma V_1(s')] \]
[Figure: state values \(V_1\) and \(V_2\)]
Once the value function stabilizes, we extract a deterministic policy as
\[ \pi(s) = \text{argmax}_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s)\ [R(s,a,s') + \gamma\ V(s')] \]
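A minimal sketch of value iteration plus policy extraction in Python, assuming tabular `P` and `R` in the format of the toy model above (with `R` in the expected-reward form \(\mathcal{R}^a_s\), so the \(s'\) argument drops out) and an illustrative convergence threshold `theta`:

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Run Bellman backups until the largest update falls below theta,
    then extract the greedy (deterministic) policy."""
    V = {s: 0.0 for s in states}                 # V_0(s) = 0 for all s

    def q(s, a):
        # Q(s, a) = sum_{s'} P_a(s' | s) [R(s, a) + gamma * V(s')]
        return sum(p * (R[(s, a)] + gamma * V[s2])
                   for s2, p in P[(s, a)].items())

    while True:
        delta = 0.0
        for s in states:
            best = max(q(s, a) for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # value function has stabilized
            break

    # pi(s) = argmax_a Q(s, a)
    pi = {s: max(actions(s), key=lambda a: q(s, a)) for s in states}
    return V, pi

# Example with the toy tables defined earlier:
# V, pi = value_iteration(["s0", "s1"], lambda s: ["stay", "go"], P, R)
```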
Replace real world with the agent’s (simulated) model of the environment
Supports rollouts (lookaheads) under imagined actions to reason about what the value function will be, without further environment interaction
Goal: estimate model \(\mathcal{M}_\eta\) from experience \(\{S_1, A_1, R_2, \ldots, S_T\}\)
This is a supervised learning problem
\[ \begin{aligned} S_1, A_1 &\;\to\; R_2, S_2 \\ S_2, A_2 &\;\to\; R_3, S_3 \\ &\;\vdots \\ S_{T-1}, A_{T-1} &\;\to\; R_T, S_T \end{aligned} \]
Learning \(s,a \to r\) is a regression problem
Learning \(s,a \to s'\) is a density estimation problem
Pick loss function, e.g. mean-squared error, KL divergence, …
Find parameters \(\eta\) that minimise empirical loss
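A sketch of both halves under these assumptions (linear features for the reward regression, a tabular count-based estimate for the transition density; the function names and feature encoding are invented for illustration):

```python
import numpy as np
from collections import Counter, defaultdict

def fit_reward_model(X, r):
    """Regression for s,a -> r: least-squares fit of eta under an MSE loss.
    X stacks (state, action) feature vectors, one row per transition;
    r holds the observed rewards. Predict with x @ eta."""
    eta, *_ = np.linalg.lstsq(X, r, rcond=None)
    return eta

def fit_transition_model(transitions):
    """Density estimation for s,a -> s': maximum-likelihood (count-based)
    estimate of P(s' | s, a) from observed (s, a, s') triples."""
    counts = defaultdict(Counter)
    for s, a, s2 in transitions:
        counts[(s, a)][s2] += 1
    return {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
            for sa, c in counts.items()}
```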
Monte Carlo Tree Search (MCTS) is based on representing MDPs as ExpectiMax trees that unfold as agents take actions.
Note: state nodes are full histories
Simultaneously “build” tree and maintain estimates of \(Q(s,a)\) by simulating trajectories and observing rewards
Focus search on promising (as indicated by \(Q\)) states when the state space is too large for value iteration
Maintain visit count \(N(s)\) for states to inform strategy
Framework for selection strategy balancing exploration and exploitation
Intuition: for every state \(s\), each action \(a\) = slot machine with initially unknown payoff distribution \(X_{a}\).
The “value” \(Q(s, a)\) of \(a\) corresponds to (current estimate of) \(X_{a}\).
\[ Q_t(s, a) = \frac{R(s, a, t)}{N(s, a, t)} \]
where \(N(s, a, t) =\) the number of times action \(a\) has been performed in state \(s\) up to time \(t\), and \(R(s, a, t)\) is the total reward earned from those executions as of time \(t\).
Input: Multi-armed bandit problem \(M = \langle a_i \sim X_i \rangle\)
Output: \(Q\)-values \(Q(a_i)\)
\(Q(a_i) \leftarrow 0\) for all arms \(a_i\)
\(N(a_i) \leftarrow 0\) for all arms \(a_i\)
\(t \leftarrow 1\)
while \(t < T\) do
\(a \leftarrow\) select(\(t\)) (choose an arm according to the selection strategy)
execute arm \(a\) for round \(t\) and observe reward \(r(a, t)\)
\(N(a) \leftarrow N(a) + 1\)
\(Q(a) \leftarrow Q(a) + \frac{1}{N(a)}[r(a, t) - Q(a)]\)
\(t \leftarrow t+1\)
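The same loop as runnable Python; `arms`, the `pull` function (the environment), and the pluggable `select` strategy are hypothetical stand-ins:

```python
import random

def run_bandit(arms, pull, select, T):
    """Generic bandit loop; `select` is any strategy from those below."""
    Q = {a: 0.0 for a in arms}           # Q(a_i) <- 0 for all arms
    N = {a: 0 for a in arms}             # N(a_i) <- 0 for all arms
    for t in range(1, T):                # while t < T
        a = select(Q, N, t)              # choose arm per strategy
        r = pull(a)                      # execute arm a, observe reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental mean update
    return Q, N
```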
Perform random actions for the first \(\epsilon T\) rounds, where \(T =\) total number of rounds, then exploit the arm with maximum \(Q\)-value for the remainder.
For each strategy, consider:
What do the parameters mean for us as modelers?
What does it mean for these numbers to be bigger/smaller?
What are the benefits and pitfalls of each strategy?
In what situations might we choose a particular strategy with particular parameter values?
with probability \(1 - \epsilon\), choose the arm with maximum \(Q\)-value (exploit)
with probability \(\epsilon\), choose arm uniformly at random (explore)
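A sketch of this rule as a `select` strategy for the loop above:

```python
import random

def epsilon_greedy(epsilon):
    def select(Q, N, t):
        if random.random() < epsilon:        # explore
            return random.choice(list(Q))
        return max(Q, key=Q.get)             # exploit: argmax_a Q(a)
    return select
```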

Adds a variable \(\alpha\) = “decay”
Like \(\epsilon\)-greedy, but explore probability is \(\epsilon \cdot \alpha^t\) at timestep \(t\)
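The corresponding sketch, with the decay applied to the exploration probability:

```python
import random

def decaying_epsilon_greedy(epsilon, alpha):
    """Explore with probability epsilon * alpha**t; alpha < 1 shifts
    the balance from exploration toward exploitation over time."""
    def select(Q, N, t):
        if random.random() < epsilon * alpha ** t:
            return random.choice(list(Q))
        return max(Q, key=Q.get)
    return select
```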

\[ \text{argmax}_{a}\Bigl( \overbrace{Q_t(a)}^{\text{exploit best action}} + \underbrace{\sqrt{\frac{2 \ln t}{N_t(a)}}}_{\text{explore to reduce ``UCB'' on alternatives}}\Bigr) \]
Based on Hoeffding’s inequality in probability theory; moderates “greediness” based on the likely accuracy of the estimate of \(Q(a)\) given available data
UCB “exploration bonus” term grows slowly (logarithmically) with \(t\), but shrinks with larger sample size \(N_t(a)\).
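A sketch of UCB1 as a `select` strategy (pulling each arm once first, since the bonus is undefined at \(N_t(a) = 0\)):

```python
import math

def ucb1(Q, N, t):
    # Pull each arm once first: its bonus is effectively infinite.
    for a in N:
        if N[a] == 0:
            return a
    # argmax_a [ Q(a) + sqrt(2 ln t / N(a)) ]
    return max(Q, key=lambda a: Q[a] + math.sqrt(2 * math.log(t) / N[a]))
```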
\[ P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b\in A} e^{Q(b)/\tau}} \]
Approximates ratio of (exponential function of) current expected return for \(a\) compared to alternatives
\(\tau > 0\) is the temperature parameter. It dictates the influence of the current value of \(Q\) on the decision.
lower \(\tau\) = more exploitation: small differences in \(Q\) get amplified in exponent, approaching argmax as \(\tau \to 0^+\).
higher \(\tau\) = more exploration: approaches uniform random as \(\tau \to \infty\).
Softmax works well under drift, i.e. when underlying reward distributions change over time
Graph shows softmax performance for a “restless bandit” where reward distributions abruptly change at \(t = 250\).
[Figure: softmax performance with no drift vs. drift]
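A sketch of softmax selection as a `select` strategy (shifting by \(\max_a Q(a)\) before exponentiating is a standard numerical-stability trick, not part of the formula):

```python
import math
import random

def softmax(tau):
    def select(Q, N, t):
        m = max(Q.values())                  # shift for numerical stability
        arms = list(Q)
        # P(a) proportional to exp(Q(a) / tau)
        weights = [math.exp((Q[a] - m) / tau) for a in arms]
        return random.choices(arms, weights=weights)[0]
    return select
```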
MCTS commonly uses a variant of UCB for its selection rule:
\[ \text{argmax}_{a \in A(s)} \Bigl( Q(s, a) + \alpha\sqrt{\frac{2 \ln(N(s))}{N(s, a)}} \Bigr) \]
where \(N(s)\) is the number of times a node has been visited, \(N(s, a)\) is the number of times action \(a\) has been chosen in that state, and \(\alpha\) is an exploration parameter.
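A sketch of this selection rule, assuming a hypothetical tree-node object that holds the visit counts and \(Q\) estimates:

```python
import math

def uct_select(node, alpha):
    """Hypothetical node with fields: actions, N (visit count of the node),
    N_sa[a] (per-action counts), and Q[a] (value estimates)."""
    # Expand untried actions before comparing bounds (bonus is undefined at 0).
    untried = [a for a in node.actions if node.N_sa[a] == 0]
    if untried:
        return untried[0]
    # argmax_a [ Q(s,a) + alpha * sqrt(2 ln N(s) / N(s,a)) ]
    return max(node.actions,
               key=lambda a: node.Q[a]
                   + alpha * math.sqrt(2 * math.log(node.N) / node.N_sa[a]))
```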