06 Model-based RL (Value Iteration & MCTS)


Solving MDPs

MDP Solutions

A “solution” to an MDP (i.e., an RL agent) may include one or more of these components, depending on the method used:

  • Policy: agent’s behaviour function

  • Value function: how good is each state and/or action

  • Model: agent’s representation of the environment

Policy

A policy is the agent’s behaviour

It is a map from state to action, e.g.

  • Deterministic policy: \(a = \pi(s)\)

  • Stochastic policy: \(\pi(a|s) = \mathbb{P}[A_t = a|S_t = s]\)

  • Why might a stochastic policy be desirable?

Example: Gridworld

  • states = cells
  • actions = up, down, left, right, with a 10% chance of a \(90^\circ\) error
  • reward \(r(s) = 0\) for non-terminal states
  • discount factor \(\gamma = 0.9\)

Value Function

A value function \(V(s)\) is a prediction of future reward attainable starting from state \(s\). \[ \begin{align} V(s) &= \max_{a \in A(s)} Q(s, a) \\ &= \max_a \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots | S_t = s, A_t = a] \end{align} \]

Idea: lets us evaluate a state as good/bad (according to long term reward potential), and choose actions likely to take us to “better” states.

Model

A model predicts what the environment will do next

\(\mathcal{P}\) predicts the probability of the next state

\(\mathcal{R}\) predicts the expectation of the next reward, e.g.

\[ \mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a] \]

\[ \mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a] \]

Learning a model is more “expensive” than learning a value function or policy, which compress this information into “what I actually care about” and “what I should do”.

Categorizing RL methods

Value Based

  • No Policy (Implicit)

  • Value Function

Policy Based

  • Policy

  • No Value Function

Actor Critic

  • Policy
  • Value Function

Model Free

  • Policy and/or Value Function

  • No Model

Model Based

  • Policy and/or Value Function

  • Model

Value Iteration

The Bellman Equation

Value Iteration uses dynamic programming to recursively calculate a value function from an MDP.

\[ V(s) = \overbrace{\max_{a \in A(s)}}^{\text{best action from $s$}} \overbrace{\underbrace{\sum_{s' \in S}}_{\text{for every state}} P_a(s' \mid s) [\underbrace{R(s,a,s')}_{\text{(est. of) immediate reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{V(s')}_{\text{value of } s'}]}^{\text{expected reward $Q(s, a)$ of executing action $a$ in state $s$}} \]

Note: value iteration is model-based because it explicitly uses probability and reward distributions in calculation.

Calculating \(V(s)\) by recursion

\(V_0(s) = 0\) for all states \(s\)

\(V_{i+1}(s) = \max\limits_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s) [R(s, a, s') + \gamma V_i(s')]\)
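The recursion above can be sketched in Python for a generic tabular MDP. The representation is an assumption for illustration: `P[(s, a)]` is a dict mapping successor states to probabilities, and `R(s, a, s2)` returns the immediate reward.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration (hypothetical P/R representation).

    P[(s, a)] is a dict {s_next: probability}; R(s, a, s_next) is the reward.
    Returns the converged value function V as a dict.
    """
    V = {s: 0.0 for s in states}              # V_0(s) = 0 for all s
    while True:
        delta = 0.0
        V_new = {}
        for s in states:
            # V_{i+1}(s) = max_a sum_{s'} P_a(s'|s) [R(s,a,s') + gamma * V_i(s')]
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions(s)
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < tol:                       # stop once values stabilize
            return V
```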

Example: GridWorld

\[ V_1(s) = \max\limits_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s) [R(s) + \gamma \underbrace{V_0(s')}_{0}] \]

(Figure: gridworld values before (\(V_0\)) and after (\(V_1\)) the first iteration.)

Example continued

\[ V_2(s) = \max\limits_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s) [R(s) + \gamma V_1(s')] \]

(Figure: gridworld values \(V_1\) and \(V_2\) after the second iteration.)

Policy Extraction

Once the value function stabilizes, we extract a deterministic policy as

\[ \pi(s) = \text{argmax}_{a \in A(s)} \sum\limits_{s' \in S} P_a(s' \mid s)\ [R(s,a,s') + \gamma\ V(s')] \]
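A sketch of greedy policy extraction in Python, assuming the same tabular representation as in the value-iteration update (the `P` dict and `R` function are hypothetical stand-ins for the model):

```python
def extract_policy(states, actions, P, R, V, gamma=0.9):
    """pi(s) = argmax_a sum_{s'} P_a(s'|s) [R(s,a,s') + gamma * V(s')].

    P[(s, a)] maps successor states to probabilities; V is the
    converged value function as a dict.
    """
    return {
        s: max(actions(s),
               key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                 for s2, p in P[(s, a)].items()))
        for s in states
    }
```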

Monte Carlo Tree Search (MCTS)

Model-based methods

  • Replace real world with the agent’s (simulated) model of the environment

  • Supports rollouts (lookaheads) under imagined actions to reason about what value function will be, without further environment interaction

Model Learning

Goal: estimate model \(\mathcal{M}_\eta\) from experience \(\{S_1, A_1, R_2, \ldots, S_T\}\)

This is a supervised learning problem

\[ \begin{aligned} S_1, A_1 &\;\to\; R_2, S_2 \\ S_2, A_2 &\;\to\; R_3, S_3 \\ &\;\vdots \\ S_{T-1}, A_{T-1} &\;\to\; R_T, S_T \end{aligned} \]

  • Learning \(s,a \to r\) is a regression problem

  • Learning \(s,a \to s'\) is a density estimation problem

  • Pick loss function, e.g. mean-squared error, KL divergence, …

  • Find parameters \(\eta\) that minimise empirical loss
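For a tabular MDP, one simple instance of this supervised problem is maximum-likelihood counting: empirical frequencies estimate \(\mathcal{P}\) and the empirical mean reward estimates \(\mathcal{R}\). A minimal sketch (the function and variable names are illustrative):

```python
from collections import defaultdict

def fit_tabular_model(transitions):
    """Maximum-likelihood tabular model from (s, a, r, s') tuples.

    Returns P_hat[(s, a)] = {s': empirical probability} and
    R_hat[(s, a)] = mean observed reward (the regression target).
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    totals = defaultdict(int)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a)] += r
        totals[(s, a)] += 1
    P_hat = {sa: {s2: n / totals[sa] for s2, n in succ.items()}
             for sa, succ in counts.items()}
    R_hat = {sa: reward_sums[sa] / totals[sa] for sa in totals}
    return P_hat, R_hat
```

For large state spaces, the same targets would instead be fit with a parametric model \(\mathcal{M}_\eta\) minimising the chosen loss.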

ExpectiMax Trees

Monte Carlo Tree Search (MCTS) is based on representing MDPs as ExpectiMax trees that unfold as agents take actions.

Note: state nodes are full histories

MCTS

  • Simultaneously “build” tree and maintain estimates of \(Q(s,a)\) by simulating trajectories and observing rewards

  • Focus search on promising (as indicated by \(Q\)) states when state space is too large for value iteration

  • Maintain visit count \(N(s)\) for states to inform strategy

MCTS Algorithm

  1. Starting from the root node, successively select child nodes until a leaf is reached.
  2. Expand children by choosing an action and creating new nodes based on possible outcomes.
  3. Simulate a trajectory from one of these nodes by choosing actions either at random or with a heuristic.
  4. Backpropagate the total reward to prior nodes using the Bellman equation to update \(Q(s, a)\).
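The four steps above could be sketched as follows for a deterministic model. This is a minimal illustration, not a production implementation: `actions`, `step`, and `is_terminal` are hypothetical stand-ins for the agent's learned model, and selection uses the UCT rule covered later in this module.

```python
import math
import random

def mcts(root, actions, step, is_terminal, n_iters=500, gamma=0.9, c=1.4):
    """Minimal MCTS sketch for a deterministic model (illustrative only).

    Q[(s, a)] and N[(s, a)] hold value estimates and visit counts;
    Ns[s] counts state visits. Returns the best action at `root`.
    """
    Q, N, Ns = {}, {}, {}

    def select(s):
        # 1. Selection: unvisited actions first, else max Q + exploration bonus
        def uct(a):
            if N.get((s, a), 0) == 0:
                return float('inf')
            return Q[(s, a)] + c * math.sqrt(math.log(Ns[s]) / N[(s, a)])
        return max(actions(s), key=uct)

    def rollout(s, depth=20):
        # 3. Simulation: random actions from the expanded node
        total, disc = 0.0, 1.0
        for _ in range(depth):
            if is_terminal(s):
                break
            s, r = step(s, random.choice(actions(s)))
            total += disc * r
            disc *= gamma
        return total

    def simulate(s, depth=20):
        if depth == 0 or is_terminal(s):
            return 0.0
        if s not in Ns:                       # 2. Expansion: first visit to s
            Ns[s] = 0
            return rollout(s, depth)
        a = select(s)
        s2, r = step(s, a)
        ret = r + gamma * simulate(s2, depth - 1)
        # 4. Backpropagation: incremental mean update of Q(s, a)
        Ns[s] += 1
        N[(s, a)] = N.get((s, a), 0) + 1
        Q[(s, a)] = Q.get((s, a), 0.0) + (ret - Q.get((s, a), 0.0)) / N[(s, a)]
        return ret

    for _ in range(n_iters):
        simulate(root)
    return max(actions(root), key=lambda a: Q.get((root, a), float('-inf')))
```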

Multi-armed Bandits

Framework for selection strategy balancing exploration and exploitation

Multi-armed bandits

Intuition: For every state \(s\), each action \(a\) = slot machine with initially unknown payoff distribution \(X_{a}\).


The “value” \(Q(s, a)\) of \(a\) corresponds to (current estimate of) \(X_{a}\).

\[ Q_t(a, s) = \frac{R(a, t)}{N(a, s, t)} \]

where \(N(a, s, t) =\) the number of times action \(a\) has been performed in state \(s\) at time \(t\), and \(R(a, t)\) is the total reward earned from those actions as of time \(t\).

Bandit algorithm

Input: Multi-armed bandit problem \(M = \langle a_i \sim X_i \rangle\)

Output: \(Q\)-values \(Q(a_i)\)

  • \(Q(a_i) \leftarrow 0\) for all arms \(a_i\)

  • \(N(a_i) \leftarrow 0\) for all arms \(a_i\)

  • \(t \leftarrow 1\)

while \(t < T\) do

  • \(a \leftarrow\) select(\(t\))

  • execute arm \(a\) for round \(t\) and observe reward \(r(a, t)\)

  • \(N(a) \leftarrow N(a) + 1\)

  • \(Q(a) \leftarrow Q(a) + \frac{1}{N(a)}[r(a, t) - Q(a)]\)

  • \(t \leftarrow t+1\)
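The loop above can be written as a short Python sketch, using \(\epsilon\)-greedy (covered below) as the select rule. The `arms` argument is a hypothetical list of zero-argument reward samplers:

```python
import random

def run_bandit(arms, T=1000, epsilon=0.1, seed=0):
    """Bandit loop with epsilon-greedy selection (illustrative sketch).

    Uses the incremental-mean update Q(a) <- Q(a) + (r - Q(a)) / N(a).
    """
    rng = random.Random(seed)
    Q = [0.0] * len(arms)
    N = [0] * len(arms)
    for t in range(T):
        if rng.random() < epsilon:            # explore: uniform random arm
            a = rng.randrange(len(arms))
        else:                                 # exploit: current best arm
            a = max(range(len(arms)), key=lambda i: Q[i])
        r = arms[a]()                         # pull arm a, observe reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental mean
    return Q, N
```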

Bandit Selection Rules

  • \(\epsilon\)-first
  • \(\epsilon\)-greedy
  • \(\epsilon\)-decreasing
  • Upper confidence bound (UCB)
  • softmax

\(\epsilon\)-first

Perform random actions for the first \(\epsilon \cdot T\) rounds, where \(T =\) total number of rounds, then exploit greedily thereafter.

For each strategy, consider:

  • What do the parameters mean for us as modelers?

  • What does it mean for these numbers to be bigger/smaller?

  • What are the benefits and pitfalls of each strategy?

  • In what situations might we choose a particular strategy with particular parameter values?

\(\epsilon\)-greedy

  • with probability \(1 - \epsilon\), choose the arm with maximum \(Q\)-value (exploit)

  • with probability \(\epsilon\), choose arm uniformly at random (explore)

\(\epsilon\)-decreasing

  • Adds a variable \(\alpha\) = “decay”

  • Like \(\epsilon\)-greedy, but the explore probability is \(\epsilon \cdot \alpha^t\) at timestep \(t\)

UCB1 (Upper confidence bound)

\[ \text{argmax}_{a}\Bigl( \overbrace{Q_t(a)}^{\text{exploit best action}} + \underbrace{\sqrt{\frac{2 \ln t}{N_t(a)}}}_{\text{explore to reduce ``UCB" on alternatives}}\Bigr) \]

  • Based on Hoeffding’s inequality in probability theory, moderates “greediness” based on the likely accuracy of estimate of \(Q(a)\) given available data

  • UCB “exploration bonus” term grows slowly (logarithmically) with \(t\), but shrinks with larger sample size \(N_t(a)\).
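A direct transcription of the UCB1 rule, for a tabular setting where `Q` and `N` are per-arm lists and `t` is the current round (names are illustrative):

```python
import math

def ucb1_select(Q, N, t):
    """UCB1: argmax_a Q[a] + sqrt(2 ln t / N[a]); unvisited arms first."""
    for a in range(len(N)):
        if N[a] == 0:
            return a          # exploration bonus is effectively infinite
    return max(range(len(Q)),
               key=lambda a: Q[a] + math.sqrt(2 * math.log(t) / N[a]))
```

Note how a rarely pulled arm (small \(N_t(a)\)) can win the argmax even with a lower \(Q\)-estimate, because its bonus term is large.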

softmax

\[ P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b\in A} e^{Q(b)/\tau}} \]

Approximates the ratio of (the exponential function of) the current expected return for \(a\) compared to the alternatives.

\(\tau \geq 0\) is the temperature parameter. It dictates the influence of the current \(Q\)-values on the decision.

  • lower \(\tau\) = more exploitation: small differences in \(Q\) get amplified in exponent, approaching argmax as \(\tau \to 0^+\).

  • higher \(\tau\) = more exploration: approaches uniform random as \(\tau \to \infty\).
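The distribution can be computed directly from the formula; the max-shift below is a standard numerical-stability trick (it does not change the probabilities, since it cancels in the ratio):

```python
import math

def softmax_probs(Q, tau=1.0):
    """P(a) = exp(Q(a)/tau) / sum_b exp(Q(b)/tau), with a max-shift
    so large Q/tau values do not overflow exp()."""
    m = max(Q)
    exps = [math.exp((q - m) / tau) for q in Q]
    z = sum(exps)
    return [e / z for e in exps]
```

An action would then be sampled from the returned distribution, e.g. with `random.choices(range(len(Q)), weights=probs)`.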

softmax continued

  • Softmax works well under drift, i.e., when the underlying reward distributions change over time

  • Graph shows softmax performance for “restless bandit” where reward distributions abruptly change at \(t= 250\).

Comparing bandit strategies

(Figures: bandit strategy performance compared with and without drift.)

Upper Confidence Trees (UCT)

MCTS commonly uses a variant of UCB for its selection rule:

\[ \text{argmax}_{a \in A(s)} \Bigl( Q(s, a) + \alpha\sqrt{\frac{2 \ln(N(s))}{N(s, a)}} \Bigr) \]

where \(N(s)\) is the number of times a node has been visited, \(N(s, a)\) is the number of times action \(a\) has been chosen in that state, and \(\alpha\) is an exploration parameter.

  • Note: when \(N(s, a) = 0\), the second term is “infinity”, i.e., always choose unexplored actions first.
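The UCT rule, including the unvisited-action case, can be transcribed directly. Here `N_s` and `N_sa` are hypothetical dicts holding the visit counts \(N(s)\) and \(N(s, a)\):

```python
import math

def uct_select(s, actions, Q, N_s, N_sa, alpha=1.4):
    """UCT: argmax_a Q(s,a) + alpha * sqrt(2 ln N(s) / N(s,a)).

    Unvisited actions score infinity, so they are always tried first.
    """
    def score(a):
        n = N_sa.get((s, a), 0)
        if n == 0:
            return float('inf')
        return Q[(s, a)] + alpha * math.sqrt(2 * math.log(N_s[s]) / n)
    return max(actions, key=score)
```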