8 Integrating Learning and Planning (MCTS)


Model-Based Reinforcement Learning


Last Module: MDPs

This Module: learn model directly from experience

  • \(\ldots\) and use planning to construct a value function or policy

Integrates learning and planning into a single architecture

Model-Based Reinforcement Learning

Model-Based and Model-Free RL

Model-Free RL

  • No model

  • Learn value function (and/or policy) from experience

Model-Based RL

  • Learn a model from experience

  • Plan value function (and/or policy) from model

The model supports lookahead: planning (or "thinking" ahead) to work out what the value function will be

Model-Free RL

Model-Based RL

Replace the real world with the agent’s (simulated) model of the environment

  • Supports rollouts (lookaheads) under imagined actions to reason about what value function will be, without further environment interaction

Model-Based RL (2)

Advantages of Model-Based RL

Advantages:

  • Can efficiently learn model by supervised learning methods

  • Can reason about model uncertainty, and even take actions to reduce uncertainty

Disadvantages:

  • First learn a model, then construct a value function

\(\;\;\;\Rightarrow\) two sources of approximation error

Learning a Model

What is a Model?

A model \(\mathcal{M}\) is a representation of an MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle\), parameterised by \(\eta\)

  • We will assume state space \(\mathcal{S}\) and action space \(\mathcal{A}\) are known

So a model \(\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\) represents state transitions
\(\mathcal{P}_\eta \approx \mathcal{P}\) and rewards \(\mathcal{R}_\eta \approx \mathcal{R}\)

\[ S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t) \]

\[ R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t) \]


Typically assume conditional independence between state transitions and rewards

\[ \mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] \;=\; \mathbb{P}[S_{t+1} \mid S_t, A_t] \; \mathbb{P}[R_{t+1} \mid S_t, A_t] \]

Note that you can learn from each (one-step) transition, treating the observed next reward and state as the supervised target for the preceding state–action pair.

Model Learning

Goal: estimate model \(\mathcal{M}_\eta\) from experience \(\{S_1, A_1, R_2, \ldots, S_T\}\)

This is a supervised learning problem

\[ \begin{aligned} S_1, A_1 &\;\to\; R_2, S_2 \\ S_2, A_2 &\;\to\; R_3, S_3 \\ &\;\vdots \\ S_{T-1}, A_{T-1} &\;\to\; R_T, S_T \end{aligned} \]

  • Learning \(s,a \to r\) is a regression problem

  • Learning \(s,a \to s'\) is a density estimation problem

  • Pick loss function, e.g. mean-squared error, KL divergence, …

  • Find parameters \(\eta\) that minimise empirical loss
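
To make the supervised-learning view concrete, here is a minimal sketch (with hypothetical toy data and integer state/action ids): it builds the \((S_t, A_t) \to (R_{t+1}, S_{t+1})\) training pairs from a trajectory, fits the reward regression by least squares on one-hot features, and estimates the transition density with empirical conditional counts.

```python
import numpy as np

# A trajectory S_1, A_1, R_2, S_2, ..., S_T stored as parallel lists
# (hypothetical toy data; states and actions are small integer ids).
states  = [0, 1, 1, 2, 0, 1]          # S_1 ... S_T
actions = [1, 0, 1, 0, 1]             # A_1 ... A_{T-1}
rewards = [0.0, 1.0, 0.0, 1.0, 0.0]   # R_2 ... R_T

n_states, n_actions = 3, 2

def one_hot(s, a):
    """One-hot feature vector for a state-action pair."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

# Learning s,a -> r is a regression problem: least-squares fit of eta
# minimising the mean-squared error on the observed rewards.
X = np.array([one_hot(s, a) for s, a in zip(states[:-1], actions)])
y = np.array(rewards)
eta_reward, *_ = np.linalg.lstsq(X, y, rcond=None)

# Learning s,a -> s' is a density estimation problem: with discrete
# states, the empirical conditional distribution is the simplest choice.
counts = np.zeros((n_states, n_actions, n_states))
for s, a, s_next in zip(states[:-1], actions, states[1:]):
    counts[s, a, s_next] += 1
P_hat = counts / np.maximum(counts.sum(axis=-1, keepdims=True), 1)
```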

Examples of Models

  • Table Lookup Model

  • Linear Expectation Model

  • Linear Gaussian Model

  • Gaussian Process Model

  • Deep Belief Network Model

  • \(\cdots\) almost any supervised learning model

Table Lookup Model

Model is an explicit MDP, \(\hat{\mathcal{P}}, \hat{\mathcal{R}}\)

  • Count visits \(N(s,a)\) to each state–action pair (parametric approach)

\[ \begin{align*} \hat{\mathcal{P}}^{a}_{s,s'} & = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')\\[0pt] \hat{\mathcal{R}}^{a}_{s} & = \frac{1}{N(s,a)} \sum_{t=1}^T \mathbf{1}(S_t, A_t = s, a)\, R_{t+1} \end{align*} \]

  • Alternatively (a simple non-parametric approach)
    • At each time-step \(t\), record experience tuple \(\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle\)
    • To sample model, randomly pick tuple matching \(\langle s,a,\cdot,\cdot \rangle\)
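
A minimal sketch of a table lookup model covering both variants (class and method names are illustrative, not from the lecture): it maintains the counts behind \(\hat{\mathcal{P}}\) and \(\hat{\mathcal{R}}\), and also stores raw experience tuples for the non-parametric sampling alternative.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Table-lookup model: counts visits N(s,a) and averages outcomes."""

    def __init__(self):
        self.n = defaultdict(int)                                  # N(s,a)
        self.next_counts = defaultdict(lambda: defaultdict(int))   # counts of s' for each (s,a)
        self.reward_sum = defaultdict(float)                       # summed rewards for each (s,a)
        self.transitions = []                                      # raw tuples (non-parametric variant)

    def update(self, s, a, r, s_next):
        """Record one real transition <s, a, r, s'>."""
        self.n[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.transitions.append((s, a, r, s_next))

    def p_hat(self, s, a, s_next):
        """Estimated transition probability P_hat(s' | s, a)."""
        return self.next_counts[(s, a)][s_next] / self.n[(s, a)]

    def r_hat(self, s, a):
        """Estimated expected reward R_hat(s, a)."""
        return self.reward_sum[(s, a)] / self.n[(s, a)]

    def sample(self, s, a):
        """Non-parametric variant: pick a stored tuple matching <s, a, ., .>."""
        _, _, r, s_next = random.choice(
            [t for t in self.transitions if t[0] == s and t[1] == a])
        return r, s_next
```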

AB Example (Revisited) - Building a Model

Two states \(A,B\); no discounting; \(8\) episodes of experience

\(A, 0, B, 0\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 0\)

We have constructed a table lookup model from this experience: A transitions to B with reward 0 (with probability 1), and from B the episode terminates, with reward 1 with probability 0.75 and reward 0 with probability 0.25.
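
Feeding the eight episodes into the table-lookup sketch above reproduces these estimates (a single dummy action is assumed, since the AB example involves no action choice, and 'terminal' marks episode end):

```python
# Feed the eight AB episodes into the TableLookupModel sketched above.
episodes = ([[('A', 0, 'B'), ('B', 0, 'terminal')]]
            + [[('B', 1, 'terminal')]] * 6
            + [[('B', 0, 'terminal')]])

model = TableLookupModel()
for episode in episodes:
    for s, r, s_next in episode:
        model.update(s, 'act', r, s_next)

print(model.p_hat('A', 'act', 'B'))   # 1.0  -- A always moved to B
print(model.r_hat('A', 'act'))        # 0.0  -- with reward 0
print(model.r_hat('B', 'act'))        # 0.75 -- B ended with reward 1 in 6 of 8 visits
```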

Planning with a Model

Planning with a Model

Given a model \(\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\)

  • Solve the MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\)

Using favourite planning algorithm

  • Value iteration

  • Policy iteration

  • Tree search

  • \(\cdots\)

Sample-Based Planning

A simple but powerful approach to planning is to use the model only to generate samples

Sample experience from model

\[ \begin{align*} S_{t+1} & \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)\\[0pt] R_{t+1} & = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t) \end{align*} \]

Apply model-free RL to samples, e.g.:

  • Q-learning

  • Monte-Carlo control or Sarsa

Sample-based planning methods are often more efficient than full-width planning, because sampling focuses computation on the states and transitions that are actually likely to occur
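
As a sketch of sample-based planning (assuming the `model.sample(s, a)` interface from the table-lookup sketch above, which returns a reward and next state drawn from the learned model, with `'terminal'` marking episode end), Q-learning can be run entirely on simulated transitions:

```python
import random
from collections import defaultdict

def sample_based_planning(model, states, actions, n_updates=10_000,
                          gamma=1.0, alpha=0.1, epsilon=0.1):
    """Plan by running Q-learning on experience sampled from the model.
    Assumes model.sample(s, a) returns (r, s_next) drawn from the learned
    model, with s_next == 'terminal' at the end of an episode."""
    Q = defaultdict(float)
    for _ in range(n_updates):
        s = random.choice(states)                  # start from an arbitrary known state
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda b: Q[(s, b)]))
        r, s_next = model.sample(s, a)             # simulated, not real, experience
        target = r if s_next == 'terminal' else \
            r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # standard Q-learning update
    return Q
```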

Back to the AB Example

  • Construct a table-lookup model from real experience

  • Apply model-free RL to sampled experience

Real experience

\(A, 0, B, 0\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 0\)

Sampled experience

\(B, 1\)

\(B, 0\)

\(B, 1\)

\(A, 0, B, 1\)

\(B, 1\)

\(A, 0, B, 1\)

\(B, 1\)

\(B, 0\)

e.g. Monte-Carlo learning: \(V(A) = 1\); \(V(B) = 0.75\)


We can sample as many trajectories as we want from the model, unlike from the environment

  • we essentially have infinite data
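
The Monte-Carlo values quoted above can be checked directly (every-visit, undiscounted Monte-Carlo evaluation on the eight sampled episodes):

```python
from collections import defaultdict

# The eight sampled episodes above, written as (state, reward) steps.
sampled = [[('B', 1)], [('B', 0)], [('B', 1)], [('A', 0), ('B', 1)],
           [('B', 1)], [('A', 0), ('B', 1)], [('B', 1)], [('B', 0)]]

returns = defaultdict(list)
for episode in sampled:
    g = 0.0
    # Accumulate the (undiscounted) return backwards from the end of the episode.
    for s, r in reversed(episode):
        g += r
        returns[s].append(g)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # {'B': 0.75, 'A': 1.0}
```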

Planning with an Inaccurate Model

Given an imperfect model \(\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \ne \langle \mathcal{P}, \mathcal{R} \rangle\)

  • Performance of model-based RL is limited to the optimal policy
    for the approximate MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\)

  • i.e. Model-based RL is only as good as the estimated model

When the model is inaccurate, the planning process will compute a sub-optimal policy

  • Solution 1: when model is wrong, use model-free RL

  • Solution 2: reason explicitly about model uncertainty (e.g. Bayesian approach)

Real and Simulated Experience

We consider two sources of experience

Real experience: sampled from the environment (true MDP)

\[ \begin{align*} S' & \sim \mathcal{P}^a_{s,s'}\\[0pt] R & = \mathcal{R}^a_{s} \end{align*} \]

Simulated experience: sampled from the model (approximate MDP)

\[ \begin{align*} S' & \sim \mathcal{P}_\eta(S' \mid S, A) \\[0pt] R & = \mathcal{R}_\eta(R \mid S, A) \end{align*} \]

Integrating Learning and Planning

Model-Free RL

  • No model

  • Learn value function (and/or policy) from real experience

Model-Based RL (using Sample-Based Planning)

  • Learn a model from real experience

  • Plan value function (and/or policy) from simulated experience


Integrated Architectures

  • Learn a model from real experience

  • Learn and plan value function (and/or policy) from real and simulated experience
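
The classic instance of this integrated architecture is Dyna. Below is a minimal tabular Dyna-Q sketch (assuming a hypothetical `env.reset()` / `env.step(a)` interface): it interleaves direct Q-learning on real transitions with planning updates on transitions replayed from a learned, here deterministic, model.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_steps=10_000, n_planning=10,
           gamma=0.95, alpha=0.1, epsilon=0.1):
    """Tabular Dyna-Q: learn from real experience and plan from the model.
    Assumes env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)
    model = {}                                     # (s, a) -> (r, s_next, done), deterministic model
    s = env.reset()
    for _ in range(n_steps):
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda b: Q[(s, b)]))
        s_next, r, done = env.step(a)              # real experience
        # (a) direct RL: Q-learning update on the real transition
        target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # (b) model learning: remember what the environment did
        model[(s, a)] = (r, s_next, done)
        # (c) planning: extra Q-learning updates on simulated transitions
        for _ in range(n_planning):
            (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
            ptarget = pr + (0.0 if pdone else gamma * max(Q[(ps_next, b)] for b in actions))
            Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
        s = env.reset() if done else s_next
    return Q
```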

Example: The Game of Go

Example: The Game of Go

The ancient oriental game of Go is \(2,500\) years old

Considered to be the hardest classic board game

Considered a grand challenge task for AI (John McCarthy)

Traditional game-tree search has failed in Go

Rules of Go

Usually played on a 19x19 board, also 13x13 or 9x9

Simple rules, complex strategy

  • Black and White place stones alternately

  • Surrounded stones are captured and removed

  • The player with more territory wins the game

Position Evaluation in Go

How good is a position \(s\)?

  • Reward function (undiscounted):

\[ \begin{align*} R_t & = 0 \quad \text{for all non-terminal steps } t < T\\[0pt] R_T & = \begin{cases} 1 & \text{if Black wins} \\ 0 & \text{if White wins} \end{cases} \end{align*} \]

  • Policy \(\pi = \langle \pi_B, \pi_W \rangle\) selects moves for both players

  • Value function (how good is position \(s\)):

\[ \begin{align*} v_\pi(s) & = \mathbb{E}_\pi \left[ R_T \mid S = s \right] = \mathbb{P}[ \text{Black wins} \mid S = s ]\\[0pt] v_*(s) & = \max_{\pi_B} \min_{\pi_W} v_\pi(s) \end{align*} \]

Monte-Carlo Evaluation in Go
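
A minimal sketch of Monte-Carlo position evaluation: estimate \(v_\pi(s) = \mathbb{P}[\text{Black wins} \mid S = s]\) by rolling out many games from \(s\) with a fixed (here uniformly random) policy for both players and averaging the terminal rewards. The Go interface (`copy()`, `legal_moves()`, `play()`, `is_terminal()`, `black_wins()`) is hypothetical.

```python
import random

def monte_carlo_evaluate(position, n_rollouts=1000):
    """Estimate v_pi(s) = P[Black wins | s] by averaging rollout outcomes.
    The position is assumed to expose copy(), legal_moves(), play(move),
    is_terminal() and black_wins() -- a hypothetical Go interface."""
    wins = 0
    for _ in range(n_rollouts):
        game = position.copy()
        # Play the game out with a uniformly random policy for both players
        # (the simplest choice of <pi_B, pi_W>).
        while not game.is_terminal():
            game.play(random.choice(game.legal_moves()))
        wins += 1 if game.black_wins() else 0      # terminal reward R_T in {0, 1}
    return wins / n_rollouts                       # Monte-Carlo estimate of v_pi(s)
```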

Applying Monte-Carlo Tree Search (1)–(5)