06 Monte Carlo Tree Search (MCTS)

Planning with a Model

Given a model \(\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\)

  • Solve the MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\)

Using your favourite planning algorithm, e.g.:

  • Tree search

  • \(\cdots\)
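
For concreteness, here is a minimal value-iteration sketch over the learned model; the tabular dictionaries P and R, standing in for \(\mathcal{P}_\eta\) and \(\mathcal{R}_\eta\), are assumptions for illustration.

    # Value-iteration sketch for the learned model. P[(s, a)] is an assumed
    # dict mapping next_state -> probability (standing in for P_eta);
    # R[(s, a)] is an assumed expected reward (standing in for R_eta).
    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = max(R[(s, a)] + gamma * sum(p * V[s2]
                                                for s2, p in P[(s, a)].items())
                        for a in actions)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                return V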

Sample-Based Planning

A simple but powerful approach to planning is to use the model only to generate samples

Sample experience from model

\[ \begin{align*} S_{t+1} & \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)\\[0pt] R_{t+1} & \sim \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t) \end{align*} \]

Apply model-free RL to samples, e.g.:

  • Q-learning

  • Monte-Carlo control or Sarsa

Sample-based planning methods are often more efficient than exhaustive full-width planning: sampling focuses computation on transitions that are likely under the model, rather than enumerating the entire MDP
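
As a concrete sketch of sample-based planning, the loop below draws transitions from an assumed sample_model interface, standing in for \((\mathcal{P}_\eta, \mathcal{R}_\eta)\), and applies tabular Q-learning to them:

    import random
    from collections import defaultdict

    # Sample-based planning: draw simulated transitions from the model and
    # apply ordinary Q-learning to them, exactly as if they were real.
    # sample_model(s, a) -> (next_state, reward) is an assumed interface
    # standing in for (P_eta, R_eta).
    def plan_with_q_learning(sample_model, states, actions,
                             n_updates=10000, alpha=0.1, gamma=0.9):
        Q = defaultdict(float)
        for _ in range(n_updates):
            s = random.choice(states)      # state to update
            a = random.choice(actions)     # exploratory action
            s2, r = sample_model(s, a)     # simulated experience
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
        return Q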

Back to the AB Example

  • Construct a table-lookup model from real experience

  • Apply model-free RL to sampled experience

Real experience

\(A, 0, B, 0\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 1\)

\(B, 0\)

Sampled experience

\(B, 1\)

\(B, 0\)

\(B, 1\)

\(A, 0, B, 1\)

\(B, 1\)

\(A, 0, B, 1\)

\(B, 1\)

\(B, 0\)

e.g. Monte-Carlo learning: \(V(A) = 1\); \(V(B) = 0.75\)
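
Checking the arithmetic on the sampled episodes above: the returns from \(B\) are \(1,0,1,1,1,1,1,0\), and both episodes starting at \(A\) yield return \(0+1\):

\[ V(B) = \frac{1+0+1+1+1+1+1+0}{8} = 0.75, \qquad V(A) = \frac{(0+1) + (0+1)}{2} = 1 \]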

We can sample as many trajectories as we want from the model, unlike from the environment

  • we essentially have infinite data
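
A minimal sketch of this procedure: build the table-lookup model by counting outcomes in the real episodes, then evaluate the eight sampled episodes above by first-visit Monte-Carlo (the episode encoding is an assumption for illustration):

    from collections import defaultdict

    # Table-lookup model: record every observed (reward, next_state) outcome
    # per state, so sampling uniformly from a state's list reproduces the
    # empirical transition and reward probabilities. None marks termination.
    real_episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]
    model = defaultdict(list)
    for episode in real_episodes:
        for i, (s, r) in enumerate(episode):
            s_next = episode[i + 1][0] if i + 1 < len(episode) else None
            model[s].append((r, s_next))

    # The eight sampled episodes listed above (drawn from the model).
    sampled = [[("B", 1)], [("B", 0)], [("B", 1)], [("A", 0), ("B", 1)],
               [("B", 1)], [("A", 0), ("B", 1)], [("B", 1)], [("B", 0)]]

    # First-visit Monte-Carlo evaluation on the sampled experience.
    returns = defaultdict(list)
    for episode in sampled:
        seen = set()
        for i, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(sum(r for _, r in episode[i:]))
    for s in ("A", "B"):
        print(s, sum(returns[s]) / len(returns[s]))  # A: 1.0, B: 0.75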

Planning with an Inaccurate Model

Given an imperfect model \(\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \ne \langle \mathcal{P}, \mathcal{R} \rangle\)

  • Performance of model-based RL is limited to the optimal policy
    for the approximate MDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle\)

  • i.e. model-based RL is only as good as the estimated model

When the model is inaccurate, the planning process will compute a sub-optimal policy

  • Solution 1: when the model is wrong, use model-free RL

  • Solution 2: reason explicitly about model uncertainty (e.g. with a Bayesian approach)
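
As one illustration of Solution 2 (a sketch, not a method prescribed by these notes): keep Dirichlet pseudo-counts over transitions and sample a complete model from the posterior before planning, as in posterior sampling:

    import numpy as np

    # Solution 2 sketch: Dirichlet pseudo-counts over transitions quantify
    # how uncertain the model still is; planning with a model sampled from
    # the posterior (posterior sampling) accounts for that uncertainty.
    class BayesTabularModel:
        def __init__(self, n_states, n_actions, prior=1.0):
            # prior > 0 smooths transitions that have never been observed
            self.counts = np.full((n_states, n_actions, n_states), prior)

        def update(self, s, a, s2):
            self.counts[s, a, s2] += 1.0  # count real experience

        def sample_P(self):
            # One Dirichlet draw per (s, a): a complete transition model
            # consistent with both the data and the remaining uncertainty.
            rng = np.random.default_rng()
            n_s, n_a, _ = self.counts.shape
            return np.array([[rng.dirichlet(self.counts[s, a])
                              for a in range(n_a)] for s in range(n_s)])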

Real and Simulated Experience

We consider two sources of experience

Real experience: sampled from the environment (true MDP)

\[ \begin{align*} S' & \sim \mathcal{P}^a_{s,s'}\\[0pt] R & = \mathcal{R}^a_{s} \end{align*} \]

Simulated experience: sampled from the model (approximate MDP)

\[ \begin{align*} S' & \sim \mathcal{P}_\eta(S' \mid S, A) \\[0pt] R & \sim \mathcal{R}_\eta(R \mid S, A) \end{align*} \]

Integrating Learning and Planning

Model-Free RL

  • No model

  • Learn value function (and/or policy) from real experience

Model-Based RL (using Sample-Based Planning)

  • Learn a model from real experience

  • Plan value function (and/or policy) from simulated experience

Integrated Architectures

  • Learn a model from real experience

  • Learn and plan value function (and/or policy) from real and simulated experience
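
This integrated architecture is essentially Dyna. A minimal Dyna-Q-style sketch, where env_step and the tabular model interfaces are assumptions:

    import random
    from collections import defaultdict

    # Dyna-Q-style loop: every real step updates the value function AND a
    # table-lookup model, then performs n_planning extra Q-learning updates
    # on transitions replayed from the model (simulated experience).
    # env_step(s, a) -> (r, s2) is an assumed environment interface.
    def dyna_q(env_step, start_state, actions, n_steps=1000,
               n_planning=5, alpha=0.1, gamma=0.95, epsilon=0.1):
        Q = defaultdict(float)
        model = {}                       # (s, a) -> (r, s2), deterministic table
        s = start_state
        for _ in range(n_steps):
            if random.random() < epsilon:            # epsilon-greedy behaviour
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s2 = env_step(s, a)                   # real experience
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # learn from real step
            model[(s, a)] = (r, s2)                  # learn the model
            for _ in range(n_planning):              # plan from simulated steps
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps2, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
        return Q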

Simulation-Based Search (1)

  • Forward search algorithms select the best action by look-ahead

  • They build a search tree with the current state \(s_t\) at the root

  • Using a model of the MDP to look ahead

No need to solve whole MDP, just sub-MDP starting from now

Simulation-Based Search (2)

  • Forward search paradigm using sample-based planning

  • Simulate episodes of experience from now with the model

  • Apply model-free RL to simulated episodes

Simulation-Based Search (3)

Simulate episodes of experience from now with the model

\[ \{ s_t, A_t^k, R_{t+1}^k, \ldots, S_T^k \}_{k=1}^K \;\sim\; \mathcal{M}_\nu \]

Apply model-free RL to simulated episodes

  • Monte-Carlo control \(\;\to\;\) Monte-Carlo search

  • Sarsa \(\;\to\;\) TD search

Monte-Carlo Tree Search (Evaluation)

Given a model \(\mathcal{M}_\nu\), simulate \(K\) episodes from current state \(s_t\) using current simulation policy \(\pi\)

\[ \{ {\color{red}{s_t}}, A_t^k, R_{t+1}^k, S_{t+1}^k, \ldots, S_T^k \}_{k=1}^K \;\sim\; \mathcal{M}_\nu, \pi \]

  • Build a search tree containing visited states and actions

  • Evaluate state-action pairs \(Q(s,a)\) by the mean return of episodes from each pair \((s,a)\)

\[ Q(\textcolor{red}{s,a}) = \frac{1}{N(s,a)} \sum_{k=1}^K \sum_{u=t}^T \mathbf{1}(S_u^k = s, A_u^k = a) \, G_u^k \;\;\overset{P}{\to}\;\; q_\pi(s,a) \]

After search is finished, select current (real) action with maximum value in search tree \(a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)\)
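
A compact sketch of this evaluation phase, with the search tree represented implicitly by Q and N tables; model_step (returning None at termination) and the fixed simulation policy are assumed interfaces:

    import random
    from collections import defaultdict

    # Monte-Carlo search (evaluation): estimate Q(s, a) by the mean return
    # over K episodes simulated from s_t, then act greedily at the root.
    # model_step(s, a) -> (r, next_state_or_None) samples the model M_nu;
    # policy(s) is the current simulation policy pi. Both are assumptions.
    def monte_carlo_search(s_t, actions, model_step, policy, K=1000, gamma=1.0):
        N = defaultdict(int)
        Q = defaultdict(float)
        for _ in range(K):
            s, visited, rewards = s_t, [], []
            while s is not None:                  # simulate one episode
                a = policy(s)
                visited.append((len(rewards), (s, a)))
                r, s = model_step(s, a)
                rewards.append(r)
            for u, sa in visited:                 # incremental-mean backup
                G = sum(gamma ** (i - u) * rewards[i]
                        for i in range(u, len(rewards)))
                N[sa] += 1
                Q[sa] += (G - Q[sa]) / N[sa]
        return max(actions, key=lambda a: Q[(s_t, a)])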

Monte-Carlo Tree Search (Simulation)

In MCTS, the simulation policy \(\pi\) improves

Each simulation consists of two phases (in-tree, out-of-tree)

  • Tree policy (improves): pick actions to maximise \(Q(S,A)\)

  • Default policy (fixed): pick actions randomly

Monte-Carlo Tree Search

Repeat (for each simulation):

  • Evaluate states \(Q(S,A)\) by Monte-Carlo evaluation

  • Improve the tree policy, e.g. by \(\epsilon\)–greedy(\(Q\))

Monte-Carlo control applied to simulated experience

  • Converges on the optimal search tree,
    \(Q(S,A) \;\to\; q_*(S,A)\)
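
Putting the two phases together, a minimal MCTS sketch with an \(\epsilon\)-greedy tree policy and a uniform-random default policy; model_step and actions are assumed interfaces, and terminal states are signalled by None:

    import random
    from collections import defaultdict

    # MCTS sketch with the two phases above: inside the tree, act
    # epsilon-greedily on the current Q estimates (the improving tree
    # policy); outside it, act uniformly at random (the fixed default
    # policy). Each simulation grows the tree by one state.
    # model_step(s, a) -> (r, next_state_or_None) and actions(s) are
    # assumed interfaces; None signals a terminal state.
    def mcts(root, model_step, actions, n_sims=1000, gamma=1.0, epsilon=0.1):
        N = defaultdict(int)             # visit counts per (s, a)
        Q = defaultdict(float)           # mean-return estimates
        tree = {root}                    # states currently in the search tree

        def tree_policy(s):
            if random.random() < epsilon:
                return random.choice(actions(s))
            return max(actions(s), key=lambda a: Q[(s, a)])

        for _ in range(n_sims):
            s, path, rewards, in_tree = root, [], [], True
            while s is not None:
                if in_tree and s not in tree:
                    tree.add(s)          # expansion: grow tree by one state
                    in_tree = False
                if in_tree:
                    a = tree_policy(s)                   # in-tree phase
                    path.append((len(rewards), (s, a)))
                else:
                    a = random.choice(actions(s))        # rollout phase
                r, s = model_step(s, a)
                rewards.append(r)
            for u, sa in path:           # Monte-Carlo backup through the tree
                G = sum(gamma ** (i - u) * rewards[i]
                        for i in range(u, len(rewards)))
                N[sa] += 1
                Q[sa] += (G - Q[sa]) / N[sa]
        return max(actions(root), key=lambda a: Q[(root, a)])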

Example: The Game of Go

The ancient game of Go is 2,500 years old

Considered to be the hardest classic board game

Considered a grand challenge task for AI (John McCarthy)

Traditional game-tree search has failed in Go

Rules of Go

Usually played on a 19×19 board, also 13×13 or 9×9

Simple rules, complex strategy

  • Black and White place stones alternately

  • Surrounded stones are captured and removed

  • The player with more territory wins the game

Position Evaluation in Go

How good is a position \(s\)?

  • Reward function (undiscounted):

\[ \begin{align*} R_t & = 0 \quad \text{for all non-terminal steps } t < T\\[0pt] R_T & = \begin{cases} 1 & \text{if Black wins} \\ 0 & \text{if White wins} \end{cases} \end{align*} \]

  • Policy \(\pi = \langle \pi_B, \pi_W \rangle\) selects moves for both players

  • Value function (how good is position \(s\)):

\[ \begin{align*} v_\pi(s) & = \mathbb{E}_\pi \left[ R_T \mid S = s \right] = \mathbb{P}[ \text{Black wins} \mid S = s ]\\[0pt] v_*(s) & = \max_{\pi_B} \min_{\pi_W} v_\pi(s) \end{align*} \]

Monte-Carlo Evaluation in Go
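
A sketch of Monte-Carlo position evaluation: estimate \(v(s) = \mathbb{P}[\text{Black wins} \mid s]\) by averaging the outcomes of random rollouts; legal_moves, play, is_terminal and black_wins are hypothetical stand-ins for a real Go implementation:

    import random

    # Monte-Carlo evaluation of a Go position: play both players with a
    # uniform-random rollout policy to the end of the game and average the
    # terminal rewards (1 if Black wins, 0 if White wins).
    # legal_moves, play, is_terminal and black_wins are hypothetical
    # helpers standing in for a real Go implementation.
    def evaluate_position(s, legal_moves, play, is_terminal, black_wins, n=1000):
        wins = 0
        for _ in range(n):
            pos = s
            while not is_terminal(pos):
                pos = play(pos, random.choice(legal_moves(pos)))
            if black_wins(pos):
                wins += 1
        return wins / n   # estimate of v(s) = P(Black wins | s)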

Applying Monte-Carlo Tree Search (1)–(5)

[Figures: five successive snapshots of Monte-Carlo tree search applied to a Go position, showing the search tree growing and its value estimates updating with each simulation]