Tic-Tac-Toe with Monte Carlo Tree Search

I have been playing around with games and tree search algorithms lately. One algorithm that I wanted to try out is the Monte Carlo tree search, MCTS for short. It tries to find certain leaves of the tree which are successful in some sense. In a recent post I have started my own tree search library and set the foundation to try more games. In this post I want to show a bit of Tic-Tac-Toe and MCTS.

In the beginning I neither had a model of the game, nor the MCTS algorithm implemented. This article will go through both, and then I will show some of the results.

Modeling the game

First we need a model of the game. I just use ASCII art to show the state, that will have be sufficient for now. The empty board looks like this:

 | |
—————
 | |
—————
 | |

When we add one cross, we have nine different possibilities.

 x| |    |x|    | |x   | |    | |    | |    | |    | |    | |
 —————  —————  —————  —————  —————  —————  —————  —————  —————
  | |    | |    | |   x| |    |x|    | |x   | |    | |    | |
 —————  —————  —————  —————  —————  —————  —————  —————  —————
  | |    | |    | |    | |    | |    | |   x| |    |x|    | |x

They are pretty redundant. And we can actually reduce them to three classes:

 | |    | |    | |
—————  —————  —————
 | |    | |    |x|
—————  —————  —————
 | |x   |x|    | |

One doesn't have to pick these particular representatives for the classes, but it doesn't matter.

When we go one step further, we have the following normalized states:

 | |o   | |    | |    | |    | |    | |    | |    | |    | |    | |    | |    | |
—————  —————  —————  —————  —————  —————  —————  —————  —————  —————  —————  —————
 | |    | |o   | |    |o|    | |    | |x  o| |x   | |o   |o|    | |    |x|    |x|
—————  —————  —————  —————  —————  —————  —————  —————  —————  —————  —————  —————
x| |   x| |   o| |x   | |x   |o|x  o| |    | |    |x|    |x|    |x|o   | |o   |o|

Not all can be reached from the above representatives. I just re-normalize all the states, so the board gets rotated and mirrored all the time. Since the board has this D₄ symmetry (eight elements), it doesn't really matter whether one applies transformations.

The number of possible states in the game are listed in the following table. One can see that the normalization roughly reduces the number of states by a factor of 10. This is a big saving, and it wasn't that hard to implement.

Round Number of States Normalized States
0 1 1
1 9 3
2 72 12
3 504 66
4 3024 360
5 15120 1710
6 54720 5992
7 148176 15878
8 200448 20964
9 127872 13538

With a random walk

With the model of the game in place, we can just let it do a random walk. It looks like this:

 | | 
—————
 | | 
—————
 | |x

 | | 
—————
 | |o
—————
x| | 

 | |x
—————
 | | 
—————
x|o| 

 | |x
—————
 | |o
—————
x|o| 

 | |x
—————
 |x|o
—————
x|o| 

Here the starting player with the cross won. This is pure randomness, so there is nothing to learn. One can just see that the implementation of the game and the normalization seems to work. One can see how the board was rotated and mirrored twice.

There are a few different manifestations of MCTS. For this game we need the variant which has moves alternating between two players. There are four phases, though I found it more sensible to combine two phases.

  1. Start from the root and go down along the children which are already present until a leaf is found. The way that the children are chosen is such that the currently active player has the largest chance of winning.
  2. At a leaf, add all the next possible game moves to the tree. Pick one of them to go forward. Then add all next elements and go forward until the game has ended.
  3. Backpropagate the score, a +1 for all node where the winning player made a move. Otherwise +0.5 for all nodes in the current playout.

This way the algorithm will explore here and there, but mostly stick to tread-out paths. Reducing the tree by a lot of pruning gives the following after 10000 playouts. The game state is written in one line, so the three rows of the game are just concatenated after each other.

The two players are marked with colors, red and blue. The two numbers means “wins/total”. At each branching point, the ones with the highest ratio of wins should be taken.

You can see that the most tread-out path is one where the game ends in a draw. And that makes sense, because we know how Tic-Tac-Toe goes: When neither player makes a mistake, it ends in a draw. And this is what the algorithm finds while optimizing both players at the same time.

One could use this tree to play the game. One could refine it whenever a human opponent makes a move. This would narrow down the opportunities left. One would pick the option which would yield the most wins.

This algorithm can also be used to find other terminal nodes in a tree, one just has to adapt the backpropagation slightly.