This article distills my understanding of articles, video tutorials, and code implementations such as Sunrise: Understanding AlphaZero, MCTS, Self-Play, UCB from Scratch. Starting from the design principles of AlphaGo, it gradually reveals how to build an AI Gomoku system that can surpass human players, by working through the two core mechanisms of MCTS and Self-Play.
AlphaGo: From Imitation to Surpassing#
Evolution Path of AlphaGo#
The number of possible Go positions exceeds the number of atoms in the universe, rendering traditional exhaustive search methods completely ineffective. AlphaGo addresses this issue through a phased approach: first learning human knowledge, then continuously evolving through self-play.
This evolutionary process can be divided into three layers:
- Imitating human experts
- Self-play enhancement
- Learning to evaluate situations
Core Components#
Rollout Policy#
A lightweight policy network used for fast rollout simulations; its accuracy is lower than the SL policy network's, but it is far cheaper to evaluate.
SL Policy Network#
The supervised learning (SL) policy network learns to play by imitating human games (a minimal code sketch of such a network follows the list):
- Input: Board state
- Output: Probability distribution of the next move imitating human experts
- Training data: about 160,000 human expert games, roughly 30 million board positions (moves)
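Below is a minimal sketch of what such a network can look like. It is an illustrative assumption (a small CNN with made-up layer sizes and feature-plane count), not AlphaGo's actual 13-layer architecture:

```python
# Illustrative sketch only: a small CNN mapping board feature planes to move logits.
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    def __init__(self, in_planes=8, board_size=19):   # in_planes / layer sizes are assumptions
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, kernel_size=1)    # one logit per board point

    def forward(self, x):                              # x: (batch, in_planes, 19, 19)
        return self.head(self.trunk(x)).flatten(1)     # (batch, 361) move logits

# Supervised training minimizes cross-entropy against the move the human expert played:
#   loss = torch.nn.functional.cross_entropy(logits, expert_move_index)
```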
RL Policy Network#
Similar to when your chess skills reach a certain level: you start reviewing games and playing against yourself, discovering new tactics and deeper strategies. The RL policy network is initialized from the SL network; through self-play it goes on to surpass the SL policy and finds many powerful strategies that humans have not discovered.
- This stage generates a large amount of self-play data. The input is still the board state, and the output is the playing strategy improved through reinforcement learning.
- The network plays against earlier (historical) versions of the policy network and adjusts its parameters using the win/loss signal, reinforcing the moves that lead to victory (a minimal code sketch of this update follows):

$$\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t$$

Where:
- $s_t, a_t$ are the states and actions of the game sequence
- $z_t$ is the win-loss label (win is positive, loss is negative)
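A minimal sketch of this REINFORCE-style update, assuming a PyTorch policy network like the one above (the baseline and batching details of the real training pipeline are omitted):

```python
# Reinforce moves from won games, suppress moves from lost games (illustrative sketch).
import torch

def reinforce_update(policy_net, optimizer, states, actions, z):
    """states: (T, planes, 19, 19); actions: (T,) move indices; z: +1 for a win, -1 for a loss."""
    logits = policy_net(states)                               # (T, num_moves)
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]   # log p_rho(a_t | s_t)
    loss = -(z * chosen).mean()    # gradient ascent on z * log p == descent on its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```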
Value Network#
The value network learns to evaluate situations; it can be implemented as a CNN (a training sketch follows the list):
- Training objective: minimize the mean squared error

$$L(\theta) = \big(z - v_\theta(s)\big)^2$$

- $z$: final win-loss result of the game
- $v_\theta(s)$: predicted winning probability for state $s$
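A minimal training sketch under the same assumptions as the earlier snippets (`value_net` is a hypothetical network that outputs one scalar per position):

```python
# Regress the predicted value toward the actual game outcome (illustrative sketch).
import torch
import torch.nn.functional as F

def value_update(value_net, optimizer, states, outcomes):
    """states: (N, planes, 19, 19); outcomes: (N,) final results z as +1.0 / -1.0 floats."""
    v = value_net(states).squeeze(-1)     # predicted value v_theta(s)
    loss = F.mse_loss(v, outcomes)        # mean of (z - v_theta(s))^2 over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```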
Supplementary Explanation#
About the relationship between the policy network and the value network:
- The policy network provides specific move probabilities to guide the search.
- The value network provides situation evaluations, reducing unnecessary simulations in the search tree.
- The combination of both allows MCTS to not only explore high win-rate paths faster but also significantly improve the overall playing level.
AlphaGo's MCTS Implementation#
Selection Phase#
Combining exploration and exploitation, the action selected at state $s_t$ is (implemented in the sketch below):

$$a_t = \arg\max_a \big( Q(s_t, a) + u(s_t, a) \big), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$

Where:
- $Q(s, a)$: long-term return estimate
- $u(s, a)$: exploration bonus
- $P(s, a)$: prior probability output by the policy network
- $N(s, a)$: visit count
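A minimal selection sketch, assuming each node stores its outgoing edges in a dict with per-edge statistics `Q`, `N`, and prior `P` (these names, and the constant `c_puct`, are assumptions; the variant below also scales the bonus by the square root of the parent's total visit count, as commonly done in PUCT-style implementations):

```python
import math

def select_action(node, c_puct=5.0):
    # Total visits over all edges leaving this node
    total_visits = sum(edge.N for edge in node.edges.values())

    def score(edge):
        # Exploration bonus: large for high-prior, rarely visited edges
        u = c_puct * edge.P * math.sqrt(total_visits) / (1 + edge.N)
        return edge.Q + u

    # Pick the action whose edge maximizes Q + u
    return max(node.edges.items(), key=lambda kv: score(kv[1]))[0]
```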
Simulation and Evaluation#
In the original MCTS algorithm, the role of the simulation phase is to conduct random playouts from the leaf node (the newly expanded node) using a fast rollout policy until the game ends, then obtain a reward based on the outcome. This reward is passed back to the nodes in the search tree to update their value estimates (i.e., $Q(s, a)$).
This implementation is simple and direct, but the rollout policy is often random or based on simple rules, leading to potentially poor simulation quality. It can only provide short-term information and does not effectively integrate global strategic evaluations.
AlphaGo, however, combines the Value Network with the rollout policy during the simulation phase: the Value Network provides a more accurate leaf-node estimate that reflects the global situation, while the rollout policy captures local short-term effects through rapid simulations.
Using the hyperparameter $\lambda$ to balance the value-network estimate $v_\theta(s_L)$ and the rollout outcome $z_L$, it considers both local simulations and global evaluations. The evaluation function:

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$
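A minimal sketch of this mixed leaf evaluation; `value_net`, `rollout_policy`, and `play_rollout` are hypothetical helpers used only for illustration:

```python
def evaluate_leaf(state, value_net, rollout_policy, lam=0.5):
    v = value_net(state)                      # global evaluation v_theta(s_L)
    z = play_rollout(state, rollout_policy)   # fast playout to the end of the game, returns +1 / -1
    return (1 - lam) * v + lam * z            # V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L
```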
Backpropagation#
Over $n$ MCTS simulations, the visit count of each edge is updated as (where $1(s, a, i)$ is the indicator of whether edge $(s, a)$ was traversed in the $i$-th simulation):

$$N(s, a) = \sum_{i=1}^{n} 1(s, a, i)$$

The Q value, i.e., the long-term expected return of executing action $a$ in state $s$, is updated as the mean of the leaf evaluations:

$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} 1(s, a, i)\, V(s_L^i)$$

where $s_L^i$ is the leaf node reached in the $i$-th simulation.
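A minimal backpropagation sketch under the same assumed edge structure (each edge keeps a running total `W` so that `Q` stays the mean of the leaf values that passed through it):

```python
def backpropagate(path, leaf_value):
    """path: edges traversed from the root to the leaf; leaf_value: the evaluation V(s_L)."""
    for edge in path:
        edge.N += 1                 # one more visit through this edge
        edge.W += leaf_value        # accumulate total value
        edge.Q = edge.W / edge.N    # Q is the running mean of leaf evaluations
```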
Summary#
- Structural Innovation:
  - The policy network provides prior knowledge
  - The value network provides global evaluation
  - MCTS provides tactical validation
- Training Innovation:
  - Starting from supervised learning
  - Surpassing the teacher through reinforcement learning
  - Self-play generates new knowledge
- MCTS Improvement:
  - Using neural networks to guide the search
  - The Policy Network provides prior probabilities for exploration directions, while the Value Network improves the accuracy of leaf-node evaluations
  - The combination of value-network and rollout evaluations not only reduces the search width and depth but also significantly improves overall performance
  - Efficient exploration-exploitation balance
This design enables AlphaGo to find efficient solutions in a vast search space, ultimately surpassing human levels.
Heuristic Search and MCTS#
MCTS is like an evolving explorer, searching for the best path in a decision tree.
Core Idea#
What is the essence of MCTS? Simply put, it is a "learn while playing" process. Imagine you are playing a brand new board game:
- At first, you will try various possible moves (exploration)
- Gradually discover that certain moves work better (exploitation)
- Balance between exploring new strategies and exploiting known good strategies
This is exactly what MCTS does, but it systematizes this process mathematically. It is a rollout algorithm that guides strategy selection by accumulating value estimates from Monte Carlo simulations.
Algorithm Flow#
The video Monte Carlo Tree Search - YouTube explains the MCTS process very well.
The elegance of MCTS lies in its four simple yet powerful steps, which I will introduce from two perspectives, A and B (a minimal code sketch of the full loop follows the list):

- Selection
  - A: Do you know how children learn? They always hover between the known and the unknown. MCTS is the same: starting from the root node, it uses the UCB (Upper Confidence Bound) formula to weigh whether to follow a known good path or explore new possibilities.
  - B: Starting from the root node, select a successor node from the current node based on a specific strategy. This strategy is usually based on the search history of the tree, selecting the most promising path. For example, we pick an action at each node according to the current strategy to balance exploration and exploitation, gradually descending deeper into the tree.
- Expansion
  - A: Just like an explorer opening up new areas on a map, when we reach a leaf node, we expand downwards, creating new possibilities.
  - B: When the selected node still has unexplored child nodes, we expand new nodes under it based on the set of possible actions. The purpose is to increase the breadth of the decision tree, gradually generating possible decision paths. Through this expansion operation, we ensure that the search covers more possible state-action pairs $(s, a)$.
- Simulation
  - A: This is the most interesting part. Starting from the new node, we play a "hypothetical" game until it ends. It's like playing chess while mentally simulating "if I move this way, my opponent will move that way...".
  - B: Starting from the newly expanded node, perform a random simulation (rollout), sampling along the current strategy in the MDP environment until a terminal state. This provides a return estimate from the current node to the end, offering numerical evidence for the quality of the path.
- Backpropagation
  - A: Finally, we pass the result back along the path, updating the statistics of each node. It's like saying, "Hey, I tried this path, and it worked pretty well (or not so well)".
  - B: After the simulation, backpropagate its estimated return to all the nodes traversed to update their values. This accumulates historical return information, making future choices trend more accurately toward high-reward paths.
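To make the four steps concrete, here is a minimal generic MCTS loop. The `Node` class, `legal_actions`, `step`, and `rollout` helpers are illustrative assumptions, not a particular library's API; for two-player games the sign of the reward would also need to alternate between levels, which is omitted here for brevity:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}        # action -> Node
        self.N, self.W = 0, 0.0   # visit count and total return

    def value(self):
        return self.W / self.N if self.N else 0.0

def mcts(root, n_simulations=1000, c=1.4):
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend while the node is fully expanded, maximizing UCB1
        while node.children and len(node.children) == len(legal_actions(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ch.value() + c * math.sqrt(math.log(node.N) / ch.N))
        # 2. Expansion: add one untried child (if the game is not over at this node)
        untried = [a for a in legal_actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation: random playout from the new node until the game ends
        reward = rollout(node.state)
        # 4. Backpropagation: update statistics along the path back to the root
        while node is not None:
            node.N += 1
            node.W += reward
            node = node.parent
    # Return the most-visited action at the root
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```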
UCB1: The Perfect Balance of Exploration and Exploitation#
Here, I want to particularly mention the UCB1 formula, which is the soul of MCTS:

$$UCB1_j = \bar{X}_j + c\sqrt{\frac{\ln N}{n_j}}$$

Let's break it down:
- $\bar{X}_j$ is the average return of choice $j$ (exploitation term)
- $c\sqrt{\ln N / n_j}$ is the uncertainty measure (exploration term), where $N$ is the total number of trials so far and $n_j$ is how often choice $j$ has been tried
- $c$ is the exploration parameter
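Translated directly into code (a one-line sketch; the default `c = 1.4`, roughly $\sqrt{2}$, is an assumption, and the restaurant example below applies the same formula):

```python
import math

def ucb1(avg_return, total_trials, n_j, c=1.4):
    # Exploitation (average return) plus an exploration bonus that shrinks as n_j grows
    return avg_return + c * math.sqrt(math.log(total_trials) / n_j)
```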
Like a good investment portfolio, it needs to focus on known good opportunities while remaining open to new ones (exploration-exploitation trade-off).
The video Best Multi-Armed Bandit Strategy? (feat: UCB Method) explains the multi-armed bandit problem and the UCB method very well, and I borrow the example used by that teacher:
A Foodie Trying to Understand UCB#
Imagine you have just arrived in a city with 100 restaurants to choose from, and you have 300 days. Every day you must choose a restaurant to dine at, and you want your meals over these 300 days to be as good as possible on average.
ε-greedy Strategy: Simple but Not Smart Enough#
This is like deciding by rolling a die:
- 90% of the time: go to the best-known restaurant (exploitation)
- 10% of the time (ε = 0.1): randomly try a restaurant (exploration)
```python
import random
import numpy as np

def epsilon_greedy(restaurants, ratings, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(restaurants)        # Exploration: pick any restaurant at random
    else:
        return restaurants[np.argmax(ratings)]   # Exploitation: pick the best-rated so far
```
The problems with this approach are:
- Exploration is completely random, potentially revisiting known bad restaurants
- The exploration ratio is fixed and does not adjust over time
- It does not consider the impact of visit counts
UCB Strategy: A Smarter Choice Trade-off#
In restaurant selection, the UCB formula reads:

$$\text{UCB score} = \underbrace{\bar{x}_i}_{\text{average rating}} + \underbrace{c\sqrt{\frac{\ln t}{n_i}}}_{\text{exploration bonus}}$$

where $t$ is the number of days so far and $n_i$ is the number of visits to restaurant $i$.

For example, consider two restaurants on day 100 (taking $c = 2$, as in the code below):

Restaurant A:
- Visited 50 times, average score 4.5
- UCB score = $4.5 + 2\sqrt{\ln 100 / 50} \approx 4.5 + 0.61 = 5.11$

Restaurant B:
- Visited 5 times, average score 4.0
- UCB score = $4.0 + 2\sqrt{\ln 100 / 5} \approx 4.0 + 1.92 = 5.92$

Although Restaurant B has a lower average score, it has been visited far less often and so carries more uncertainty; UCB gives it a larger exploration bonus, and it wins on this day.
Code implementation:
```python
import math
import numpy as np

class Restaurant:
    def __init__(self, name):
        self.name = name
        self.total_rating = 0.0
        self.visits = 0

def ucb_choice(restaurants, total_days, c=2):
    # Ensure each restaurant is visited at least once
    for r in restaurants:
        if r.visits == 0:
            return r
    # Use the UCB formula to score every restaurant and pick the highest
    scores = []
    for r in restaurants:
        avg_rating = r.total_rating / r.visits
        exploration_term = c * math.sqrt(math.log(total_days) / r.visits)
        ucb_score = avg_rating + exploration_term
        scores.append(ucb_score)
    return restaurants[np.argmax(scores)]
```
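A minimal usage sketch (the daily ratings here are just random draws, purely for illustration):

```python
import random

restaurants = [Restaurant(f"Restaurant {i}") for i in range(100)]
for day in range(1, 301):
    choice = ucb_choice(restaurants, total_days=day)
    rating = random.uniform(3.0, 5.0)   # stand-in for how good the meal actually was
    choice.total_rating += rating       # record the observed rating
    choice.visits += 1
```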
Why is UCB Better?#
- Adaptive Exploration
  - Restaurants with fewer visits receive a higher exploration bonus
  - As the total number of days grows, the exploration term shrinks, so the strategy focuses more on exploitation
- Balanced Time Investment
  - It does not waste too much time on clearly inferior restaurants
  - It reasonably allocates visits among restaurants with similar potential
- Theoretical Guarantee
  - The regret (the gap from always making the optimal choice) grows only logarithmically over time (see the note after this list)
  - 300 days of exploration is enough to home in on the best few restaurants
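For reference, this is the standard result from the bandit literature, stated loosely: UCB1's expected regret grows logarithmically in the number of rounds $T$, while a greedy strategy with a fixed $\varepsilon$ keeps exploring at a constant rate and therefore accumulates regret linearly:

$$\mathbb{E}\big[R_T^{\mathrm{UCB1}}\big] = O(\ln T), \qquad \mathbb{E}\big[R_T^{\varepsilon\text{-greedy}}\big] = \Theta(\varepsilon T)$$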
Now, back to MCTS.
Why is MCTS So Powerful?#
- No Domain Knowledge Required: it does not need any game-specific strategy knowledge and can find good actions purely through self-exploration.
- Can Stop Anytime: the search can be interrupted at any moment and still return the current best choice, which is particularly useful in real-time systems.
- Natural Exploration-Exploitation Balance: through the UCB mechanism, it automatically balances exploring new strategies and exploiting known good ones.
AlphaZero: From MCTS to Self-Evolution#
Self-Play#
Here, we learn from and draw on the excellent implementation in schinger/alphazero.py.