Let's build AlphaZero

This article is my digest of articles, video tutorials, and code implementations such as Sunrise: Understanding AlphaZero, MCTS, Self-Play, UCB from Scratch. Starting from the design principles of AlphaGo, it gradually reveals how to build a Gomoku AI that can surpass humans, by working through the two core mechanisms of MCTS and Self-Play.

AlphaGo: From Imitation to Surpassing#

Evolution Path of AlphaGo#

The number of possible Go positions exceeds the number of atoms in the universe, rendering traditional exhaustive search methods completely ineffective. AlphaGo addresses this issue through a phased approach: first learning human knowledge, then continuously evolving through self-play.
This evolutionary process can be divided into three layers:

  1. Imitating human experts
  2. Self-play enhancement
  3. Learning to evaluate situations

Core Components#


Rollout Policy#

A lightweight policy network used for fast rollouts, with lower accuracy but high computational efficiency.

SL Policy Network#

The supervised learning policy network $P_\sigma$ learns to play by imitating human games:

  • Input: Board state
  • Output: Probability distribution of the next move imitating human experts
  • Training data: 160,000 games, approximately 30 million moves

RL Policy Network#

Similar to when your chess skills reach a certain level: you start reviewing games and playing against yourself, discovering new tactics and deeper strategies. The RL network $P_\rho$ is initialized from the SL network; after reinforcement learning it surpasses human play and can find many powerful strategies that humans have not discovered.

  • This stage generates a large amount of self-play data; the input is still the board state, and the output is a playing policy improved through reinforcement learning.
  • By self-playing against historical versions of the policy network, the network adjusts its parameters through the win/loss signal, reinforcing the moves that lead to victory:
$\sum_{t} z(\tau) \log p(a_t \mid s_t; \rho)$

Where:

  • $\tau$ is the game sequence $(s_0, a_0, s_1, a_1, \ldots)$
  • $z(\tau)$ is the win-loss label (win is positive, loss is negative)
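
As a rough sketch of this update (assuming a hypothetical single-layer policy_net in PyTorch; the real network is a deep CNN over board feature planes), the objective above can be maximized by minimizing its negative:

import torch
import torch.nn.functional as F

# Hypothetical tiny policy: flattened 19x19 board -> 361 move logits
policy_net = torch.nn.Linear(19 * 19, 19 * 19)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, z):
    """states: (T, 361) float tensor, actions: (T,) long tensor,
    z: +1 for a win, -1 for a loss (the same sign for every move of the game)."""
    logits = policy_net(states)                       # (T, 361)
    log_probs = F.log_softmax(logits, dim=-1)         # log p(a | s; rho)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(z * chosen).sum()                        # minimize -sum_t z * log p(a_t | s_t; rho)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()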

Value Network#

The value network $v_\theta$ learns to evaluate situations and can be a CNN:

  • Training objective: Minimize mean squared error
$\sum_{i} (z - v_\theta(s))^2$
  • $z$: Final win-loss result
  • $v_\theta(s)$: Predicted winning probability
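
A minimal sketch of this objective (again with a hypothetical small network standing in for the real CNN):

import torch
import torch.nn.functional as F

# Hypothetical value head: flattened 19x19 board -> scalar evaluation in (-1, 1)
value_net = torch.nn.Sequential(
    torch.nn.Linear(19 * 19, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
    torch.nn.Tanh(),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def value_update(states, z):
    """states: (B, 361) positions, z: (B, 1) final outcomes in {-1, +1}."""
    pred = value_net(states)
    loss = F.mse_loss(pred, z)   # mean over the batch of (z - v_theta(s))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()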

Supplementary Explanation#

About the relationship between $P_\rho$ and $v_\theta$:

  • The policy network $P_\rho$ provides specific move probabilities to guide the search.
  • The value network $v_\theta$ provides situation evaluations, reducing unnecessary simulations in the search tree.
  • The combination of both allows MCTS to not only explore high win-rate paths faster but also significantly improve the overall playing level.

AlphaGo's MCTS Implementation#

Selection Phase#

Combining exploration and exploitation:

$a_t = \arg\max_a \left( Q(s_t, a) + u(s_t, a) \right)$
$u(s_t, a) \propto \dfrac{P(s_t, a)}{1 + N(s_t, a)}$

Where:

  • $Q(s_t, a)$: Long-term return estimate
  • $u(s_t, a)$: Exploration reward
  • $P(s_t, a)$: Prior probability output from the policy network
  • $N(s_t, a)$: Visit count
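
A sketch of this selection rule over a node's child edges (the children, P, N, Q fields are illustrative, not AlphaGo's actual data structures):

def select_action(node, c_puct=5.0):
    """Pick the child edge maximizing Q(s, a) + u(s, a),
    with u proportional to P(s, a) / (1 + N(s, a))."""
    best_action, best_score = None, float("-inf")
    for action, edge in node.children.items():   # each edge stores P, N, Q statistics
        u = c_puct * edge.P / (1 + edge.N)
        score = edge.Q + u
        if score > best_score:
            best_action, best_score = action, score
    return best_action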

Simulation and Evaluation#

In the original MCTS algorithm, the role of the simulation phase is to conduct random games from the leaf nodes (newly expanded nodes) using a fast rollout policy until the game ends, then obtain a reward based on the outcome of the game. This reward is passed back to the nodes in the search tree to update their value estimates (i.e., $Q(s, a)$).

This implementation is simple and direct, but the rollout policy is often random or based on simple rules, leading to potentially poor simulation quality. It can only provide short-term information and does not effectively integrate global strategic evaluations.

AlphaGo, however, combines the Value Network $v_\theta$ with the rollout policy during the simulation phase. The Value Network provides more efficient leaf node estimates and global capability evaluations, while the rollout policy captures local short-term effects through rapid simulations.

Using the hyperparameter $\lambda$ to balance $v_\theta(s_L)$ and $z_L$, it considers both local simulations and global evaluations. The evaluation function is:

$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda z_L$
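
In code, the mixed leaf evaluation is a one-liner (the AlphaGo paper reports $\lambda = 0.5$ working best; value_net and rollout_to_end stand in for the two evaluators):

def evaluate_leaf(leaf_state, value_net, rollout_to_end, lam=0.5):
    """V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    v = value_net(leaf_state)        # global evaluation from the value network
    z = rollout_to_end(leaf_state)   # outcome of a fast rollout played out from the leaf
    return (1 - lam) * v + lam * z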

Backpropagation#

Over $n$ MCTS simulations, the visit count of each edge is updated as follows (where $\mathbb{I}(s, a, i)$ is an indicator function equal to 1 if edge $(s, a)$ was traversed in simulation $i$):

$N(s_t, a) = \sum_{i=1}^{n} \mathbb{I}(s, a, i)$

The Q value update, i.e., the long-term expected return of taking action $a$ at node $s_t$:

$Q(s_t, a) = \dfrac{1}{N(s_t, a)} \sum_{i=1}^{n} \mathbb{I}(s, a, i)\, V(s_L^i)$
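
Equivalently, most implementations maintain these statistics incrementally while walking back up the visited path (a sketch using the same illustrative edge fields as above):

def backpropagate(path, leaf_value):
    """path: the edges visited from the root down to the leaf.
    Each traversal adds 1 to N and folds V(s_L) into the running mean Q."""
    for edge in path:
        edge.N += 1
        # incremental mean: Q <- Q + (V - Q) / N, identical to the sum / count form
        edge.Q += (leaf_value - edge.Q) / edge.N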

Summary#

  1. Structural Innovation:

    • The policy network provides prior knowledge
    • The value network provides global evaluation
    • MCTS provides tactical validation
  2. Training Innovation:

    • Starting from supervised learning
    • Surpassing the teacher through reinforcement learning
    • Self-play generates new knowledge
  3. MCTS Improvement:

    • Using neural networks to guide the search
    • The Policy Network provides prior probabilities for exploration directions, while the Value Network enhances the accuracy of leaf node evaluations.
    • This combination of value network and rollout evaluations not only reduces the search width and depth but also significantly improves overall performance.
    • Efficient exploration-exploitation balance

This design enables AlphaGo to find efficient solutions in a vast search space, ultimately surpassing human levels.

Heuristic Search and MCTS#

MCTS is like an evolving explorer, searching for the best path in a decision tree.

Core Idea#

What is the essence of MCTS? Simply put, it is a "learn while playing" process. Imagine you are playing a brand new board game:

  • At first, you will try various possible moves (exploration)
  • Gradually discover that certain moves work better (exploitation)
  • Balance between exploring new strategies and exploiting known good strategies

This is exactly what MCTS does, but it systematizes this process mathematically. It is a rollout algorithm that guides strategy selection by accumulating value estimates from Monte Carlo simulations.

Algorithm Flow#

Monte Carlo Tree Search - YouTube: this teacher explains the MCTS process very well.

The elegance of MCTS lies in its four simple yet powerful steps, which I will introduce using two perspectives, A and B (a code sketch putting the four steps together follows the list):

  1. Selection

    • A: Do you know how children learn? They always hover between the known and the unknown. MCTS is the same: starting from the root node, it uses the UCB (Upper Confidence Bound) formula to weigh whether to choose a known good path or explore new possibilities.
    • B: Starting from the root node, select a subsequent node from the current node based on a specific strategy. This strategy is usually based on the search history of the tree, selecting the most promising path. For example, we perform action $a$ at each node based on the current strategy $\pi(s)$ to balance exploration and exploitation, gradually delving deeper.
  2. Expansion

    • A: Just like an explorer opening up new areas on a map, when we reach a leaf node, we will expand downwards, creating new possibilities.
    • B: When the selected node still has unexplored child nodes, we expand new nodes under this node based on the set of possible actions. The purpose of this process is to increase the breadth of the decision tree, gradually generating possible decision paths. Through this expansion operation, we ensure that the search covers more possible state-action pairs $(s, a)$.
  3. Simulation

    • A: This is the most interesting part. Starting from the new node, we conduct a "hypothetical" game until the game ends. It's like playing chess while mentally simulating "if I move this way, my opponent will move that way...".
    • B: Starting from the currently expanded node, perform a random simulation (rollout), sampling along the current strategy $\pi(s)$ in the MDP environment until the terminal state. This process provides a return estimate from the current node to the endpoint, offering numerical evidence for the quality of the path.
  4. Backpropagation

    • A: Finally, we return the results along the path, updating the statistics of each node. It's like saying, "Hey, I tried this path, and it worked pretty well (or not so well)".
    • B: After completing the simulation, backpropagate the estimated return of the simulation to all the nodes traversed to update their values. This process accumulates historical return information, making future choices more accurately trend towards high-reward paths.
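
Putting the four steps together, one search iteration looks roughly like this (a structural sketch; the Node fields and the ucb_select, expand, rollout helpers are illustrative names, not a specific library):

import random

def mcts_iteration(root):
    # 1. Selection: walk down the tree via UCB until reaching a leaf
    node, path = root, [root]
    while node.children and not node.is_terminal():
        node = ucb_select(node)
        path.append(node)
    # 2. Expansion: create children for the leaf's legal actions and pick one
    if not node.is_terminal():
        expand(node)
        node = random.choice(list(node.children.values()))
        path.append(node)
    # 3. Simulation: random playout from the new node until the game ends
    reward = rollout(node)
    # 4. Backpropagation: update visit counts and value sums along the path
    # (in a two-player game the reward's sign is flipped at alternating levels)
    for n in reversed(path):
        n.visits += 1
        n.value_sum += reward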


UCB1: The Perfect Balance of Exploration and Exploitation#

Here, I want to particularly mention the UCB1 formula, which is the soul of MCTS:

$UCB1 = \bar{X}_i + C \sqrt{\dfrac{\ln N}{n_i}}$

Let's break it down:

  • $\bar{X}_i$ is the average return (exploitation term)
  • $\sqrt{\ln N / n_i}$ is the uncertainty measure (exploration term)
  • $C$ is the exploration parameter

Like a good investment portfolio, it needs to focus on known good opportunities while remaining open to new ones (exploration-exploitation trade-off).

Best Multi-Armed Bandit Strategy? (feat: UCB Method): this video explains the Multi-Armed Bandit problem and the UCB method very well, and I borrow this teacher's example:

A Foodie Trying to Understand UCB#

Imagine you have just arrived in a city with 100 restaurants to choose from, and you have 300 days. Every day you must pick one restaurant to dine at, hoping to eat as well as possible on average over those 300 days.

ε-greedy Strategy: Simple but Not Smart Enough#

This is like deciding by rolling a die:

  • 90% of the time (ε=0.1): Go to the known best restaurant (exploitation)
  • 10% of the time: Randomly try a new restaurant (exploration)
import random
import numpy as np

def epsilon_greedy(restaurants, ratings, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(restaurants)  # Exploration: pick any restaurant at random
    else:
        return restaurants[np.argmax(ratings)]  # Exploitation: pick the best-rated so far

The shortcomings of this approach:

  • Exploration is completely random, potentially revisiting known bad restaurants
  • The exploration ratio is fixed and does not adjust over time
  • It does not consider the impact of visit counts

UCB Strategy: A Smarter Choice Trade-off#

The UCB formula in restaurant selection means:

$\text{Score} = \text{Average Score} + C \sqrt{\dfrac{\ln(\text{Total Visit Days})}{\text{Visits to This Restaurant}}}$

For example, consider the situation of two restaurants on day 100:

Restaurant A:

  • Visited 50 times, average score 4.5
  • UCB score = $4.5 + 2\sqrt{\ln(100)/50} \approx 4.5 + 0.6 = 5.1$

Restaurant B:

  • Visited 5 times, average score 4.0
  • UCB score = $4.0 + 2\sqrt{\ln(100)/5} \approx 4.0 + 1.9 = 5.9$

Although Restaurant B has a lower average score, due to fewer visits and higher uncertainty, UCB gives it a higher exploration reward.


Code implementation:

import math
import numpy as np

class Restaurant:
    def __init__(self, name):
        self.name = name
        self.total_rating = 0
        self.visits = 0

def ucb_choice(restaurants, total_days, c=2):
    # Ensure each restaurant is visited at least once
    for r in restaurants:
        if r.visits == 0:
            return r

    # Use the UCB formula to choose a restaurant
    scores = []
    for r in restaurants:
        avg_rating = r.total_rating / r.visits
        exploration_term = c * math.sqrt(math.log(total_days) / r.visits)
        ucb_score = avg_rating + exploration_term
        scores.append(ucb_score)

    return restaurants[np.argmax(scores)]
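
To sanity-check it, here is a quick simulation over 300 days against hypothetical ground-truth restaurant qualities (the qualities and the noise model are made up for illustration):

import random

restaurants = [Restaurant(f"restaurant_{i}") for i in range(100)]
true_quality = {r.name: random.uniform(2.0, 5.0) for r in restaurants}

for day in range(1, 301):
    r = ucb_choice(restaurants, total_days=day)
    rating = random.gauss(true_quality[r.name], 0.5)   # noisy daily experience
    r.total_rating += rating
    r.visits += 1

best = max(restaurants, key=lambda r: r.total_rating / r.visits)
print(f"Most promising restaurant: {best.name}, average rating {best.total_rating / best.visits:.2f}")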

Why is UCB Better?#

  1. Adaptive Exploration

    • Restaurants with fewer visits receive higher exploration rewards
    • As total days increase, the exploration term gradually decreases to better focus on exploitation
  2. Balanced Time Investment

    • It does not waste too much time on clearly inferior restaurants
    • It reasonably allocates visit counts among restaurants with similar potential
  3. Theoretical Guarantee

    • Regret Bound (the gap from the optimal choice) grows logarithmically over time
    • 300 days of exploration is sufficient to find the best few restaurants

Now, back to MCTS.

Why is MCTS So Powerful?#

  1. No Domain Knowledge Required
    It does not require any specific game strategy knowledge and can find good actions solely through self-exploration.

  2. Can Stop Anytime
    Interrupting the search at any time can yield the current best choice. This is particularly useful in real-time systems.

  3. Natural Exploration-Exploitation Balance
    Through the UCB mechanism, it automatically balances between exploring new strategies and exploiting known good strategies.

AlphaZero: From MCTS to Self-Evolution#

Self-Play#

Here we learn from and draw on the excellent implementation schinger/alphazero.py.
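
As a preview, a single self-play episode boils down to something like the following sketch (the game and mcts objects and their methods are illustrative placeholders, not the actual API of that repository):

import random

def self_play_episode(game, mcts, temperature=1.0):
    """Play one game against itself and return (state, MCTS policy, outcome)
    training examples for the policy/value network."""
    examples = []                                   # (state, pi, player) pending the outcome
    state, player = game.initial_state(), 1
    while True:
        pi = mcts.action_probabilities(state, temperature)   # search-improved policy
        examples.append((state, pi, player))
        action = random.choices(range(len(pi)), weights=pi)[0]
        state, player = game.next_state(state, action, player)
        winner = game.winner(state)                 # +1 / -1 when finished, 0 for a draw, None otherwise
        if winner is not None:
            # label each position with the final result from that player's perspective
            return [(s, p, winner * pl) for s, p, pl in examples]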
