This year, I participated in Kaggle competitions using the DeepSeek-Math-7B-RL model, with Claude 3.5 Sonnet as a teacher model. The strength of both models owes a great deal to RL. I have a vague sense that the techniques in this field are powerful and elegant, so I decided to explore them a bit. However, my foundations are not solid and I couldn't get through OpenAI Spinning Up, so I chose to start with an introductory course like HuggingFace DeepRL.
Introduction#
The core idea of reinforcement learning is that an agent (AI) learns from the environment through interaction (trial and error), receiving rewards (positive or negative) as feedback for the actions it takes.
- The agent obtains the first frame from the environment, receiving state $S_0$. Based on $S_0$, the agent takes action $A_0$ (for example, moving to the right), and the environment returns a new frame (new state $S_1$) along with a reward $R_1$ (for example, +1 for not dying).
The agent's goal is to maximize its cumulative reward, referred to as expected return.
Reward Hypothesis: The Core Idea of Reinforcement Learning#
Reinforcement learning is based on the reward hypothesis, which states that all goals can be described as maximizing expected return (expected cumulative reward). In reinforcement learning, to achieve optimal behavior, our goal is to learn to take actions that maximize expected cumulative reward.
Tasks are divided into two categories#
- Episodic: has a starting point and an ending point. (Game levels)
- Continuous: has a starting point but no ending point. (Self-driving)
What is the difference between State and Observation?#
- State is a complete description of the world state (no hidden information), such as a chessboard.
- Observation is a partial description of the state, such as a view in a Mario game.
Policy-based Methods#
In this method, the policy is learned directly.
It maps each state either to the best action for that state or to a probability distribution over the actions available in that state.
Value-based Methods#
Instead of training a policy, it trains a value function, which maps each state to the expected value of being in that state.
Exploration/Exploitation Trade-off#
In reinforcement learning, we need to balance the degree of exploration of the environment with the degree of exploitation of known information.
Where does depth come from?#
The "deep" part comes from introducing deep neural networks as function approximators to solve reinforcement learning problems.
Practical Example - HUGGY#
It is recommended to play directly using Colab. If running in WSL2, you need to set up a virtual display:
```python
import os
os.environ['PYVIRTUALDISPLAY_DISPLAYFD'] = '0'

from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()
```
State Space#
We want to train Huggy to pick up the stick we throw. This means it needs to move correctly towards the stick.
The environmental information we provide includes:
- The position of the target (stick)
- Its relative position to the target
- The orientation of its legs.
Based on all this information, Huggy can use its Policy to decide what action to take next to achieve its goal.
Action Space#
The actions that Huggy can perform.
Joint motors drive Huggy's legs.
Therefore, to achieve the goal, Huggy needs to learn how to correctly rotate the joint motors of each leg to move.
Reward Function#
Thus, our reward function includes:
- Direction reward: Reward for getting closer to the target.
- Goal achievement reward: We reward Huggy for achieving the goal.
- Time penalty: A fixed time penalty given for each action, forcing it to reach the stick as quickly as possible.
- Rotation penalty: Penalty for excessive and rapid rotations.
Introduction to Q-Learning#
Q-learning is a value-based method:
- Starting from state $s$, we act according to policy $\pi$.
- We accumulate future rewards, with rewards further in the future discounted by a factor $\gamma$.
- The value function computes the expected "return" from the current state onward.
The goal of an RL agent is to have an optimal policy π*.
Relationship between Value and Policy#
Finding the optimal value function allows us to derive the optimal policy:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$

- $\pi^*(s)$: the optimal policy, i.e., the best action to take in state $s$.
- $Q^*(s, a)$: the optimal state-action value function, the expected discounted return of taking action $a$ in state $s$.
- $\arg\max_a$: choosing the action that maximizes $Q^*(s, a)$, i.e., the action with the highest expected return.

The core idea of this formula is that if we have the optimal $Q^*$ function, we can determine the optimal policy by selecting, in each state $s$, the action that maximizes $Q^*(s, a)$.
Two Types of Value-Based Methods#
Imagine a scenario where a mouse is searching for cheese in a maze:
- The state value function is like the mouse's expectation at a given position (state): "If I start from here and act according to my policy, how well will I do at finding the cheese?"
- The action value function goes a step further: "If I start from here and choose to move right, what is my probability of success?" The action value function helps the mouse evaluate the effects of executing different actions in each state, allowing for more targeted action direction.
1. State Value Function#
The state value function evaluates the expected cumulative reward obtained by starting from a given state $s$ and following a specific policy $\pi$. In other words, it measures: "If I am currently in state $s$ and follow policy $\pi$, what will my long-term return be?"

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$

- $V^\pi(s)$: the value of state $s$, i.e., the expected cumulative return from starting at $s$ and following policy $\pi$.
- $G_t$: the cumulative reward starting from time step $t$, usually written as $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, where $\gamma$ is the discount factor.
- $\mathbb{E}_\pi$: the expectation under policy $\pi$.
Suppose we have a maze, and the mouse starts from a certain position (state $s$) and follows a certain policy to search for cheese. The state value function can tell us how quickly the mouse can expect to find the cheese from its current position.
In the example above, the number in each grid cell indicates the value of that state. A state with a very negative value has a poor expected cumulative reward (it is far from the cheese), while states closer to the cheese have higher values and better expected returns.
2. Action Value Function#
The action value function evaluates the expected cumulative reward of taking action $a$ from a given state $s$ and then following a specific policy $\pi$. It considers not only the value of the state but also the expected return of taking a specific action from that state.

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$

- $Q^\pi(s, a)$: the value of the state-action pair $(s, a)$, i.e., the expected cumulative return from taking action $a$ in state $s$ and then following policy $\pi$.
- Unlike the state value function, the action value function evaluates the long-term return of executing a specific action starting from state $s$.
Using the previous maze example: suppose the mouse is now in state $s$ and can choose among four directions (up, down, left, right) in its search for cheese. The action value function tells us the expected return of each choice. For instance, moving right from a certain state might have a higher value than moving left, because moving right gets closer to the cheese.
Summary of Differences#
- The state value function only considers the expected cumulative return of starting from a given state and acting according to the policy from then on; it does not care which specific action is taken first.
- The action value function is more detailed: it considers not only the return of starting from a given state, but also the expected return after executing a specific action in that state.
Bellman Equation#
Returning to the previous example: suppose the mouse starts from state $S_t$ and receives a reward of $-1$ for each step, taking six steps to reach the end:

$$G_t = -1 - 1 - 1 - 1 - 1 - 1 = -6$$

This means the cumulative return from state $S_t$ to the endpoint is $-6$.
Calculating the cumulative return for each state directly like this is clearly a costly process, as we need to consider every possible path and its rewards, along with a lot of repeated calculations. The Bellman equation provides us with a more concise recursive method to solve this problem.
The core idea of the Bellman equation is that the value of a state can be expressed as the immediate reward plus the discounted value of the state that follows. The formula is as follows:

$$V_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, V_\pi(S_{t+1}) \mid S_t = s \right]$$

This way, we do not need to compute the cumulative return for each state from scratch; instead, we recursively combine the immediate reward with the value of the next state.
Let's return to the previous example. If we already know that the value of the next state $S_{t+1}$ is $-5$, then the value of state $S_t$ can be computed recursively with the Bellman equation:

$$V(S_t) = R_{t+1} + \gamma\, V(S_{t+1})$$

If there is no discounting (i.e., $\gamma = 1$), and we know that $V(S_{t+1}) = -5$ and the immediate reward $R_{t+1} = -1$, then:

$$V(S_t) = -1 + 1 \times (-5) = -6$$

This is consistent with our previous direct calculation.
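To make the recursion concrete, here is a minimal sketch (my own toy example, not from the course) that evaluates state values on a simple chain of states with a reward of $-1$ per step, by repeatedly applying the Bellman backup until the values stop changing:

```python
# Minimal sketch: iterative evaluation on a 1-D chain (toy example).
# States 0..5 move deterministically to the right; state 5 is terminal.
# Reward is -1 per step, gamma = 1, so V(s) should converge to -(5 - s).
import numpy as np

n_states = 6
gamma = 1.0
V = np.zeros(n_states)

for _ in range(100):  # sweep until values stabilize
    new_V = V.copy()
    for s in range(n_states - 1):          # terminal state keeps V = 0
        next_s = s + 1
        new_V[s] = -1 + gamma * V[next_s]  # Bellman backup: immediate reward + discounted next value
    if np.allclose(new_V, V):
        break
    V = new_V

print(V)  # [-5. -4. -3. -2. -1.  0.]
```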
The Role of the Discount Factor $\gamma$#
The discount factor $\gamma$ in the Bellman equation determines how much we value future rewards.
- If $\gamma$ is low (e.g., 0.1 or 0), we almost completely ignore future rewards and focus on the current immediate return.
- If $\gamma = 1$, we consider all future rewards, making every future step as important as the current one.
- If $\gamma$ were too large (say, a million), it would lose physical meaning, because future values would be inflated without bound, which is unreasonable in practical applications.
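As a quick numerical illustration (my own toy reward sequence, not from the course), here is how the discounted return of the same rewards changes with $\gamma$:

```python
# Toy illustration: discounted return G = sum_k gamma^k * r_k for one reward sequence.
rewards = [1, 1, 1, 1, 1]  # hypothetical rewards, one per future step

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, round(discounted_return(rewards, gamma), 3))
# gamma=0.0 -> 1.0    (only the first reward counts)
# gamma=0.5 -> 1.938
# gamma=0.9 -> 4.095
# gamma=1.0 -> 5.0    (every future reward counts fully)
```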
Monte Carlo Methods and Temporal Difference Learning#
Monte Carlo and Temporal Difference Learning are two different strategies for training value functions or policy functions, both of which leverage experience to solve reinforcement learning problems.
Monte Carlo: Learning at the End of Episodes#
Monte Carlo methods are a way to estimate the value of states or state-action pairs through repeated sampling. In this method, we update the value function based on the complete experience of each episode after it ends. Below, I will explain how the Monte Carlo method works through a specific example.
- Record state, action, reward, and next state:
  - Each time the agent takes an action, it records the current state, the action taken, the reward received, and the next state it transitions to.
- Calculate the total return $G_t$:
  - At the end of the episode, the agent sums all the rewards to compute the total return $G_t$, which is used to evaluate the current policy's performance.
- Update the value function $V(S_t)$:
  - Update the value estimate for a state using the formula:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$

Where:
- $\alpha$ is the learning rate, indicating the update step size.
- $G_t$ is the actual total return obtained.
- $V(S_t)$ is the current value estimate for state $S_t$.
After updating, the agent will start the next episode with the new state value function and continue to learn and improve its policy through multiple games.
Using the same scenario:
- Suppose that at the beginning, the mouse's value estimate for every state is 0 (i.e., $V(S) = 0$).
- The learning rate is $\alpha = 0.1$, and we do not apply discounting (i.e., $\gamma = 1$).
- During its exploration, the mouse receives a series of rewards, with an accumulated return of $G_0 = 3$.

Algorithm flow:
- Return $G_0 = 3$.
- Update the value of state $S_0$:

$$V(S_0) \leftarrow V(S_0) + \alpha \left[ G_0 - V(S_0) \right]$$

Substituting the known values:

$$V(S_0) = 0 + 0.1 \times (3 - 0) = 0.3$$

Through this episode's update, the mouse's value estimate for state $S_0$ becomes 0.3.
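Here is a minimal sketch of this (every-visit) Monte Carlo update applied to one finished episode; the episode data and variable names are my own illustration, not the course's:

```python
# Monte Carlo update sketch: update V only after the episode has ended.
from collections import defaultdict

V = defaultdict(float)   # state-value estimates, default 0
alpha, gamma = 0.1, 1.0

# One hypothetical finished episode: (state, reward received after leaving it)
episode = [("s0", -1), ("s1", -1), ("s2", 1)]

# Compute the return G_t for each time step by walking backwards, then update V.
G = 0.0
for state, reward in reversed(episode):
    G = reward + gamma * G              # G_t = R_{t+1} + gamma * G_{t+1}
    V[state] += alpha * (G - V[state])  # V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))

print(dict(V))  # e.g. {'s2': 0.1, 's1': 0.0, 's0': -0.1}
```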
Temporal Difference Learning: Learning at Every Step#
In contrast to the Monte Carlo method, Temporal Difference learning (TD learning) does not wait until the end of an entire episode to update. Instead, it adjusts after each single step, based on the immediate reward $R_{t+1}$ observed when leaving the current state and the estimated value of the next state, $V(S_{t+1})$.
The formula is as follows:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \right]$$

Where:
- $\alpha$ is the learning rate.
- $R_{t+1}$ is the immediate reward.
- $\gamma$ is the discount factor, controlling the weight of future rewards.

This update method is called bootstrapping, because it partly relies on an existing value estimate (i.e., $V(S_{t+1})$) rather than on a complete sampled return $G_t$.
Continuing with our mouse searching for cheese:
- Initialize the value function: initially, we assume the value of every state is 0.
- The agent explores: the mouse starts from state $S_0$ and randomly chooses an action; suppose it moves left and eats the cheese, receiving an immediate reward $R_{t+1} = 1$.
- Calculate the TD target: the TD target is the immediate reward plus the discounted value of the next state, $R_{t+1} + \gamma V(S_{t+1})$. In this example, suppose the value of the next state is $V(S_1) = 0$ and $\gamma = 1$; then the TD target is $1 + 1 \times 0 = 1$.
- Update the value function: use the TD target to update the value of state $S_0$:

$$V(S_0) \leftarrow V(S_0) + \alpha \left[ R_{t+1} + \gamma V(S_1) - V(S_0) \right]$$

Substituting values (with learning rate $\alpha = 0.1$):

$$V(S_0) = 0 + 0.1 \times (1 + 1 \times 0 - 0) = 0.1$$

Thus, the mouse updates the value of the current state after every single step. As it continues to interact with the environment, the value function gradually comes to reflect the true value of each state.
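For comparison with the Monte Carlo sketch above, here is a minimal TD(0) sketch that updates after every transition instead of waiting for the episode to end; the transition data is again my own illustration:

```python
# TD(0) update sketch: bootstrap from the current estimate of the next state's value.
from collections import defaultdict

V = defaultdict(float)   # state-value estimates, default 0
alpha, gamma = 0.1, 1.0

def td_update(state, reward, next_state, done):
    # TD target = immediate reward + discounted value of the next state
    # (the next state's value is 0 if the episode terminated).
    td_target = reward + gamma * (0.0 if done else V[next_state])
    V[state] += alpha * (td_target - V[state])

# Hypothetical single step: the mouse moves left, eats the cheese, episode ends.
td_update("s0", reward=1, next_state="terminal", done=True)
print(V["s0"])  # 0.1
```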
Comparison Summary: From the Perspective of Bias and Variance#
- Monte Carlo Method: waits until the end of an episode and updates the value function using the actual cumulative return $G_t$. Because it updates from complete episode experience, it provides relatively accurate value estimates, and it is particularly suitable when no explicit model or transition probabilities are available.
  - **Unbiased**: Monte Carlo obtains complete returns directly from actual experience, so the estimate's expected value matches the true expected value.
  - **High Variance**: the return of an entire episode involves many state transitions and rewards, which makes the estimates high-variance.
  - In short: high variance but unbiased, suited to evaluating complete episode experience.
- Temporal Difference Learning: updates the value function after every step, so it can learn without complete returns, using the estimated return $R_{t+1} + \gamma V(S_{t+1})$ instead of $G_t$.
  - **Biased**: TD bootstraps, and the initial value estimates are often arbitrary and inaccurate, introducing bias. The bias shrinks as real experience accumulates, but it can cause serious problems: combining off-policy learning, function approximators (like neural networks), and bootstrapping often leads to convergence failures, which is Sutton and Barto's **"deadly triad"**.
  - **Low Variance**: each update only involves a single state transition and reward, so its variance is relatively low.
  - In short: low variance but biased; initial estimates may be inaccurate, but it is well suited to step-by-step, real-time updates.
Q-Learning Related Terms#
Q-Learning is an off-policy value-based method that uses TD methods to train its action value function: the Q function. The letter Q stands for "Quality," indicating the value of taking that action in that state.
Given a state and action, the Q function outputs a state-action value (also known as Q value).
Here, let's review the difference between Value and Reward:
| Term | Explanation |
| --- | --- |
| Value of a state | Expected cumulative reward, i.e., the total reward the agent expects to accumulate when starting from that state and acting according to the policy. |
| Reward | The immediate feedback received from the environment after performing an action in a state. |
Internally, our Q function is encoded by a Q Table, where each cell in the table corresponds to a state-action pair value. This Q Table can be viewed as the memory of the Q function (Cheat Sheet).
You can see, for example, that the state-action value for the initial state and the "move up" action is 0.
Thus, when training is complete, we will obtain an optimal Q function, which means we have an optimal Q Table, equivalent to having an optimal policy, as we can determine the best action to take in each state from the Q Table.
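As a tiny illustration of why the Q-table is equivalent to a policy (the numbers below are my own toy values), reading off the greedy policy is just a row-wise argmax:

```python
import numpy as np

# Hypothetical 3-state, 2-action Q-table (toy numbers for illustration).
Qtable = np.array([
    [0.0, 1.2],   # state 0: action 1 looks better
    [0.5, 0.1],   # state 1: action 0 looks better
    [2.0, 2.0],   # state 2: tie, argmax picks the first action
])

greedy_policy = np.argmax(Qtable, axis=1)  # best action for every state
print(greedy_policy)  # [1 0 0]
```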
Algorithm Flow#
Scary pseudocode:
Step 1: Initialize Q Table#
Initialize the Q table for each state-action pair. Most often initialized to 0.
```python
import numpy as np

def initialize_q_table(state_space, action_space):
    # All state-action values start at 0.
    Qtable = np.zeros((state_space, action_space))
    return Qtable
```
Step 2: Choose an action using ε-greedy policy#
The ε-greedy policy is a strategy that balances exploration and exploitation. By gradually reducing the probability of exploration, the agent tends to explore new actions in the early stages of learning, while later relying more on learned knowledge to choose the best action.
- With probability 1 − ε: choose the action with the highest Q value in the current Q-table (exploitation).
- With probability ε: randomly choose an action (exploration).
- Initial stage: when ε = 1, the agent explores completely, i.e., it selects actions at random.
- Training process: as training progresses, ε gradually decreases (e.g., exponential decay), and the agent increasingly chooses the currently estimated best action, i.e., exploitation.
```python
import random
import numpy as np

def greedy_policy(Qtable, state):
    # Exploitation: choose the action with the highest state-action value.
    # Classic Q-learning operation: look up the Q table and take the action
    # with the highest estimated value in the current state.
    # Simple, direct, purely "exploitation."
    action = np.argmax(Qtable[state][:])  # np.argmax returns the index of the maximum value
    return action

def epsilon_greedy_policy(Qtable, state, epsilon):
    random_num = random.uniform(0, 1)  # random number in [0, 1]
    # If random number > epsilon --> exploitation.
    # This is key: with probability 1 - epsilon we exploit,
    # greedily selecting the best action in the Q table.
    if random_num > epsilon:
        # Choose the action with the highest value in the given state.
        action = greedy_policy(Qtable, state)
    # Otherwise --> exploration.
    # If the random number is less than epsilon, the agent explores,
    # randomly selecting an action to try new possibilities.
    else:
        action = env.action_space.sample()  # `env` is the Gym environment defined elsewhere in the notebook
    return action
```
Step 3: Execute action $A_t$, receive reward $R_{t+1}$, and transition to next state $S_{t+1}$#
In Q-learning, this step is the core of the algorithm. When the agent executes action $A_t$, it receives feedback from the environment (reward $R_{t+1}$) and transitions to a new state $S_{t+1}$. At this point, the agent needs to consider not only the immediate reward obtained but also the potential future returns.
This leads us to the Q-learning update formula.
Step 4: Update $Q(S_t, A_t)$#
Q-learning uses a very important update rule, which we have already seen in temporal difference learning:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

- $Q(S_t, A_t)$ is our current estimate of the state-action pair's value.
- $R_{t+1}$ is the immediate reward obtained after executing action $A_t$.
- $\max_a Q(S_{t+1}, a)$ is the Q value of the best action in the next state, regardless of which action will actually be taken there.
- $\alpha$ is the learning rate, which determines how quickly we update the Q value.
- $\gamma$ is the discount factor, which balances the weight of immediate rewards and future rewards.

In this way, each Q-learning update adjusts the estimate toward the immediate reward plus the optimal Q value of the next state.
```python
import numpy as np
from tqdm import tqdm

def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    # `learning_rate` and `gamma` are hyperparameters defined elsewhere in the notebook.
    # Outer training loop: run n_training_episodes episodes.
    for episode in tqdm(range(n_training_episodes)):
        # Epsilon decay: as training progresses, epsilon gradually decreases,
        # reducing the probability of exploration and increasing exploitation.
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        # At the start of each episode, reset the environment and obtain the initial state.
        state, info = env.reset()
        step = 0
        terminated = False
        truncated = False

        # In each episode, the agent interacts with the environment for at most max_steps steps.
        for step in range(max_steps):
            # Use the epsilon-greedy policy to choose an action.
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            # Execute the action and observe the new state and new reward.
            new_state, reward, terminated, truncated, info = env.step(action)

            # Update the Q value:
            # Q(s,a) = Q(s,a) + learning_rate * (reward + gamma * max(Q(s',a')) - Q(s,a))
            Qtable[state][action] = Qtable[state][action] + learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )

            # If the episode ended in the current state (success or failure), stop this episode.
            if terminated or truncated:
                break

            # Set the agent's next state to the new state.
            state = new_state
    return Qtable
```
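As a usage sketch of the functions above (assuming the Gymnasium FrozenLake-v1 environment and hyperparameter values of my own choosing, not necessarily those of the course):

```python
import gymnasium as gym
import numpy as np

# Hypothetical hyperparameters for a small tabular problem (globals used by train).
learning_rate = 0.7
gamma = 0.95

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
Qtable = initialize_q_table(env.observation_space.n, env.action_space.n)

Qtable = train(
    n_training_episodes=10_000,
    min_epsilon=0.05,
    max_epsilon=1.0,
    decay_rate=0.0005,
    env=env,
    max_steps=99,
    Qtable=Qtable,
)

# After training, the greedy policy can be read straight off the Q-table.
print(np.argmax(Qtable, axis=1).reshape(4, 4))
```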
Off-policy vs On-policy#
Now, let's discuss the two types of policies—Off-policy and On-policy.
From the perspective of policy learning, the key difference between the two is:
- On-policy: Learns the policy currently being executed, where the agent's actual behavior aligns with the learned policy.
- Off-policy: Learns a policy different from the one currently being executed; even if the agent is executing an exploration policy, it updates based on the optimal policy.
Off-policy#
Off-policy methods use two different policies: one for selecting actions (the acting/behavior policy) and one for updating Q values (the updating policy). In the Q-learning algorithm introduced above, the agent selects actions with the epsilon-greedy policy (acting policy), but when updating Q values it uses the greedy policy to pick the best action.
For example, the update always evaluates the optimal action in the next state, i.e., it uses $\max_a Q(S_{t+1}, a)$.
This means that even though our current behavior policy is epsilon-greedy, we always update toward the greedy policy's estimate, which speeds up convergence. This is why Q-learning is referred to as an off-policy algorithm.
On-policy#
On-policy, on the other hand, requires the same policy to be used for both action execution and Q value updates. For example, in the SARSA algorithm, the action chosen during the update is also based on the current policy, rather than the greedy policy. This means the agent always uses the epsilon-greedy policy for both selecting and updating actions, leading to a learning process that is closer to the actual execution policy.
To understand with an example (see the code sketch after this list):
- SARSA (On-policy): suppose the agent is in state $S_t$, chooses action $A_t$, transitions to the new state $S_{t+1}$, and there chooses the next action $A_{t+1}$ with the same epsilon-greedy policy. SARSA uses $Q(S_{t+1}, A_{t+1})$ (the action actually chosen by the current policy) to update Q values.
- Q-learning (Off-policy): in the same situation, Q-learning also chooses action $A_t$ from $S_t$ and transitions to $S_{t+1}$. However, when updating Q values, it does not care which action will actually be taken in $S_{t+1}$; it uses the maximum Q value over all actions in $S_{t+1}$, $\max_a Q(S_{t+1}, a)$ (i.e., the optimal policy), to update.
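A minimal side-by-side sketch of the two update rules (the function names and signatures are my own, not from the course):

```python
import numpy as np

def sarsa_update(Qtable, s, a, r, s_next, a_next, lr, gamma):
    # On-policy: bootstrap from the action the behavior policy actually chose in s_next.
    td_target = r + gamma * Qtable[s_next][a_next]
    Qtable[s][a] += lr * (td_target - Qtable[s][a])

def q_learning_update(Qtable, s, a, r, s_next, lr, gamma):
    # Off-policy: bootstrap from the best action in s_next, whatever is actually taken next.
    td_target = r + gamma * np.max(Qtable[s_next])
    Qtable[s][a] += lr * (td_target - Qtable[s][a])
```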
Let's Dive Deeper!#
If it were just a simple game like "Frozen Lake," building a Q table would clearly allow us to obtain an excellent Q function in a relatively short time.
However, when the environment becomes a game like "Doom," which has a vast state space (millions of different states), the efficiency of creating and updating a Q table will be very low.
This is why we now introduce deep Q-learning, using a neural network to approximate the different Q values for each action given a state.
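As a preview of that idea, here is a minimal sketch (assuming PyTorch, with arbitrary layer sizes of my own choosing): instead of a table, a small network takes a state as input and outputs one Q value per action, and acting greedily is just taking the argmax of that output.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Approximates Q(s, .): input is a state vector, output is one Q value per action.
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)      # e.g. a CartPole-sized problem
q_values = q_net(torch.zeros(1, 4))             # Q values for every action in this state
greedy_action = q_values.argmax(dim=1).item()   # acting greedily = picking the argmax
```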