
Exploring the evolution from value-based methods (DQN) to policy-based methods (Policy Gradient), and understanding the differences and connections between the two.
From DQN to Policy Gradient
:::note{title="Review"} Q-Learning is an algorithm for training the Q-function, an action-value function that determines the value of taking a specific action in a specific state. It maintains a Q-table that stores the value of every state-action pair. :::
For Atari games like Space Invaders, we need to use game frames as states and inputs. A single frame is 210x160 pixels with 3 color channels (RGB), so the observation space shape is (210, 160, 3). Each pixel value ranges from 0 to 255, giving $256^{210 \times 160 \times 3}$ possible observations.

In such cases, generating and updating a Q-table is extremely inefficient. Therefore, we use Deep Q-Learning instead of tabular methods like Q-Learning, choosing a neural network as the Q-function approximator. This network approximates the Q-values for every possible action given a state.
DQN
Input Preprocessing and Temporal Limitations
We aim to reduce state complexity to decrease the computation time required for training.

Grayscaling
Color doesn’t provide essential information, so three color channels (RGB) can be reduced to one.
Screen Cropping
Regions that do not contain important information can be cropped out.
Capturing Temporal Information
Movement information (direction, speed) of a pixel cannot be determined from a single frame. To obtain temporal information, we stack four frames together.
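To make these steps concrete, here is a minimal NumPy sketch of grayscaling, cropping, and stacking four frames. The crop bounds are illustrative assumptions (they vary per game), and a real Atari pipeline additionally resizes each frame to 84x84 so that the result matches the Linear(3136, 512) layer of the network shown below.

from collections import deque
import numpy as np

FRAME_STACK = 4  # number of frames stacked to recover direction/speed information

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Grayscale and crop a single (210, 160, 3) Atari frame."""
    gray = frame.mean(axis=2).astype(np.uint8)  # collapse the 3 RGB channels into 1
    return gray[20:200, :]                      # drop rows without game information (assumed bounds)

frames = deque(maxlen=FRAME_STACK)              # sliding window over the last 4 frames

def stacked_state(frame: np.ndarray) -> np.ndarray:
    """Append the newest preprocessed frame and return the stacked state."""
    frames.append(preprocess(frame))
    while len(frames) < FRAME_STACK:            # at episode start, pad by repeating the first frame
        frames.append(frames[0])
    return np.stack(frames, axis=0)             # shape (4, H, W): the CNN input described next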
CNN
Stacked frames are processed through three convolutional layers to capture and utilize spatial relationships within the images. Additionally, because frames are stacked, we can obtain temporal information across frames.
MLP
Finally, fully connected layers serve as the output, providing a Q-value for each possible action in that state.

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4),   # 4 stacked frames in, 32 feature maps out
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),                    # flatten the multidimensional feature maps to 1D
            nn.Linear(3136, 512),            # fully connected layer mapping the 3136-D input to 512-D
            nn.ReLU(),
            nn.Linear(512, env.single_action_space.n),  # output layer: one Q-value per action
        )

    def forward(self, x):
        return self.network(x / 255.0)       # normalize pixel input to the [0, 1] range
The network's input is the 4-frame stack representing the current state; its output is a vector of Q-values, one for each possible action in that state. As in Q-Learning, we then use an epsilon-greedy policy to choose which action to take.
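A minimal sketch of that epsilon-greedy step on top of the network output (the epsilon decay schedule is omitted here; q_network and device are assumed to be defined as in the surrounding snippets):

import random
import numpy as np
import torch

def select_action(q_network, obs, epsilon, n_actions, device):
    """Epsilon-greedy: random action with probability epsilon, otherwise the argmax-Q action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)            # explore
    obs_t = torch.as_tensor(np.array(obs), dtype=torch.float32, device=device).unsqueeze(0)
    with torch.no_grad():
        q_values = q_network(obs_t)                   # shape (1, n_actions)
    return int(torch.argmax(q_values, dim=1).item())  # exploit: action with the highest Q-value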
During the training phase, we no longer directly update the Q-values of state-action pairs as in Q-Learning. Instead, we optimize the weights of the DQN by designing a loss function and using gradient descent.

Training Workflow
The Deep Q-Learning training algorithm has two phases (a skeleton of the full loop is sketched after the list):
- Sampling: Executing actions and storing the observed experience tuples in a replay buffer.
- Training: Randomly selecting a small batch of tuples and learning from that batch using a gradient descent update step.
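Putting the two phases together, a skeleton of the loop might look like the sketch below. This is a rough illustration, not a definitive implementation: args, envs, rb, q_network, target_network, optimizer, and the select_action helper are assumed to be defined as in the other snippets in this article, a single vectorized environment is assumed for simplicity, and the argument names (e.g. args.exploration_steps) are illustrative.

import numpy as np
import torch
import torch.nn.functional as F

obs, _ = envs.reset(seed=args.seed)
for global_step in range(args.total_timesteps):
    # --- Sampling phase: act, observe, and store the experience tuple ---
    epsilon = max(args.end_e, args.start_e - (args.start_e - args.end_e) * global_step / args.exploration_steps)
    actions = np.array([select_action(q_network, obs[0], epsilon, envs.single_action_space.n, device)])
    next_obs, rewards, terminations, truncations, infos = envs.step(actions)
    rb.add(obs, next_obs, actions, rewards, terminations, infos)
    obs = next_obs

    # --- Training phase: sample a minibatch and take one gradient step ---
    if global_step > args.learning_starts and global_step % args.train_frequency == 0:
        data = rb.sample(args.batch_size)
        with torch.no_grad():  # basic TD target; refined by Fixed Q-Targets / Double DQN below
            target_max, _ = target_network(data.next_observations).max(dim=1)
            td_target = data.rewards.flatten() + args.gamma * target_max * (1 - data.dones.flatten())
        old_val = q_network(data.observations).gather(1, data.actions).squeeze()
        loss = F.mse_loss(td_target, old_val)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()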

Because Deep Q-Learning combines off-policy learning, a non-linear function approximator for the Q-values (a neural network), and bootstrapping (building update targets from current estimates rather than actual complete returns, which introduces bias), its training can be unstable. This three-way combination is exactly what Sutton and Barto call the “deadly triad”.
Stabilizing Training
To help stabilize training, we implement three different solutions:
- Experience Replay for more efficient use of experience.
- Fixed Q-Targets to stabilize training.
- Double DQN to solve the problem of Q-value overestimation.
Experience Replay
Experience replay in Deep Q-Learning serves two functions:
- More efficient use of experience during training. In typical online reinforcement learning, the agent interacts with the environment, gains experience (state, action, reward, next state), learns from it (updates the neural network), and then discards it, which is wasteful. Experience replay instead stores experience samples in a replay buffer so they can be reused during training.

The agent can learn multiple times from the same experience.
- Avoid forgetting previous experience (catastrophic interference, or catastrophic forgetting) and reduce the correlation between experiences. The replay buffer stores experience tuples during interaction and then draws small random batches from them. This prevents the network from learning only from the most recently executed actions: randomly sampled experiences are more diverse, which reduces overfitting to short-term states and avoids severe oscillation or catastrophic divergence of the action values. A minimal sketch of such a buffer follows below.
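Independent of the stable-baselines3 ReplayBuffer used in the snippet that follows, here is a minimal sketch of what such a buffer does internally (bounded FIFO storage plus uniform random sampling):

import random
from collections import deque

class SimpleReplayBuffer:
    """Minimal replay buffer: bounded storage of experience tuples with uniform sampling."""

    def __init__(self, capacity: int):
        self.storage = deque(maxlen=capacity)  # the oldest experiences are discarded once full

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the correlation between consecutive
        # transitions and lets the same experience be reused many times.
        idxs = random.sample(range(len(self.storage)), batch_size)
        return [self.storage[i] for i in idxs]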

Sampling experience and calculating loss:
from stable_baselines3.common.buffers import ReplayBuffer  # assumed source of ReplayBuffer (its constructor matches these arguments)

rb = ReplayBuffer(
    args.buffer_size,                  # size of the replay buffer, i.e. how much experience to store
    envs.single_observation_space,
    envs.single_action_space,
    device,
    optimize_memory_usage=True,
    handle_timeout_termination=False,
)

if global_step > args.learning_starts:
    if global_step % args.train_frequency == 0:
        data = rb.sample(args.batch_size)  # randomly sample a minibatch
        with torch.no_grad():
            target_max, _ = target_network(data.next_observations).max(dim=1)
            td_target = data.rewards.flatten() + args.gamma * target_max * (1 - data.dones.flatten())
        old_val = q_network(data.observations).gather(1, data.actions).squeeze()
        loss = F.mse_loss(td_target, old_val)

Fixed Q-Targets
A key issue in Q-Learning is that parameters for the TD target (Q-Target) and the current Q-value (Q-estimate) are shared. This leads to both the Q-target and Q-estimate changing simultaneously, like chasing a moving target. A beautiful analogy is a cowboy (Q-estimate) trying to catch a moving cow (Q-target). Although the cowboy gets closer to the cow (error decreases), the target keeps moving, causing significant oscillations during training.



Love this representation 🥹
To solve this, we introduce Fixed Q-Targets. The core idea is to introduce an independent network that doesn’t update at every timestep but instead copies parameters from the main network every C steps. This means our target (Q-Target) remains fixed over multiple timesteps and the network is updated only against old estimates. This significantly reduces the oscillation between target and estimate.

As shown in the pseudo-code above, the key is using two different networks: a main network (for choosing actions and updating) and a target network (for calculating Q-Targets), with main network weights copied to the target network every C steps. This stabilizes the training process, allowing the “cowboy to catch the cow more effectively,” reducing oscillations and speeding up convergence.
q_network = QNetwork(envs).to(device)                    # current policy network: selects actions and predicts Q-values
optimizer = optim.Adam(q_network.parameters(), lr=args.learning_rate)
target_network = QNetwork(envs).to(device)               # target network: provides stable TD targets
target_network.load_state_dict(q_network.state_dict())   # initialization: target network parameters match the policy network

Every args.target_network_frequency steps, the main network parameters are copied in full to the target network. This keeps the Q-Target fixed over multiple timesteps, significantly reducing the oscillation between target and estimate (with tau = 1.0, the update below is a complete hard copy).

tau = 1.0
if global_step % args.target_network_frequency == 0:
    for target_param, param in zip(target_network.parameters(), q_network.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)

Double DQN
Double DQN was proposed by Hado van Hasselt specifically to solve the problem of Q-value overestimation.
When calculating the TD target in Q-Learning, a natural question arises: how do we know that the best action for the next state really is the action with the highest estimated Q-value? The accuracy of Q-values depends on which actions we have tried and which neighboring states we have explored, so early in training there is not enough information about the best action, and choosing the action with the highest Q-value can be a misjudgment.
If a sub-optimal action is regularly assigned a higher Q-value than the true best action, learning becomes noisy and convergence is harder. To address this, Double DQN uses two networks to decouple action selection from the generation of the target Q-value:
- Main Network (DQN Network): Used to select the best action for the next state (i.e., the action with the highest Q-value).
- Target Network: Used to calculate the target Q-value resulting from executing that action.

:::note{title="Key Point"} Use the main network to select the best action, but use the target network to calculate its Q-value. This reduces Q-value overestimation and makes training more stable and efficient. :::
with torch.no_grad():
    # Use the main network to select the best action for the next state
    next_q_values = q_network(data.next_observations)
    next_actions = torch.argmax(next_q_values, dim=1, keepdim=True)
    # Use the target network to evaluate the Q-value of that action
    target_q_values = target_network(data.next_observations)
    target_max = target_q_values.gather(1, next_actions).squeeze()
    # Calculate the TD target
    td_target = data.rewards.flatten() + args.gamma * target_max * (1 - data.dones.flatten())

# Calculate the current Q-value estimate and the loss
old_val = q_network(data.observations).gather(1, data.actions).squeeze()
loss = F.mse_loss(td_target, old_val)

Modern deep reinforcement learning includes further improvements such as Prioritized Experience Replay and Dueling Networks, which are not covered here.
Optuna
One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters. Optuna is a library that helps you automate the search for the best hyperparameter combinations.
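As a rough illustration of how such a search is set up (the hyperparameter names, ranges, and the train_and_evaluate routine here are placeholders, not recommendations from this article):

import optuna

def objective(trial: optuna.Trial) -> float:
    # Sample one candidate hyperparameter combination per trial
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float("gamma", 0.90, 0.999)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    # train_and_evaluate is a placeholder for your own training routine,
    # returning e.g. the mean episodic return of the trained agent
    return train_and_evaluate(learning_rate=learning_rate, gamma=gamma, batch_size=batch_size)

study = optuna.create_study(direction="maximize")  # higher return is better
study.optimize(objective, n_trials=50)
print(study.best_params)                           # best combination found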
Policy Gradient
The previously mentioned Q-Learning and DQN are both value-based methods, which find optimal policies indirectly by estimating value functions. The policy ($\pi$) exists only implicitly, since it is generated from the value function (e.g., a greedy policy choosing the action with the highest value in a given state).
In contrast, with policy-based methods, we aim to directly optimize the policy, bypassing the intermediate step of learning a value function. Next, we will dive into a subset called Policy Gradient.
In policy-based methods, optimization is mostly on-policy, because each update only uses data (action trajectories) collected by the latest version of the policy $\pi_\theta$.
Parameterized Stochastic Policy
For example, have the neural network output a probability distribution over actions (a stochastic policy):

$$\pi_\theta(a \mid s) = P[A_t = a \mid S_t = s; \theta]$$
The objective function is $J(\theta)$ and the parameters to optimize are $\theta$: we maximize the performance of the parameterized policy through gradient ascent.
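A minimal sketch of such a parameterized stochastic policy for a discrete action space (the layer sizes and names here are illustrative):

import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Stochastic policy pi_theta(a|s): maps a state to a distribution over actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),  # unnormalized scores (logits), one per action
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        logits = self.net(state)
        return Categorical(logits=logits)  # softmax over the logits gives the action probabilities

# Usage: dist = policy(state); action = dist.sample(); log_prob = dist.log_prob(action)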
Advantages
Integration Convenience
- The policy can be estimated directly, without storing additional data (action values); the approach can be understood as end-to-end.
Ability to Learn Stochastic Policies
Since the output is a probability distribution over actions, the agent explores the state space without always following the same trajectory, and there is no need to hand-code an exploration/exploitation trade-off. DQN, by contrast, learns a deterministic policy; randomness is only introduced through tricks like ε-greedy and is not an inherent property of value-based methods. A stochastic policy also naturally handles state uncertainty, addressing the problem of perceptual aliasing.
:::note{title="Terminology"} Perceptual aliasing occurs when two states look (or are) identical but require different actions. :::
For example, in the scenario below, a vacuum cleaner agent must suck up dust while avoiding a hamster. The vacuum can only perceive the position of walls. In the diagram, these two red states are called “aliased states” because in these states, the agent perceives the same wall positions—walls both above and below. This leads to state ambiguity, making it impossible to distinguish which specific red state it is in.

Using a deterministic policy, the vacuum would always move right or left in a red state; if it chooses the wrong direction, it gets stuck in a loop. Even with an ε-greedy policy, the vacuum mainly follows the “best” policy but might repeatedly explore the wrong direction in an aliased state, leading to inefficiency.

Effective in High-Dimensional and Continuous Action Spaces
Policy gradient methods are particularly effective in high-dimensional or continuous action spaces.
An autonomous car might have infinitely many action choices in each state—turning the steering wheel by 15°, 17.2°, 19.4°, or other actions like honking. Deep Q-Learning would have to calculate a Q-value for every possible action, and finding the maximum Q-value in a continuous action space is itself an optimization problem. 
Better Convergence
In value-based methods, the policy is updated by taking $\arg\max_a Q(s, a)$, the action with the maximum Q-value. In that case, even a slight change in the Q-values can flip the action selection entirely: if the Q-value for turning left is 0.22 and the Q-value for turning right then becomes 0.23, the greedy policy switches from almost always turning left to almost always turning right.
In policy gradient methods, action probabilities change smoothly over time, making the policy more stable.
Disadvantages
Local Optima
Often converges to local optima rather than the global optimum.
Low Training Efficiency
Training is slow and inefficient.
High Variance
Variance is large. We will explore the reasons and solutions for this later in Actor-Critic.
Detailed Analysis
Policy gradient adjusts the parameters $\theta$ (and therefore the policy) with each interaction between the agent and the environment, so that the action probability distribution samples more “good” actions, i.e., actions that maximize the return.

Objective Function

Our goal is to find the parameters $\theta$ that maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau} P(\tau; \theta)\, R(\tau)$$

Since we want to maximize this objective, we use gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
However, the true gradient of the objective function cannot be computed exactly, because doing so would require summing over every possible trajectory, which is computationally prohibitive. Instead, we estimate the gradient from samples, i.e., from a set of collected trajectories.
Additionally, the state transition probabilities (or state distribution) of the environment are often unknown. Even if known, they are complex and non-linear, making it impossible to directly calculate derivatives. In other words, we cannot directly differentiate the state transition dynamics (controlled by a Markov Decision Process) to optimize the policy.
Policy Gradient Theorem
A full derivation can be found on Andrej Karpathy’s blog, and I’ve also summarized my learning here: Introduction to Policy Gradient 6. PG Derivation
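For reference, the key step of the derivation is that differentiating the log-probability of a trajectory makes the (unknown) dynamics term vanish, because the transition probabilities do not depend on $\theta$:

$$
\nabla_\theta \log P(\tau; \theta) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) + \underbrace{\sum_{t} \nabla_\theta \log P(s_{t+1} \mid s_t, a_t)}_{=\,0}
$$

which leaves the policy gradient in a form we can estimate from sampled trajectories:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]
$$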

REINFORCE Algorithm (Monte Carlo Policy Gradient)
REINFORCE uses the estimated return of an entire episode to update the policy parameters $\theta$: collect an episode with the policy $\pi_\theta$, use that episode to estimate the gradient $\hat{g} \approx \nabla_\theta J(\theta)$, and then update $\theta \leftarrow \theta + \alpha \hat{g}$, where

$$\hat{g} = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$

$\nabla_\theta \log \pi_\theta(a_t \mid s_t)$: the gradient of the log-probability of taking action $a_t$ in state $s_t$, i.e., the direction in parameter space that most increases that probability. Note that we are not calculating a specific action value (Q-value).
$R(\tau)$: the cumulative return over the entire trajectory $\tau$, used to measure the total reward obtained by following the policy. A high return increases the probability of every (state, action) pair along the trajectory, while a low return decreases it.
We can also collect multiple episodes (trajectories) to estimate the gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)$$
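A minimal single-episode REINFORCE update built on the PolicyNetwork sketched earlier; env is assumed to be a Gymnasium environment with a discrete action space, and the discount factor is an illustrative choice:

import torch

def reinforce_update(policy, optimizer, env, gamma: float = 0.99):
    """Run one episode with the current policy, then take one gradient-ascent step."""
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()                        # sample a_t ~ pi_theta(.|s_t)
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Total discounted return R(tau) of the episode, as in the update rule above
    # (using the return from timestep t onward is a common lower-variance variant)
    episode_return = sum(gamma ** t * r for t, r in enumerate(rewards))

    # Gradient ascent on sum_t log pi_theta(a_t|s_t) * R(tau), written as a loss to minimize
    loss = -torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()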
References