Nagi-ovo

Breezing homepage: [nagi.fun](nagi.fun)

"Fast Pass" PPO

Proximal Policy Optimization

Finally, we have arrived at one of the more popular RL algorithms in the NLP field in recent years.

In on-policy algorithms, the policy that collects the data is the same policy that is being trained. As a result, data must be discarded after it has been used and fresh data must be collected, which makes training slow.

The Intuition Behind PPO#

The idea of PPO is to improve the stability of policy training by limiting the changes to the policy during each training cycle: avoiding drastic policy updates.


This is due to two reasons:

  • Based on experience in this field, smaller policy updates during training are more likely to converge to the optimal solution.
  • In policy updates, a step that is too large may lead to "falling off a cliff" (resulting in a bad policy), requiring a long time to recover, or even never returning to the original level.

Clipped Surrogate Objective Function#

Review: Policy Objective Function#

Our goal is to push the agent toward actions that yield higher rewards and away from actions that may have negative effects. We do this by performing gradient ascent on the objective below (or, equivalently, gradient descent on its negative).

$$L^{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t | s_t) \, A_t \right]$$

  1. $\log \pi_\theta(a_t | s_t)$: The log probability of choosing action $a_t$ in state $s_t$, indicating how likely we are to take this action under the current policy.
  2. $A_t$: Advantage function; if $A_t > 0$, it indicates that this action is better than other possible actions in the current state; otherwise, it is worse.

However, the classic PG method has a problem: the choice of step size for policy updates is crucial.

  • If the step size is too small, the training process will be very slow;
  • If the step size is too large, the volatility during training may lead to instability.
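
To make the policy gradient objective concrete, here is a small sketch that evaluates $L^{PG}$ on a hand-made batch. The probabilities and advantages are invented numbers, and `pg_objective` is a hypothetical helper name rather than part of any library.

```python
import math

def pg_objective(probs, advantages):
    """Batch estimate of E_t[log pi(a_t | s_t) * A_t]."""
    terms = [math.log(p) * a for p, a in zip(probs, advantages)]
    return sum(terms) / len(terms)

# Invented batch: the probability the policy gave to each action it took,
# and the estimated advantage of that action.
advantages = [1.0, -0.5, 2.0]
probs_before = [0.5, 0.2, 0.9]
# A policy that raises the probability of the positive-advantage actions
# and lowers the probability of the negative-advantage action.
probs_after = [0.6, 0.1, 0.95]

before = pg_objective(probs_before, advantages)
after = pg_objective(probs_after, advantages)
print(f"before={before:.4f} after={after:.4f}")
```

The second policy scores higher, which is exactly the direction gradient ascent on this objective pushes in. Nothing in the objective itself limits how far a single step may go, which is the problem the next section addresses.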

Thus, PPO proposes a new approach, the Clipped Surrogate Objective Function, which ensures that policy updates are not too aggressive by clipping the range of policy changes, thereby maintaining the stability of the training process.

The new objective function is as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]$$

The Ratio Function#

The key part is the ratio function $r_t(\theta)$, which represents the ratio of action probabilities between the current policy and the previous policy:

$$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$

The ratio reflects the degree of deviation of the current policy from the old policy:

  • If $r_t(\theta) \in (0, 1)$, it indicates that the probability of choosing this action has decreased under the current policy.
  • If $r_t(\theta) > 1$, it indicates that action $a_t$ is more likely to be chosen under the current policy than before.
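
In practice the ratio is usually computed from log-probabilities, since those are what the policy network returns. A minimal sketch in plain Python (the function name `prob_ratio` and the example probabilities are assumptions made for illustration):

```python
import math

def prob_ratio(new_logprob, old_logprob):
    """r_t = pi_new(a|s) / pi_old(a|s), computed as exp(log pi_new - log pi_old)."""
    return math.exp(new_logprob - old_logprob)

old_logprob = math.log(0.25)  # old policy chose this action with probability 0.25

raised = prob_ratio(math.log(0.30), old_logprob)   # new policy likes the action more
lowered = prob_ratio(math.log(0.20), old_logprob)  # new policy likes the action less
same = prob_ratio(old_logprob, old_logprob)        # policy unchanged

print(raised, lowered, same)
```

Subtracting log-probabilities and exponentiating avoids dividing two potentially tiny probabilities directly.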

Unclipped Part#

The unclipped part of the formula is:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t \right] = \hat{\mathbb{E}}_t \left[ r_t(\theta) \hat{A}_t \right]$$

In the unclipped objective function, $r_t(\theta)$ is directly multiplied by the advantage value $\hat{A}_t$. If action $a_t$ turned out better than expected (i.e., $\hat{A}_t > 0$), maximizing the objective raises its probability; otherwise, it lowers it. This is the standard direction for policy gradient optimization.

However, as mentioned earlier, unconstrained policy updates may lead to training instability. If the ratio $r_t(\theta)$ is far greater than 1, the policy update will be too large, making it difficult for training to converge.

At this point, PPO introduces a clipping strategy, clipping the range of the ratio.

Clipping Part#

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]$$

Here, we see the introduction of the $\min$ operation. When the ratio $r_t(\theta)$ falls outside the interval $[1 - \epsilon, 1 + \epsilon]$, the clipping operation limits the ratio to this range, thus preventing excessive policy updates.

The clipping ratio function is:

$$\text{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right)$$

This means that if the ratio $r_t(\theta)$ falls outside the set interval (in the original paper, $\epsilon = 0.2$), it is clipped to $[0.8, 1.2]$, ensuring the stability of policy updates. We then take the minimum of the clipped and unclipped terms, so that the final objective is a conservative (pessimistic) estimate rather than an overly optimistic one.
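
As a rough sketch of the formula above for a single sample, written in plain Python rather than PyTorch (the names `clip` and `clipped_surrogate` and the example values are illustrative assumptions):

```python
def clip(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# Good action whose probability already rose by 50%: the gain is capped.
capped = clipped_surrogate(1.5, advantage=1.0)
# Bad action whose probability rose by 50%: the full penalty is kept.
penalized = clipped_surrogate(1.5, advantage=-1.0)
print(capped, penalized)
```

With a positive advantage the objective stops growing once the ratio passes $1 + \epsilon$, while with a negative advantage the unclipped (more negative) term wins the $\min$, so the penalty is not softened.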

Visualization#


First, remember that we take the minimum value between the clipped objective and the unclipped objective.

Case 1 and 2: Ratio Within Range#

In both cases, there is no clipping, and the policy will update according to the sign of $A_t$. This is the ideal state for PPO, where everything goes as expected.

  • Case 1: $A_t > 0$ and $r_t(\theta) \in [1 - \epsilon, 1 + \epsilon]$

    • The advantage function $A_t$ is positive, indicating that this action is better than expected.
    • $r_t(\theta)$ is within this range, indicating that the policy change is small, and we want to encourage this action, so no clipping is performed.
    • Result: The objective function is positive, and the gradient update will push the policy further toward executing this action.
  • Case 2: $A_t < 0$ and $r_t(\theta) \in [1 - \epsilon, 1 + \epsilon]$

    • The advantage function is negative, indicating that this action is worse than expected.
    • Similarly, since the ratio is within range, no clipping is performed. We want to reduce the execution of this action.
    • Result: The objective function is negative, and the gradient update will move the policy away from executing this action.

Case 3 and 4: Ratio Below Range#

Here, the ratio indicates that the current policy assigns this action a lower probability than the old policy did. What will happen?

  • Case 3: $A_t > 0$ and $r_t(\theta) < 1 - \epsilon$

    • The action is good (advantage function is positive), but the new policy has lowered the probability of this action.
    • The $\min$ selects the unclipped term, so no clipping takes effect: we want to increase the probability of this good action, and doing so moves the ratio back toward the interval.
    • Result: The objective function is positive, and the gradient encourages this action.
  • Case 4: $A_t < 0$ and $r_t(\theta) < 1 - \epsilon$

    • The action is poor (advantage function is negative), and the policy is already reducing the probability of this action.
    • However, we perform clipping because the ratio is already below $1 - \epsilon$, and further reduction may lead to excessive punishment, causing instability in training.
    • Result: The objective function is clipped to the constant $(1 - \epsilon) A_t$, so this sample contributes no gradient and does not push the action probability any lower.

Case 5 and 6: Ratio Above Range#

Here, the new policy assigns this action a noticeably higher probability than the old policy did.

  • Case 5: $A_t > 0$ and $r_t(\theta) > 1 + \epsilon$

    • The action is good (advantage function is positive), and the new policy has already raised its probability by more than the allowed margin.
    • We perform clipping because we do not want the policy to overly favor this action. Even if $A_t$ is positive, we need to limit the update step of the policy.
    • Result: The objective function is clipped to the constant $(1 + \epsilon) A_t$, so this sample contributes no gradient, limiting the change in the policy.
  • Case 6: $A_t < 0$ and $r_t(\theta) > 1 + \epsilon$

    • The action is poor, but the policy has increased its execution probability. This is clearly not what we want.
    • Although the ratio has exceeded the range, the $\min$ selects the unclipped term, so no clipping takes effect.
    • Result: The objective function is negative, and the gradient strongly pushes the policy away from this poor action.
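
The six cases can be checked mechanically. The sketch below evaluates the per-sample objective for one representative ratio and advantage per case and reports whether the clipped branch is the one selected by the $\min$. The specific ratios (1.1, 0.9, 0.5, 1.5) and the helper name `evaluate_case` are assumptions chosen for illustration, with $\epsilon = 0.2$.

```python
def evaluate_case(ratio, advantage, eps=0.2):
    """Return (objective value, True if the clipped term is strictly selected)."""
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    # When the clipped term is strictly smaller, the objective is a constant
    # with respect to the ratio, so this sample yields no gradient.
    return min(unclipped, clipped), clipped < unclipped

cases = [
    ("Case 1: A > 0, ratio in range", 1.1, 1.0),
    ("Case 2: A < 0, ratio in range", 0.9, -1.0),
    ("Case 3: A > 0, ratio below range", 0.5, 1.0),
    ("Case 4: A < 0, ratio below range", 0.5, -1.0),
    ("Case 5: A > 0, ratio above range", 1.5, 1.0),
    ("Case 6: A < 0, ratio above range", 1.5, -1.0),
]

clipped_flags = []
for name, ratio, advantage in cases:
    value, is_clipped = evaluate_case(ratio, advantage)
    clipped_flags.append(is_clipped)
    print(f"{name}: objective={value:+.2f}, clipped={is_clipped}")
```

Only cases 4 and 5 come out clipped, matching the walkthrough above: those are the cases where following the advantage would push the ratio even further from the interval.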

Why is the Gradient Zero in Clipped Cases?#

The reason is that when the ratio $r_t(\theta)$ is clipped to $1 - \epsilon$ or $1 + \epsilon$, the term being differentiated is no longer $r_t(\theta) A_t$ but the constant $(1 - \epsilon) A_t$ or $(1 + \epsilon) A_t$. Neither of these depends on $\theta$, so their derivatives are 0.
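
A quick numerical check of this claim, as a sketch: estimate the slope of the per-sample objective with respect to the ratio by finite differences. The gradient with respect to $\theta$ is this slope times the gradient of the ratio, so a zero slope means no update from that sample. The function names and step size `h` are illustrative assumptions.

```python
def objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate objective."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective) / d(ratio)."""
    upper = objective(ratio + h, advantage, eps)
    lower = objective(ratio - h, advantage, eps)
    return (upper - lower) / (2 * h)

slope_clipped = slope_wrt_ratio(1.5, advantage=1.0)     # Case 5: clipped
slope_unclipped = slope_wrt_ratio(1.5, advantage=-1.0)  # Case 6: not clipped
slope_in_range = slope_wrt_ratio(1.0, advantage=1.0)    # Case 1: in range
print(slope_clipped, slope_unclipped, slope_in_range)
```

In the clipped case the slope is zero, while in the unclipped cases it equals the advantage.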

Summary#

In summary, the goal of PPO is to limit the range of changes between the current policy and the old policy through the Clipped Surrogate Objective. We remove the incentive for the probability ratio to move outside the interval $[1 - \epsilon, 1 + \epsilon]$: once the ratio has left this interval in the direction favored by the advantage, the gradient becomes 0, and that sample no longer updates the policy.

During the PPO update process, we only update the policy in two cases:

  1. When the ratio $r_t(\theta)$ falls within the interval $[1 - \epsilon, 1 + \epsilon]$.
  2. When the ratio is outside the interval, but the advantage function guides the ratio back toward the interval.

Finally, let's review: the full loss used to train PPO consists of three parts:

  • Clipped Surrogate Objective function: Limits the range of policy updates.
  • Value Loss Function: Minimizes the mean squared error between the value function's predictions and the return targets.
  • Entropy Bonus: Used to maintain sufficient exploration to prevent the policy from prematurely getting stuck in local optima.

These three components work together to ensure that PPO can update the policy stably while maintaining sufficient exploration.
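
For reference, the PPO paper writes the combination as a single objective to be maximized. Here $c_1$ and $c_2$ are weighting coefficients, $L_t^{VF}$ is the squared-error value loss against a target $V_t^{\text{targ}}$, and $S$ is the entropy of the policy at state $s_t$:

```latex
L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right],
\qquad
L_t^{VF}(\theta) = \left( V_\theta(s_t) - V_t^{\text{targ}} \right)^2
```

Implementations that minimize a loss simply negate this expression, which flips the sign of each of the three terms.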

Code Implementation#

Now let's delve into the implementation of PPO from a coding perspective. Focusing on the most critical parts of ppo.py in cleanrl, we will explain its working principle in a concise manner.

1. Structure of Policy Network and Value Network#

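import numpy as np
import torch
import torch.nn as nn


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight initialization with a constant bias, as defined in cleanrl's ppo.py
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):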
    def __init__(self, envs):
        super().__init__()
        # Critic network: maps states to values (uses a neural network to estimate the quality of states)
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0), # Initialization of the last layer is crucial for learning stability
        )
        # Actor network: maps states to action logits (unnormalized log-probabilities that define the policy)
        self.actor = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, envs.single_action_space.n), std=0.01), # A smaller standard deviation ensures the initial policy is approximately uniform
        )

This is a typical dual-network architecture:

  • The actor (policy network) outputs logits that define the probability distribution over actions.
  • The critic (value network) predicts the value of states.
  • Both networks are simple MLPs with two hidden layers of 64 units each.
  • Orthogonal initialization is used to help with training stability.

2. Implementation of GAE (Generalized Advantage Estimation)#

# GAE calculation: backward recursion to compute the advantage function and return values
with torch.no_grad():
    next_value = agent.get_value(next_obs).reshape(1, -1)
    advantages = torch.zeros_like(rewards).to(device)
    lastgaelam = 0
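    for t in reversed(range(args.num_steps)):
        # Pick the bootstrap value and the "episode did not end" mask for step t + 1
        if t == args.num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # Elegant implementation of GAE (Generalized Advantage Estimation)
        # Can be understood as the exponentially weighted sum of temporal difference errors
        delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam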
    returns = advantages + values  # Return values = Advantage function + Value estimate

This code demonstrates the recursive calculation process of GAE:

  • It computes the TD error (delta) from back to front.
  • It accumulates these TD errors in an exponentially weighted manner.
  • The gamma and lambda hyperparameters control the bias-variance trade-off of the advantage estimates.
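
The same recursion can be sketched for a single environment in plain Python, which makes it easy to check by hand. The function name `compute_gae` and the toy trajectory are assumptions for illustration; the `dones` convention follows the excerpt above, where `dones[t + 1]` marks that the episode ended after step `t`.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Backward GAE recursion over one trajectory; returns (advantages, returns)."""
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalue = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalue = values[t + 1]
        # One-step TD error at step t
        delta = rewards[t] + gamma * nextvalue * nextnonterminal - values[t]
        # Exponentially weighted accumulation of TD errors
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

rewards = [1.0, 1.0, 1.0]
no_dones = [0.0, 0.0, 0.0]

# lambda = 1, gamma = 1, zero value estimates: advantages are rewards-to-go.
mc_adv, _ = compute_gae(rewards, [0.0, 0.0, 0.0], no_dones, 0.0, 0.0,
                        gamma=1.0, gae_lambda=1.0)
# lambda = 0: advantages reduce to the one-step TD errors.
td_adv, _ = compute_gae(rewards, [0.5, 0.5, 0.5], no_dones, 0.5, 0.0,
                        gamma=1.0, gae_lambda=0.0)
print(mc_adv, td_adv)
```

The two extremes show the trade-off: `gae_lambda=1` behaves like a Monte Carlo estimate (low bias, high variance), while `gae_lambda=0` relies entirely on the critic's one-step estimate (low variance, more bias).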

3. Core Loss Function Calculation of PPO#

# The core of PPO: improving the policy while preventing excessive changes
_, newlogprob, entropy, newvalue = agent.get_action_and_value(b_obs[mb_inds], b_actions.long()[mb_inds])
ratio = (newlogprob - b_logprobs[mb_inds]).exp()  # Importance sampling ratio pi_new / pi_old
mb_advantages = b_advantages[mb_inds]  # Advantages for the current minibatch

# The famous PPO-Clip objective function (negated, because the optimizer minimizes)
pg_loss1 = -mb_advantages * ratio
pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)
pg_loss = torch.max(pg_loss1, pg_loss2).mean()  # Pessimistic (worst-case) policy loss

# The value function loss also uses clipping to stay close to old predictions
newvalue = newvalue.view(-1)
if args.clip_vloss:
    v_loss_unclipped = (newvalue - b_returns[mb_inds]) ** 2
    v_clipped = b_values[mb_inds] + torch.clamp(
        newvalue - b_values[mb_inds],
        -args.clip_coef,
        args.clip_coef,
    )
    v_loss_clipped = (v_clipped - b_returns[mb_inds]) ** 2
    v_loss = 0.5 * torch.max(v_loss_unclipped, v_loss_clipped).mean()
else:
    v_loss = 0.5 * ((newvalue - b_returns[mb_inds]) ** 2).mean()

# Comprehensive loss function: combining policy loss, value loss, and entropy reward term for exploration
entropy_loss = entropy.mean()
loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef

This implements the three key components mentioned earlier:

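  • The Clipped Surrogate Objective is implemented with torch.clamp for the clipping and torch.max for the pessimistic choice: because the code minimizes the negated objective, the min in the formula becomes a max over the two negated terms.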
  • The value function loss also uses a clipping mechanism (this is a feature of OpenAI's implementation).
  • The entropy reward term is used to encourage exploration.
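
The switch from $\min$ in the formula to `max` in the code can be confusing, so here is a small plain-Python check that the two forms agree up to sign. The function names and the grid of test values are illustrative assumptions.

```python
def clip(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def objective_form(ratio, advantage, eps=0.2):
    """Paper form: per-sample objective to maximize."""
    return min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)

def loss_form(ratio, advantage, eps=0.2):
    """Code form: per-sample loss to minimize, mirroring pg_loss1 and pg_loss2."""
    pg_loss1 = -advantage * ratio
    pg_loss2 = -advantage * clip(ratio, 1 - eps, 1 + eps)
    return max(pg_loss1, pg_loss2)

# Largest disagreement between loss and negated objective over a small grid.
max_gap = max(
    abs(loss_form(r, a) + objective_form(r, a))
    for r in (0.5, 0.9, 1.0, 1.1, 1.5)
    for a in (-2.0, -1.0, 1.0, 2.0)
)
print(max_gap)
```

Since $\max(-x, -y) = -\min(x, y)$, minimizing the code's loss is the same as maximizing the paper's objective.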

The IPFS platform crashed some time ago, which delayed the release of this article. After this is published, the Huggingface DeepRL series will officially conclude!
