
May 8, 2025 · 50 min read
Deep Learning · RLHF · LLM

Human-Crafted

Written directly by the author with no AI-generated sections.

From RL to RLHF

This article is primarily based on Umar Jamil's course [1], written for learning and note-taking purposes. Our goal is to align LLM behavior with the outputs we want, and RLHF is one of the most prominent techniques for achieving this. Its standard pipeline involves four models (which sounds very VRAM-intensive, which is why many methods optimize by dropping some of them), but for now just remember that there are four in total: the Reward, Actor, Critic, and Reference models. The model we ultimately optimize is the Actor.

LLM to RL

In classical RL, a policy tells you the probability of each action you could take in the current state. In that sense, a language model itself can be viewed as a policy: it receives a prompt (state), outputs a probability distribution over the next token (action), and after sampling arrives at a new state (the sampled token appended to the prompt). It is effectively a policy with an action space of size vocab_size, which makes the LLM an RL agent.
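The loop below is a minimal sketch of this "LLM as policy" view, one state transition only; it assumes the Hugging Face transformers library and uses the public gpt2 checkpoint purely as a stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

state = tokenizer("Shall we start with", return_tensors="pt").input_ids  # state s_t: the prompt
logits = model(state).logits[:, -1, :]            # policy output over the action space (the vocab)
probs = torch.softmax(logits, dim=-1)             # pi(a | s_t)
action = torch.multinomial(probs, num_samples=1)  # sample an action: the next token
state = torch.cat([state, action], dim=-1)        # new state s_{t+1}: token appended to the prompt
```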

So, we are still missing something to provide a Reward (in traditional RL, this is usually a reward function built into the environment).

Creating a "Q-A-Reward" dataset could provide this, but humans are bad at agreeing on absolute scores, while they are very good at comparing quality. So we change direction: sample multiple answers from the model at high temperature, then ask domain experts (humans or AI models) to select the chosen / preferred answer, producing a labeled preference dataset. We use it to train a Reward Model that outputs numerical rewards.

Reward Model

This RM is implemented using a pre-trained LLM like Llama.

[!note] In text generation, we feed the prompt into the Transformer, take the hidden state of the last token, send it through a linear layer that projects it onto the vocabulary to get logits, and then use softmax plus a sampling strategy to select the next token.

When we don't want to generate text but rather a numerical reward, we can replace the linear layer that projects onto the vocabulary with a linear layer that has a single output feature (a scalar), used to produce one score for the entire text sequence.
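A minimal sketch of that architecture, assuming the Hugging Face transformers library; AutoModelForSequenceClassification with num_labels=1 gives exactly a "one scalar per sequence" head on top of a pretrained LM (gpt2 here is just a small placeholder, in practice a model like Llama is used):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("Question: ... Answer: ...", return_tensors="pt")
reward = reward_model(**inputs).logits  # shape (batch, 1): one scalar reward per sequence
```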

(Figure: a pretrained LM whose vocabulary projection is replaced by a scalar-output linear layer, i.e. the Reward Model.)

Reward Model Loss

[!tip] During training, we want this model to generate high rewards for chosen answers and low rewards for rejected answers.

Similar to the Bradley-Terry parameterization form:

$$ Loss = -\log \sigma(r(x, y_w) - r(x, y_l)) $$

$y_w$ denotes the chosen answer and $y_l$ the rejected one. Therefore, when the model gives the chosen answer a higher reward,

$$r(x, y_w) - r(x, y_l) > 0$$

$$\sigma(r(x, y_w) - r(x, y_l)) \in (0.5, 1)$$

$$-\log \sigma(r(x, y_w) - r(x, y_l)) \in (0, \log 2)$$

so the loss is low, while if the model gives the chosen answer a low reward, the loss becomes very high.
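In PyTorch this pairwise loss is a one-liner; a sketch, assuming r_chosen and r_rejected are the scalar scores the reward model assigns to a batch of chosen and rejected answers:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```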

(Figures: worked examples of the reward model loss for high and low chosen-answer rewards.)

The RewardTrainer class in Hugging Face's TRL library accepts an AutoModelForSequenceClassification model (which is exactly the structure described above).

(Figure: RewardTrainer usage example.)

Actor & Critic Model

Trajectories

As mentioned earlier, the core goal of reinforcement learning (RL) is to find a policy $\pi$ that guides the agent's actions so as to obtain the maximum possible expected return.

Mathematically, we express this as finding the optimal policy $\pi^*$ that maximizes the objective function $J(\pi)$:

$$\pi^* = \arg\max_{\pi} J(\pi)$$

The expected return $J(\pi)$ is the average total return the agent is expected to accumulate over many possible episodes when following policy $\pi$.

It is computed by considering all possible trajectories $\tau$ and taking the weighted average (or integral) of each trajectory's total return $R(\tau)$, weighted by the probability $P(\tau|\pi)$ of that trajectory occurring under policy $\pi$.

$$J(\pi) = \int P(\tau|\pi) R(\tau) = E_{\tau \sim \pi}[R(\tau)]$$
  • $E_{\tau \sim \pi}[\cdot]$ denotes the expected value when the trajectory $\tau$ is generated according to policy $\pi$.
  • $R(\tau)$ is the total return (reward) obtained along a single trajectory $\tau$.
  • $P(\tau|\pi)$ is the probability that a specific trajectory $\tau$ occurs when the agent follows policy $\pi$.

A trajectory $\tau$ is a sequence of states and actions experienced by the agent, starting from an initial state. It is one possible "story" or "path" of the agent's interaction with the environment.

$$\tau = (s_0, a_0, s_1, a_1, s_2, a_2, \dots)$$
  • $s_t$: state at time step $t$.
  • $a_t$: action taken at time step $t$ (usually based on state $s_t$ and policy $\pi$).

We typically model the environment as stochastic: executing the same action $a_t$ in the same state $s_t$ does not always lead to exactly the same next state $s_{t+1}$. Randomness is involved.

The next state $s_{t+1}$ is drawn from a probability distribution conditioned on the current state $s_t$ and the action taken $a_t$:

$$s_{t+1} \sim P(\cdot \mid s_t, a_t)$$

Taking stochastic state transitions and the agent's policy into account, we can compute the probability of an entire trajectory. It is the product of the following terms:

  1. The probability of the agent starting in the initial state $s_0$: $p_0(s_0)$.
  2. For each time step $t$ in the trajectory:
    • the probability of the environment transitioning to state $s_{t+1}$ given $s_t$ and $a_t$: $P(s_{t+1}|s_t, a_t)$;
    • the probability of the agent selecting action $a_t$ in state $s_t$ according to its policy: $\pi(a_t|s_t)$.

$$P(\tau|\pi) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t)$$

(where $T$ is the length of the trajectory).

When calculating the total return $R(\tau)$ of a trajectory, we almost always use discounted rewards. This means that rewards received earlier are worth more than rewards received later.

Why?

  • Reflects real-world scenarios (a dollar today is worth more than a dollar tomorrow).
  • Avoids infinite return problems in continuous tasks (tasks without a fixed endpoint).
  • Provides mathematical convenience.

We introduce a discount factor $\gamma$, where $0 \le \gamma < 1$. The closer $\gamma$ is to 0, the more "short-sighted" the agent is (it cares mostly about immediate rewards); the closer $\gamma$ is to 1, the more "far-sighted" the agent is (it cares more about long-term return).

The total discounted return of a trajectory is calculated as follows:

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$$
  • $r_t$ is the immediate reward received at time step $t$.
  • $\gamma^t$ is the discount coefficient applied to the reward at time step $t$.
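As a tiny illustration (not from the course), the discounted return of a finite reward list can be computed directly from this definition:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for a finite list of rewards."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# discounted_return([1.0, 0.0, 2.0], gamma=0.9) == 1.0 + 0.9*0.0 + 0.81*2.0 == 2.62
```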

So what is a trajectory for an LLM? As mentioned earlier, the model is the policy, the prompt is the state, and the next token is the action, so the sequence of (state, action) pairs produced during autoregressive generation makes up the trajectory.

(Figure: autoregressive generation viewed as a trajectory of prompt states and token actions.)

Policy Gradient

We have defined the goal of reinforcement learning: find an optimal policy $\pi^*$ that maximizes the expected return $J(\pi)$. Great. But how do we actually represent and find this policy?

Usually, especially for complex problems, we don't search over all possible policies. We define a parameterized policy, denoted $\pi_\theta$. You can think of $\theta$ as a set of "knobs" or parameters: if our policy is a neural network, then $\theta$ is the network's weights and biases.

Our goal now becomes: how do we adjust these knobs $\theta$ to maximize our expected return?

[!note] Under the policy $\pi_\theta$ with parameters $\theta$, the expected return over all possible trajectories is:

$$J(\pi_\theta) = E_{\tau \sim \pi_\theta}[R(\tau)]$$

The expected return depends on the trajectories $\tau$, and the distribution of trajectories depends on the actions selected by our policy $\pi_\theta$. Changing $\theta$ changes the policy, which changes the trajectories, which changes the expected return.

[!note] We want to maximize $J(\pi_\theta)$ by changing $\theta$. In deep learning, gradient descent is generally used to minimize a loss function; here we want to maximize a function $J$, so we use gradient ascent instead. It's like climbing a hill: find the steepest upward direction (the gradient) and take a step in that direction.

Our policy $\pi_\theta$ is a neural network, and we iteratively adjust its parameters $\theta$ to increase $J(\pi_\theta)$. The update rule will look very familiar (just replace the minus sign of gradient descent with a plus sign):

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$$
  • $\theta_k$: parameters at the $k$-th iteration.
  • $\alpha$: learning rate (step size).
  • $\nabla_\theta J(\pi_\theta)\big|_{\theta_k}$: the gradient of the expected return $J$ with respect to the parameters $\theta$, evaluated at the current parameters $\theta_k$. It tells us which direction in parameter space increases $J$ the fastest.

(Figure: gradient ascent on the expected return.)

[!important] PG Derivation: I'll be a bit verbose at the start and re-introduce all the notation; an ADHD-friendly derivation…

Step 1: restate that the object we are taking the gradient of is the expected return $J(\pi_\theta)$:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta E_{\tau \sim \pi_\theta}[R(\tau)]$$

Here:

  • $J(\pi_\theta)$ is the expected return.
  • $E_{\tau \sim \pi_\theta}[\cdot]$ denotes the expected value, computed over all possible trajectories $\tau$. A trajectory $\tau$ is a series of states and actions generated by the interaction between the agent and the environment, $(s_0, a_0, s_1, a_1, \dots)$.
  • $\tau \sim \pi_\theta$ indicates that these trajectories are generated according to our current policy $\pi_\theta$.
  • $R(\tau)$ is the total return (usually the discounted return) obtained along a complete trajectory $\tau$.
  • $\nabla_\theta$ is the gradient operator, indicating that we take partial derivatives with respect to the parameters $\theta$.

Step 2: expand the expectation. What is the definition of an expected value? For a random variable $X$, its expectation $E[X]$ can be computed from its probability distribution $p(x)$:

  • continuous variable: $E[X] = \int p(x)\, x\, dx$
  • discrete variable: $E[X] = \sum p(x)\, x$

In our case, the random variable is the trajectory return $R(\tau)$, and the probability distribution is the probability of the trajectory occurring, $P(\tau|\pi_\theta)$ (the probability of trajectory $\tau$ given policy $\pi_\theta$). So the expectation can be written as an integral (or a sum):

$$E_{\tau \sim \pi_\theta}[R(\tau)] = \int P(\tau|\pi_\theta)\, R(\tau)\, d\tau$$

(The integral sign $\int$ here stands for summing or integrating over all possible trajectories, which is the more general notation.) Substituting this into the formula from Step 1:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \int P(\tau|\pi_\theta)\, R(\tau)\, d\tau$$


Step 3: move the gradient operator inside the integral.

$$\nabla_\theta \int P(\tau|\pi_\theta)\, R(\tau)\, d\tau = \int \nabla_\theta \left[ P(\tau|\pi_\theta)\, R(\tau) \right] d\tau$$

This needs a bit of calculus: under certain regularity conditions (usually assumed to hold in reinforcement learning), we can exchange the order of differentiation and integration, just as $\frac{d}{dx} \sum f_i(x) = \sum \frac{d}{dx} f_i(x)$. Next, notice that $R(\tau)$ is the total return of a fixed trajectory; its value does not directly depend on the policy parameters $\theta$. (The policy $\pi_\theta$ influences which trajectory happens, not what its return is once that trajectory has happened.) So the gradient $\nabla_\theta$ only acts on $P(\tau|\pi_\theta)$:

$$= \int \left[ \nabla_\theta P(\tau|\pi_\theta) \right] R(\tau)\, d\tau$$

This step tells us that the change in expected return is the effect of the parameters $\theta$ changing each trajectory's probability $P(\tau|\pi_\theta)$, multiplied by that trajectory's return $R(\tau)$, accumulated over all trajectories.

Step 4: the log-derivative trick. This is the most central and ingenious step in the entire derivation. We need an identity.

  • Calculus review (chain rule and logarithmic differentiation): recall that for the natural logarithm $\log(x)$ (i.e. $\ln(x)$), $\frac{d}{dx} \log(f(x)) = \frac{1}{f(x)} \frac{d f(x)}{dx} = \frac{f'(x)}{f(x)}$.
  • With a slight rearrangement, we get $f'(x) = f(x) \frac{d}{dx} \log(f(x))$. Now apply this trick to the gradient: let $f(x)$ correspond to $P(\tau|\pi_\theta)$ and the independent variable $x$ correspond to the parameters $\theta$. Then:

$$\nabla_\theta P(\tau|\pi_\theta) = P(\tau|\pi_\theta)\, \nabla_\theta \log P(\tau|\pi_\theta)$$

Substituting this result into the integral from Step 3:

$$\int \left[ \nabla_\theta P(\tau|\pi_\theta) \right] R(\tau)\, d\tau = \int \left[ P(\tau|\pi_\theta)\, \nabla_\theta \log P(\tau|\pi_\theta) \right] R(\tau)\, d\tau$$


Step 5: transform back into an expectation. Observe the result from Step 4:

$$\int P(\tau|\pi_\theta) \left[ \nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau) \right] d\tau$$

This matches the definition of an expectation again: $E[f(\tau)] = \int P(\tau|\pi_\theta) f(\tau)\, d\tau$, where $f(\tau)$ corresponds to everything inside the brackets, $\nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau)$. So the whole integral can be written back as an expectation:

$$= E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau) \right]$$

This is significant: we have converted the gradient of an expectation, $\nabla_\theta E[\cdot]$, into the expectation of a certain quantity (a gradient times a return), $E[\nabla(\cdot) \times R]$. This form is very important because it can be approximated by sampling. We don't need to compute the integral over all trajectories; we just sample many trajectories $\tau$, compute the bracketed quantity $\nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau)$ for each, and average them to get an approximation of the gradient.


Step 6: an expression for the grad-log-prob term. Now we handle the term $\nabla_\theta \log P(\tau|\pi_\theta)$ inside the expectation. Recall the trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ (with $T+1$ states and $T+1$ actions). The probability of a trajectory occurring is:

$$P(\tau|\pi_\theta) = p_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_\theta(a_t|s_t)$$

  • $p_0(s_0)$: probability of the initial state $s_0$.

  • $P(s_{t+1}|s_t, a_t)$: the environment dynamics, i.e. the probability of transitioning to state $s_{t+1}$ after executing action $a_t$ in state $s_t$.

  • $\pi_\theta(a_t|s_t)$: the policy, i.e. the probability of selecting action $a_t$ in state $s_t$ (this is the part that depends on $\theta$).

  • Math review (logarithm properties): $\log(a \times b) = \log a + \log b$ and $\log(\prod_i x_i) = \sum_i \log x_i$. Taking the logarithm of $P(\tau|\pi_\theta)$:

$$\log P(\tau|\pi_\theta) = \log p_0(s_0) + \sum_{t=0}^{T} \left[ \log P(s_{t+1}|s_t, a_t) + \log \pi_\theta(a_t|s_t) \right]$$

Now take the gradient $\nabla_\theta$ of the above with respect to $\theta$:

  • Math review (gradient properties): the sum rule $\nabla(f+g) = \nabla f + \nabla g$; the gradient $\nabla_\theta$ only affects terms that depend on $\theta$.

  • $\nabla_\theta \log P(\tau|\pi_\theta) = \nabla_\theta \log p_0(s_0) + \sum_{t=0}^{T} \left[ \nabla_\theta \log P(s_{t+1}|s_t, a_t) + \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$

  • Key points:

    • The initial-state probability $\log p_0(s_0)$ typically does not depend on the policy parameters $\theta$, so $\nabla_\theta \log p_0(s_0) = 0$.
    • The environment dynamics $\log P(s_{t+1}|s_t, a_t)$ describe properties of the environment itself and also do not depend on $\theta$, so $\nabla_\theta \log P(s_{t+1}|s_t, a_t) = 0$.
    • Only the policy term $\log \pi_\theta(a_t|s_t)$ depends on $\theta$. So the expression simplifies to:

$$\nabla_\theta \log P(\tau|\pi_\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

The gradient of the log-probability of the entire trajectory equals the sum of the log-probability gradients of each action step in that trajectory. This greatly simplifies the calculation.


Step 7: the final Policy Gradient Theorem. Substitute the simplified result from Step 6 back into the expectation formula from Step 5:

$$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) R(\tau) \right]$$

This is the final form (or one common form) of the Policy Gradient Theorem. The gradient of the expected return $J(\pi_\theta)$ with respect to the parameters $\theta$ equals the expectation, over all possible trajectories, of "sample a trajectory $\tau$, compute its total return $R(\tau)$, and multiply it by the sum of the policy log-probability gradients $\nabla_\theta \log \pi_\theta(a_t|s_t)$ over all (state, action) pairs in that trajectory".

Obviously, obtaining all trajectories is extremely costly (imagine sampling every possible generation of max_token_length=100), so we use the sample mean to approximate the expectation:

[!note] Monte Carlo Approximation:
  • Run the current policy $\pi_\theta$ and collect $N$ trajectories, forming a dataset $D = \{\tau_1, \dots, \tau_N\}$ (with $N = |D|$).
  • Approximate the expectation with the average over these samples:
  • $\nabla_\theta J(\pi_\theta) \approx \hat{g} = \frac{1}{|D|} \sum_{\tau \in D} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) R(\tau)$
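In code, this Monte Carlo estimate is what the classic REINFORCE surrogate loss produces; a sketch (illustrative only, not the course's code) where log_probs holds $\log \pi_\theta(a_t|s_t)$ for one sampled trajectory:

```python
def reinforce_loss(log_probs, trajectory_return):
    """Negative of (sum_t log pi_theta(a_t|s_t)) * R(tau): minimizing this
    with gradient descent performs gradient ascent on J(pi_theta)."""
    return -(log_probs.sum() * trajectory_return)

# Averaging this loss over N sampled trajectories and calling .backward()
# yields (minus) the Monte Carlo estimate g_hat of the policy gradient.
```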

Application to LM Policy

Through the generation process shown in the figure, we obtain the log-probability of each state-action pair in the sampled trajectory, and we can now backpropagate to compute the gradient.

(Figure: collecting per-token log-probabilities during generation.)

Then each log-probability gradient is multiplied by the reward from the Reward Model and plugged into the expression above to run gradient-ascent optimization:

(Figure: policy-gradient update weighted by the reward from the Reward Model.)

High Variance

PG algorithms work well for small problems but run into issues when applied to language modeling.

[!note] The central limit theorem tells us that, as long as the sample is large enough, the sample mean is approximately normally distributed, which lets us predict and analyze the data. When the sample size is small, the sample mean fluctuates a lot: even though the mean tends toward a normal distribution, the result of a single sampling run may vary greatly. And since sampling many trajectories from an LM is very expensive, the estimator suffers from a high-variance problem.

How to reduce variance without increasing sample size?

  1. Remove historical rewards (reward-to-go). First, we must admit that the current action cannot affect rewards already received in the past, and those past rewards only add unnecessary noise (this is related to the credit-assignment problem in RL). Removing the past terms therefore avoids extra noise and brings the estimated gradient closer to the true gradient. So instead of weighting by the whole trajectory return, we only consider the rewards from the current time step onward.

(Figure: replacing the full-trajectory return with the reward-to-go from the current time step.)

  2. Introduce a baseline. RL research has confirmed that subtracting a state-dependent term (such as a function of the state, or a constant) reduces variance. Here we choose the value function $V^\pi(s)$.

Value Function

$V^\pi(s)$ tells you the expected return of the remaining trajectory when following the current policy from state $s$.

Examples of value definitions in classic RL scenarios and LM scenarios:

(Figure: value definitions in a classic RL scenario vs. an LM scenario.)

In practice, we initialize from the LM we are trying to optimize and add a linear layer on top of it to estimate the value, so that the Transformer layers' parameters are shared between language modeling (via the layer projecting hidden states to the vocabulary) and value estimation.

(Figure: the LM backbone with an extra linear value head.)

The reward-to-go mentioned earlier is called the Q function in RL: the expected return of starting from the current state, taking this action, receiving the immediate reward, and then following the policy for the remaining actions:

(Figure: definition of the Q function.)

Then, by introducing the value function, we take the difference between Q and V; this difference is called the advantage function.

(Figure: definition of the advantage function.)

This advantage term $A^\pi(s_t, a_t)$ represents how much better this specific action is relative to the average action that could be taken in state $s_t$.

(Figure: a grid-world example where one action has a clearly higher advantage than the others.)

In the figure, the advantage function for moving downward from the state pointed by the red arrow will be higher than the advantage functions of other actions.

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$

After multiplying the gradient by the advantage function, the effect is to increase the log-prob of actions with high advantage and decrease the log-prob of actions that bring below-average return.

$$\begin{aligned} \hat{A}^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) \\ &= \left[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \right] - V^\pi(s_t) \end{aligned}$$

[!note] In traditional reinforcement-learning methods, the Q network and the V network are usually separate: the Q function estimates the total expected return of executing action $a$ in state $s$, while the V function only estimates the value of state $s$. This would require two different neural networks to compute the two values.

However, now that we have introduced the advantage function $A_\theta(s, a)$, computed as the difference between the Q value and the V value:

$$A_\theta(s, a) = Q_\theta(s, a) - V_\theta(s)$$

we find that we only need to train one network that outputs $V_\theta(s)$; the Q value can then be computed from the reward $r_t$ and the discount factor $\gamma$.

Therefore only one neural network is needed, primarily to predict $V_\theta(s)$. The Q value is obtained from:

$$Q_\theta(s_t, a) = r_t + \gamma \cdot V_\theta(s_{t+1})$$

and the advantage function is then:

$$A_\theta(s_t, a) = r_t + \gamma \cdot V_\theta(s_{t+1}) - V_\theta(s_t)$$

Advantage Sampling

Short-horizon advantage estimators have large bias but small variance, while long-horizon advantage estimators have small bias but large variance. This trade-off is a part of reinforcement learning that needs careful selection and tuning, depending on the required model stability and training efficiency.

(Figure: the bias-variance trade-off across advantage estimators with different horizon lengths.)

An analogy: "a person with short-term memory only remembers what happened yesterday; it is not comprehensive, but it is stable. A person with long-term memory can see the full picture of the next few days, but may be disturbed by more unknown factors."

GAE

To solve this bias-variance problem, GAE (Generalized Advantage Estimation) can be used, which is essentially a weighted sum of all advantage terms, with each term multiplied by a decay factor.

[!note] Now let's talk about the TD error. Online learning has one beauty: you don't need to wait until the end of an episode to update the policy. This is where the Temporal Difference error (TD error) comes in:

$$\delta = r + \gamma V(s') - V(s)$$

The key point: the TD error is actually an online estimate of the advantage function. It tells you whether your action at this moment makes the future state better than you expected. The error $\delta$ directly reflects the notion of advantage:

  • If $\delta > 0$: "Hey, this action is better than I thought!" (positive advantage).
  • If $\delta < 0$: "Well, I thought it would be better…" (negative advantage).

This lets you adjust your policy step by step without waiting for a whole episode to end before making changes, which is simply an excellent way to improve efficiency.

The purpose of GAE is to provide an estimate $\hat{A}_t$ of the advantage function $A^\pi(s, a)$ that is better, for policy-gradient algorithms, than the raw return $R(\tau)$ or the simple TD error $\delta_t$, reducing the variance of the gradient estimate and improving learning stability and efficiency.

$$\begin{aligned} \delta_t &= r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \\ \hat{A}_t &= \delta_t + \gamma \lambda \hat{A}_{t+1} \end{aligned}$$

This formula recursively defines the generalized advantage estimate $\hat{A}_t$. It doesn't just look at the one-step TD error $\delta_t$; it combines TD-error information from multiple future steps.

The recursion is computed backwards from the end of the trajectory (episode), assuming $T$ is the last step and $\hat{A}_{T+1} = 0$:

  • $\hat{A}_T = \delta_T + 0 = \delta_T$
  • $\hat{A}_{T-1} = \delta_{T-1} + \gamma\lambda \hat{A}_T = \delta_{T-1} + (\gamma\lambda)\delta_T$
  • $\hat{A}_{T-2} = \delta_{T-2} + \gamma\lambda \hat{A}_{T-1} = \delta_{T-2} + (\gamma\lambda)\delta_{T-1} + (\gamma\lambda)^2 \delta_T$
  • …
  • General form: $\hat{A}_t = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}$ (assuming an infinite horizon, or $\delta = 0$ after the terminal state).

The parameter $\lambda$ ($0 \le \lambda \le 1$) is the key to GAE, controlling the bias and variance of the estimate $\hat{A}_t$:

  • When $\lambda = 0$:
    • $\hat{A}_t = \delta_t$: GAE degenerates into the simple one-step TD error. This estimate has lower variance (it only depends on the next step) but may have higher bias (it relies heavily on the possibly inaccurate estimate $V^\pi(s_{t+1})$).
  • When $\lambda = 1$:
    • $\hat{A}_t = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$: it can be shown that this is equivalent to $\hat{A}_t = \left( \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right) - V^\pi(s_t)$, i.e. the Monte Carlo return minus a baseline. This estimate has lower bias (it uses the complete actual return from time $t$ onward), but its variance is usually very high (it accumulates randomness from many time steps).
  • When $0 < \lambda < 1$:
    • GAE interpolates between the two extremes. The closer $\lambda$ is to 0, the more it leans toward the low-variance, high-bias TD estimate; the closer $\lambda$ is to 1, the more it leans toward the high-variance, low-bias MC estimate.
    • By choosing an appropriate $\lambda$ (e.g. 0.97), GAE tries to strike a good balance between bias and variance, yielding an advantage estimate that is reasonably accurate (controlled bias) and reasonably stable (small variance).
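The backward recursion above is short enough to write out directly; a minimal sketch (illustrative, not the TRL implementation), assuming values holds $V(s_0), \dots, V(s_T)$ plus a bootstrap value for $s_{T+1}$:

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: tensor [T]; values: tensor [T+1] including the bootstrap V(s_{T+1}).
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = delta_t + gamma * lam * A_{t+1}."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    return advantages
```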

Advantage in Language Models

As shown in the figure, the goal is to increase the logprob of token “Shanghai” in the current state and decrease the logprob of “chocolate”, because the advantage of choosing “Shanghai” is higher than the advantage of choosing “chocolate” (gibberish token).

(Figure: increasing the log-prob of a high-advantage token ("Shanghai") and decreasing that of a low-advantage token ("chocolate").)

Importance Sampling and Offline Learning

In many cases, we may want to compute $E_{x \sim p(x)}[f(x)]$, but:

  1. it is difficult or impossible to sample $x$ directly from the target distribution $p(x)$, or
  2. sampling from $p(x)$ is inefficient. This is exactly the problem with language models, where LM sampling is too expensive.

However, we might be able to easily sample from another, alternative Proposal distribution $q(x)$.

Importance Sampling (IS) is a technique for estimating expectations under a target distribution by sampling from a different distribution: it converts an expectation $E_{x \sim p(x)}[f(x)]$ computed under the probability distribution $p(x)$ into the expectation of a related function, $E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]$, computed under a different distribution $q(x)$.

$$\begin{aligned} E_{x \sim p(x)}[f(x)] &= \int p(x) f(x)\, dx \\ &= \int \frac{q(x)}{q(x)} p(x) f(x)\, dx \quad \text{(assuming } q(x) \neq 0\text{)} \\ &= \int q(x) \frac{p(x)}{q(x)} f(x)\, dx \\ &= E_{x \sim q(x)}\left[ \frac{p(x)}{q(x)} f(x) \right] \end{aligned}$$

The key here is the importance weight $w(x) = \frac{p(x)}{q(x)}$. Its role is bias correction: for a sample $x_i$ drawn from $q(x)$, if it is more likely under the target distribution $p(x)$ (i.e. $p(x_i) > q(x_i)$), give it a weight greater than 1; conversely, if it is less likely under $p(x)$ (i.e. $p(x_i) < q(x_i)$), give it a weight smaller than 1. Averaging with these weights yields a (usually unbiased or consistent) estimate of the original expectation $E_{x \sim p(x)}[f(x)]$.

Importance sampling allows us to:

  1. Draw samples $x_1, x_2, \dots, x_N$ from an easy-to-sample distribution $q(x)$.
  2. Estimate the original expectation via a weighted average:
    $E_{x \sim p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i)$
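As a toy sanity check (not from the course): estimate $E_{x \sim p}[x^2]$ for a standard normal $p$ using samples drawn only from a wider normal $q$:

```python
import torch

torch.manual_seed(0)
p = torch.distributions.Normal(0.0, 1.0)   # target distribution p(x)
q = torch.distributions.Normal(0.0, 2.0)   # proposal distribution q(x), easy to sample from

x = q.sample((100_000,))                   # samples come from q, not p
w = (p.log_prob(x) - q.log_prob(x)).exp()  # importance weights p(x) / q(x)
estimate = (w * x**2).mean()               # approximates E_{x~p}[x^2] = 1
print(estimate)                            # close to 1.0
```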

Back to our scenario. Previously we obtained the on-policy policy-gradient estimate:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$

[!info] Meaning of on-policy: the policy used to collect data and the policy being trained are the same, since the computation requires trajectories sampled from the current policy $\pi_\theta$. This means that after every policy update the old data can no longer be used directly, which gives low sample efficiency. As for whether the data used for later updates is still on-policy when mini_batch_num > 1: strictly speaking it isn't, so it could also be understood as "semi-on-policy" (an informal, not necessarily rigorous way of putting it).

On-policy also emphasizes whether it is the current policy model that interacts with the environment. [2]

We hope to use data generated in the past by an old policy $\pi_{\theta_{OFFLINE}}$ (such data may exist in large quantities) to estimate the gradient of the current new policy $\pi_{\theta_{ONLINE}}$. This allows data to be reused and improves sample efficiency.

Recall the importance-sampling principle: $E_{x \sim p(x)}[f(x)] = E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]$. Mapping it onto our policy gradient (simplified to a single-step decision):

  • The target distribution $p(x)$ corresponds to the new policy $\pi_{\theta_{ONLINE}}(a|s)$.
  • The sampling distribution $q(x)$ corresponds to the old policy $\pi_{\theta_{OFFLINE}}(a|s)$.
  • The importance weight is $w_t = \frac{\pi_{\theta_{ONLINE}}(a_t|s_t)}{\pi_{\theta_{OFFLINE}}(a_t|s_t)}$.

Applying the importance weight to each term (each time step $t$) of the on-policy gradient yields the standard off-policy estimate:

$$\nabla_{\theta_{ONLINE}} J(\theta_{ONLINE}, \theta_{OFFLINE}) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \left[ \frac{\pi_{\theta_{ONLINE}}(a_{i,t}|s_{i,t})}{\pi_{\theta_{OFFLINE}}(a_{i,t}|s_{i,t})}\, \nabla_{\theta_{ONLINE}} \log \pi_{\theta_{ONLINE}}(a_{i,t}|s_{i,t})\, A^\pi(s_{i,t}, a_{i,t}) \right]$$

We now have a way to perform a full gradient-ascent optimization without sampling from the policy we are optimizing (the model being trained) at every step: sample once, store the trajectories in memory or a database, optimize the policy with mini-batches, and then re-initialize the offline (sampling) policy with the new policy.

PPO Loss

The PPO loss mainly consists of three parts: the policy loss ($L_{POLICY}$), the value-function loss ($L_{VF}$), and the entropy bonus ($L_{ENTROPY}$).

1. Policy Loss ($L_{POLICY}$)

Clipped Surrogate Objective:

$$L_{POLICY} = \min\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t,\ \text{clip}\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right)$$

This is the core of PPO. You will notice that it looks a lot like the off-policy policy-gradient objective we derived using importance sampling, but with one key modification.

  • $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$: this is the importance-sampling ratio; call it $r_t(\theta)$. It is the probability of taking action $a_t$ in state $s_t$ under the current (online) policy $\pi_\theta$, divided by the probability of taking that action under the old (offline) policy $\pi_{\theta_{old}}$ that was used to collect the trajectory data. This ratio corrects for the fact that the data comes from a policy slightly different from the one we are currently trying to improve.

  • $\hat{A}_t$: the advantage estimate, computed using GAE, which helps balance bias and variance. It tells us how much better or worse taking action $a_t$ in state $s_t$ is compared to the average action in that state (as judged by the current value function).

  • clip function: this is where the key point of PPO lies. $\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ basically says: if the probability ratio $r_t(\theta)$ deviates too far from 1 (too high or too low), we "clip" it. So if $r_t(\theta)$ tries to become 1.5 and $\epsilon$ is 0.2, it is clipped to 1.2; if it tries to become 0.5, it is clipped to 0.8. The parameter $\epsilon$ (epsilon) is a small hyperparameter (e.g. 0.1 or 0.2) that defines the clipping range $[1-\epsilon, 1+\epsilon]$.

  • min function: the objective takes the smaller of the following two terms:

    1. Unclipped objective: $r_t(\theta)\hat{A}_t$

    2. Clipped objective: $\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t$

      Why do this? The policy gradient wants to increase the probability of actions with positive advantage and decrease the probability of actions with negative advantage. But with importance sampling, if $r_t(\theta)$ becomes very large, updates can be huge and unstable. PPO keeps the new policy close to the old one by clipping this ratio. A sketch of the resulting loss in code is shown after this list.

    • If $\hat{A}_t > 0$ (good action): we want to increase $\pi_\theta(a_t|s_t)$. The min means that once $r_t(\theta)$ grows beyond $1+\epsilon$, the objective is capped at $(1+\epsilon)\hat{A}_t$. This prevents the policy from changing too much in a single update, even if the unclipped objective would suggest a larger increase.
    • If $\hat{A}_t < 0$ (bad action): we want to decrease $\pi_\theta(a_t|s_t)$. If $r_t(\theta)$ shrinks below $1-\epsilon$, the clipped term $(1-\epsilon)\hat{A}_t$ is the smaller one and it is constant in $\theta$, so the gradient vanishes: once the probability of a bad action has already been lowered enough, we stop pushing it down further. If instead $r_t(\theta)$ grows above $1+\epsilon$ (the new policy makes a bad action more likely), the min selects the unclipped, more negative term $r_t(\theta)\hat{A}_t$, keeping the objective a pessimistic lower bound that strongly penalizes that move.
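Here is a minimal PyTorch sketch of the clipped surrogate objective, written per token and negated into a loss to minimize; logprob, logprob_old, and advantages are assumed to be precomputed tensors of the same shape:

```python
import torch

def ppo_policy_loss(logprob, logprob_old, advantages, eps=0.2):
    ratio = torch.exp(logprob - logprob_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # maximize objective = minimize its negative
```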

2. Value Function Loss ($L_{VF}$)

$$L_{VF} = \frac{1}{2} \left\| V_\theta(s) - \left( \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\Big|\, s_0 = s \right) \right\|_2^2$$

This is exactly the same as before:

  • $V_\theta(s)$ is the output of the value network (i.e. a linear layer added on top of the LLM to predict the expected cumulative reward starting from state $s$).
  • The term $\sum \gamma^{t'} r_{t'}$ (call it $G_s$, the target value) is the actual sum of discounted rewards observed starting from state $s$ and following the current policy until the end of the episode. This is the empirical target we set for $V_\theta(s)$.
  • This loss is the mean squared error (MSE) between the predicted value $V_\theta(s)$ and the observed target value $G_s$. We want the value function to predict future rewards accurately; this value function is crucial for computing the advantage $\hat{A}_t$.

3. Entropy Bonus ($L_{ENTROPY}$)

$$L_{ENTROPY} = -\sum_x p(x) \log p(x)$$
  • Here $p(x)$ (more precisely $\pi_\theta(a|s)$, over all possible actions $a$ given the state $s$) is the action probability distribution output by the current policy in the given state.
  • The term $\sum_x p(x) \log p(x)$ is (the negative of) the entropy of this distribution. Entropy measures the randomness or uncertainty of a distribution: a uniform (very random) distribution has high entropy, while a peaked distribution (very certain about one action) has low entropy.
  • In the total loss $L_{PPO}$, this term is weighted (with $c_2$ positive) so that minimizing the loss effectively maximizes the entropy of the policy, i.e. low entropy is penalized.

Encouraging higher entropy promotes exploration: the policy stays a bit more random and tries different actions (different tokens, in the LLM case) instead of converging too quickly to a possibly suboptimal deterministic policy. This helps the agent discover better strategies.

Final Form: $L_{PPO}$

The final PPO loss is the weighted sum of these three parts:

$$L_{PPO} = L_{POLICY} + c_1 L_{VF} + c_2 L_{ENTROPY}$$
  • $c_1 L_{VF}$: the value-function loss, weighted by $c_1$. A common value for $c_1$ is around 0.5.
  • $c_2 L_{ENTROPY}$: the entropy bonus (for $c_2 > 0$, effectively a penalty on low entropy), weighted by $c_2$. $c_2$ is usually a small positive constant (e.g. 0.01), enough to encourage exploration without overwhelming the main policy objective.

The agent's parameters (i.e. the LLM weights) are updated by computing the gradient of this combined loss $L_{PPO}$ and performing gradient descent.
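Putting the three terms together, a sketch only (sign conventions vary between implementations; here everything is written as a loss to minimize, and the inputs are assumed to be precomputed tensors):

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(logprob, logprob_old, advantages, values, returns, logits,
                   c1=0.5, c2=0.01, eps=0.2):
    # Policy term: clipped surrogate objective, negated so minimizing it maximizes the objective.
    ratio = torch.exp(logprob - logprob_old)
    policy_loss = -torch.min(ratio * advantages,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    # Value term: MSE between predicted values V_theta(s) and empirical returns.
    value_loss = 0.5 * F.mse_loss(values, returns)
    # Entropy term: negated entropy, so minimizing it keeps the policy's entropy high.
    entropy_loss = -torch.distributions.Categorical(logits=logits).entropy().mean()
    return policy_loss + c1 * value_loss + c2 * entropy_loss
```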

Reference Model

Reward Hacking

A major problem in RL is reward hacking: the model may learn to always output tokens or sequences that earn a good reward but make no sense to humans, such as saying "thank you" ten times in a row to boost a politeness score. We therefore want the aligned model's output (after RL post-training) to stay as close as possible to the original model's output.

Therefore there is another model with frozen weights (the reference model). When the model we are optimizing receives a reward from the reward model at each step of each trajectory, a KL-divergence penalty between the log-probs of the reference model and the optimized model is subtracted from that reward, preventing the model from generating answers that drift too far from the original model and thus preventing the cheating phenomenon described above.
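A per-token sketch of this penalty (illustrative only; real implementations differ in details such as the KL estimator used): the per-token reward is minus beta times the log-prob gap to the reference model, with the reward-model score added on the final token:

```python
def penalized_rewards(reward_score, actor_logprobs, ref_logprobs, beta=0.1):
    """reward_score: scalar reward from the RM for the whole response.
    actor_logprobs, ref_logprobs: per-token log-prob tensors [T] of the generated response."""
    kl = actor_logprobs - ref_logprobs   # per-token log-ratio, a simple KL estimate
    rewards = -beta * kl                 # penalize drifting away from the reference model
    rewards[-1] += reward_score          # the sequence-level reward arrives at the final token
    return rewards
```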

(Figure: per-token rewards with the KL penalty against the frozen reference model.)

Code walk through

trl

class AutoModelForCausalLMWithValueHead(PreTrainedModelWrapper):
    # ... (class attributes like transformers_parent_class) ...

The core purpose of this class is to bundle a standard causal language model (our Actor model, the policy $\pi_\theta$ responsible for generating text) with a Value Head (our Critic model, responsible for estimating the state value $V(s)$). In PPO / actor-critic algorithms we need both the policy and the value function, and this class provides a unified model structure that outputs both simultaneously.

    def __init__(self, pretrained_model, **kwargs):
        super().__init__(pretrained_model, **kwargs) # Basic settings
        v_head_kwargs, _, _ = self._split_kwargs(kwargs) # Separate args for ValueHead
 
        # Ensure input uses a model with language model output capability
        if not any(hasattr(self.pretrained_model, attribute) for attribute in self.lm_head_namings):
            raise ValueError("The model does not have a language model head...")
 
        # Create ValueHead instance, which will learn to predict state value V(s)
        self.v_head = ValueHead(self.pretrained_model.config, **v_head_kwargs)
 
        # Initialize ValueHead weights
        self._init_weights(**v_head_kwargs) # Default random init, can also specify normal distribution init
  1. Actor: this is our language model pretrained_model, which generates responses (action $a$, i.e. a sequence of tokens) based on the current prompt (state $s$).
  2. Critic: evaluates how good the Actor's situation is in a given state $s$, i.e. outputs $V(s)$. This is the job of the linear layer self.v_head.
    def forward(
        self,
        input_ids=None, # Input token IDs (state s)
        attention_mask=None,
        past_key_values=None, # For speeding up generation
        **kwargs,
    ):
        # Force underlying model to output hidden_states, ValueHead needs them as input
        kwargs["output_hidden_states"] = True
        # ... (some details of processing past_key_values and PEFT, can be ignored in core PPO understanding)
 
        # 1. Actor (Base Language Model) calculation
        base_model_output = self.pretrained_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **kwargs,
        )
 
        # 2. Extract Actor output (for policy update) and Critic input
        lm_logits = base_model_output.logits # Actor output: probability distribution predicting next token
        # This is the basis for calculating L_POLICY and L_ENTROPY in PPO
 
        last_hidden_state = base_model_output.hidden_states[-1] # Critic input: hidden state of last layer of LM,
        # Represents the representation of current state s
 
        # (Optional) Language model's own loss, usually not directly used in RL stage
        loss = base_model_output.loss
 
        # (Ensure data and model are on same device)
        if last_hidden_state.device != self.v_head.summary.weight.device:
            last_hidden_state = last_hidden_state.to(self.v_head.summary.weight.device)
 
        # 3. Critic (ValueHead) calculation
        # ValueHead receives state representation, outputs value estimation V(s) for that state
        value = self.v_head(last_hidden_state).squeeze(-1) # This is the basis for calculating value loss L_VF and advantage A_hat in PPO
 
        # (Ensure logits are float32 for numerical stability)
        if lm_logits.dtype != torch.float32:
            lm_logits = lm_logits.float()
 
        # Return Actor's logits, LM loss (may be None), and Critic's value
        return (lm_logits, loss, value)

For every step of PPO-RLHF training:

  1. We input a batch of current prompts (sequence input_ids) into the model.
  2. self.pretrained_model (the Actor) computes lm_logits during the rollout. These logits represent the probability distribution over which token the model thinks should be generated next given the current prompt. PPO's policy loss $L_{POLICY}$ and entropy bonus $L_{ENTROPY}$ are both computed from this distribution $\pi_\theta(a_t|s_t)$.
  3. At the same time, we take last_hidden_state from base_model_output. This can be seen as a vector representation of the current prompt (state $s$).
  4. This last_hidden_state is fed into self.v_head (the Critic), which outputs a scalar value. This value is the model's value estimate $V_\theta(s)$ for the current state $s$. PPO's value-function loss $L_{VF}$ optimizes this $V_\theta(s)$ to be as close as possible to the true return, and $V_\theta(s)$ is also a key ingredient for computing the advantage $\hat{A}_t$, which in turn guides the calculation of $L_{POLICY}$.
  5. The same prompt + response sequence is fed to the Reward and Reference models for inference, to obtain the reward and the log-probs (used for the KL penalty).
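Schematically, one PPO-RLHF iteration then looks like the pseudocode below (placeholder helpers such as generate, penalize_with_kl, and compute_gae_and_returns, not a real framework API):

```python
# One PPO-RLHF iteration, schematically (placeholder helpers, not TRL's API).
for prompts in dataloader:
    # 1) Rollout: the Actor autoregressively generates responses (prefill + decode).
    responses = generate(actor_critic, prompts)
    sequences = concat(prompts, responses)

    # 2) One forward pass yields policy logits and values for the full sequences.
    logits, _, values = actor_critic(sequences)
    old_logprobs = logprobs_of(logits, responses)

    # 3) Frozen models score the same sequences (prefill only).
    scores = reward_model(sequences)                              # one scalar per sequence
    ref_logprobs = logprobs_of(ref_model(sequences), responses)

    # 4) KL-penalized per-token rewards, then GAE advantages and returns.
    token_rewards = penalize_with_kl(scores, old_logprobs, ref_logprobs)
    advantages, returns = compute_gae_and_returns(token_rewards, values)

    # 5) Several mini-batch epochs of PPO updates on the stored rollout.
    for _ in range(ppo_epochs):
        loss = ppo_total_loss(...)   # policy + value + entropy terms
        loss.backward(); optimizer.step(); optimizer.zero_grad()
```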

So with one forward call, we simultaneously obtain core information needed to update Actor (Strategy) and Critic (Value Function). The training flow can be understood with the help of the following diagram:

(Figure: the RLHF training pipeline.)

[!tip] In RLHF, only the Actor needs prefill + decode (full autoregressive generation) during experience collection (rollout); the other models only process already-generated responses to obtain log-probs, values, etc., so they only do prefill.

In addition, the Actor is involved in both training and inference (i.e. rollout), so it needs a training engine (such as Megatron, DeepSpeed, or FSDP) plus a rollout engine (such as SGLang or vLLM) to handle the two tasks; the Critic reuses the internal representations from the training forward pass for its value predictions at inference time, so it runs in the same training engine; and the Reference and Reward models only need an inference engine to produce log-probs and rewards. [3]

verl

verl, along with OpenRLHF and others, is an excellent RLHF framework; a good introductory guide is [AI Infra] VeRL Framework Introduction & Code Walkthrough [2].

Reference

  • [1] Umar Jamil: Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code. - YouTube
  • [2] [AI Infra] VeRL Framework Introduction & Code Walkthrough
  • [3] Chayenne Zhao: HybridFlow / veRL Paper Analysis

For commercial reuse, contact the site owner for authorization. For non-commercial use, please credit the source and link to this article.

You may copy, distribute, and adapt this work as long as derivatives share the same license. Licensed under CC BY-NC-SA 4.0.
