From RL to RLHF

This article is mainly based on Umar Jamil's course [1] and serves as my learning notes. Our goal is to align the behavior of LLMs with the outputs we expect, and RLHF is one of the best-known techniques for doing so. Its standard pipeline involves four models (this sounds memory-intensive, which is why many later methods drop some of them), but remember that a total of four are needed: the Reward, Actor, Critic, and Reference models; the model we ultimately optimize is the Actor Model.

LLM to RL#

Previously, my understanding of RL was that a policy tells you the probability of the action you should take in the current state. In this sense, a language model can itself be viewed as a policy: it receives a prompt (the state) and outputs a probability distribution over the next token (the action); sampling a token and appending it to the prompt yields a new state. It is therefore a policy with an action space of size vocab_size, and hence an RL agent.

So we still need something to provide rewards (in traditional RL this is usually a reward function built into the environment).

Creating a "Q-A-Reward" dataset can achieve this, but humans are not good at finding consensus; however, they excel at comparing advantages. Therefore, we shift our focus: the model generates multiple answers (A) under high temperature, and then we ask domain experts (who can be human or AI models) to select the chosen/preferred answer, labeling a preference dataset to train a reward model that generates numerical rewards.

Reward Model#

This RM is implemented using a pre-trained LLM like Llama.

Note

In text generation tasks, we feed the prompt through the Transformer, take the hidden state of the last token from the resulting hidden states, project it linearly onto the vocabulary to obtain logits, and then apply softmax and a sampling strategy to select the next token.

When we want to produce a numerical reward instead of text, we can replace the linear projection onto the vocabulary with a linear layer that has a single output feature (outputting a scalar), producing one score for the entire text sequence.

Screenshot 2025-04-23 at 21.19.56

Reward Model Loss#

Tip

During training, we want this model to generate high rewards for the chosen answers and low rewards for the unchosen answers.

Similar to the parameterized form of the Bradley-Terry model:

$$\text{Loss} = -\log \sigma(r(x, y_w) - r(x, y_l))$$

$y_w$ denotes the chosen answer, and $y_l$ the rejected one. Therefore, when the model gives a higher reward to the chosen answer:

$$r(x, y_w) - r(x, y_l) > 0$$

$$\sigma(r(x, y_w) - r(x, y_l)) \in (0.5, 1)$$

$$-\log \sigma(r(x, y_w) - r(x, y_l)) \in (0, 1)$$

Thus, the loss will be low, while if the model gives a low reward to the chosen answer, the loss will be very high.
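To make this concrete, here is a minimal PyTorch sketch of the pairwise loss (not the author's code); `reward_model` is assumed to map a batch of tokenized sequences to one scalar reward per sequence:

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # r(x, y_w), shape [batch]
    r_rejected = reward_model(rejected_ids)  # r(x, y_l), shape [batch]
    # -log sigmoid(r_w - r_l); logsigmoid is numerically more stable than log(sigmoid(.))
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```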


The RewardTrainer class in HuggingFace TRL accepts an AutoModelForSequenceClassification as input (which is exactly the model structure described above).

Screenshot 2025-04-23 at 22.02.29
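A hedged sketch of how this might be wired up; the checkpoint name and the two-row dataset are placeholders, and the exact `RewardTrainer`/`RewardConfig` argument names have shifted across trl versions:

```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)  # scalar head
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.config.pad_token_id = tokenizer.pad_token_id  # the classification head needs a pad token for batching

# Preference pairs: each row holds a preferred ("chosen") and a dispreferred ("rejected") answer.
train_dataset = Dataset.from_dict({
    "chosen":   ["Q: capital of France? A: Paris."],
    "rejected": ["Q: capital of France? A: Chocolate."],
})

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="rm-out", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older trl versions call this argument `tokenizer`
)
trainer.train()
```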

Actor & Critic Model#

Trajectories#

As mentioned earlier, the core objective of reinforcement learning (RL) is to find a policy ($\pi$) that guides the agent's actions to achieve the maximum possible expected return.

Mathematically, we express this as finding the optimal policy $\pi^*$ that maximizes the objective function $J(\pi)$:

$$\pi^* = \arg \max_{\pi} J(\pi)$$

The expected return $J(\pi)$ represents the average total return that the agent can accumulate over many possible lifetimes or episodes while following policy $\pi$.

The calculation method is: consider all possible trajectories ($\tau$) and weight the total return $R(\tau)$ of each trajectory by the probability $P(\tau|\pi)$ of that trajectory occurring under policy $\pi$ (then average or integrate).

$$J(\pi) = \int P(\tau|\pi) R(\tau)\, d\tau = E_{\tau \sim \pi} [R(\tau)]$$
  • $E_{\tau \sim \pi}[\cdot]$ indicates the expected value when trajectory $\tau$ is generated according to policy $\pi$.
  • $R(\tau)$ is the total return obtained on a single trajectory $\tau$.
  • $P(\tau|\pi)$ is the probability of a specific trajectory $\tau$ occurring when the agent uses policy $\pi$.

A trajectory $\tau$ is a sequence of states and actions experienced by the agent, starting from the initial state. It is one possible "story" or "path" of the agent's interaction with the environment.

$$\tau = (s_0, a_0, s_1, a_1, s_2, a_2, \dots)$$
  • $s_t$: state at time step $t$.
  • $a_t$: action taken at time step $t$ (usually based on state $s_t$ and policy $\pi$).

We typically model the environment as stochastic. This means that executing the same action $a_t$ in the same state $s_t$ does not always lead to exactly the same next state $s_{t+1}$; there is randomness involved.

The next state $s_{t+1}$ is drawn from a probability distribution conditioned on the current state $s_t$ and the action taken $a_t$:

$$s_{t+1} \sim P(\cdot \mid s_t, a_t)$$

Considering the stochastic state transitions and the agent's policy, we can calculate the probability of the entire trajectory occurring. It is obtained by multiplying the following components:

  1. The probability of the agent starting in the initial state $s_0$: $p_0(s_0)$.
  2. For each time step $t$ in the trajectory:
    • The probability of the environment transitioning to state $s_{t+1}$ given $s_t$ and $a_t$: $P(s_{t+1}|s_t, a_t)$.
    • The probability of the agent selecting action $a_t$ in state $s_t$ according to its policy: $\pi(a_t|s_t)$.

$$P(\tau|\pi) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t)$$

(where $T$ is the length of the trajectory).

When calculating the total return $R(\tau)$ for a trajectory, we almost always use discounted rewards. This means that rewards received earlier are more valuable than those received later.

Why?

  • It reflects real-world scenarios (a dollar in hand today is worth more than a dollar promised tomorrow).
  • It avoids the problem of infinite returns in ongoing tasks (tasks without a fixed endpoint).
  • It provides mathematical convenience.

We introduce a discount factor $\gamma$, where $0 \le \gamma < 1$. The closer $\gamma$ is to 0, the more "short-sighted" the agent is (focusing more on immediate benefits); the closer $\gamma$ is to 1, the more "far-sighted" the agent is (focusing more on long-term returns).

The total discounted return for the trajectory is calculated as follows:

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$$
  • $r_t$ is the immediate reward received at time step $t$.
  • $\gamma^t$ is the discount factor applied to the reward at time step $t$.
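As a tiny numeric sketch of this sum (rewards and $\gamma$ chosen arbitrarily):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for one sampled trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.5], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*0.5 = 1.405
```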

So, what is a trajectory in an LLM? As mentioned earlier, the model is the policy, the prompt is the state, and the next token is the action; the sequence of states and actions produced during autoregressive generation constitutes the trajectory.

Screenshot 2025-04-24 at 01.32.33

Policy Gradient#

We have established the goal of reinforcement learning: to find an optimal policy $\pi^*$ that maximizes the expected return $J(\pi)$. Great. But how do we actually represent and find this policy?

Typically, especially when dealing with complex problems, we do not search over all possible policies. Instead, we define a parameterized policy, denoted $\pi_{\theta}$. You can think of $\theta$ as a set of "knobs" or parameters; if our policy is a neural network, then $\theta$ is the network's weights and biases.

Our goal now becomes: how do we adjust these knobs $\theta$ to maximize our expected return?

Note

Under the parameterized policy $\pi_{\theta}$, the expected return over all possible trajectories is:

$$J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}} [R(\tau)]$$

This means that the expected return depends on the trajectories $\tau$, and the distribution of trajectories depends on the actions chosen by our specific policy $\pi_{\theta}$. Changing $\theta$ changes the policy, which changes the trajectories, which in turn changes the expected return.

Note

We want to maximize $J(\pi_{\theta})$ by changing $\theta$. In deep learning, we generally use gradient descent to minimize a loss function; here we want to maximize a function $J$, so we use gradient ascent! It is like climbing a mountain: find the direction of steepest ascent (the gradient) and take a step in that direction.

Our policy $\pi_{\theta}$ is a neural network, and we will iteratively adjust its parameters $\theta$ to increase $J(\pi_{\theta})$. The update rule looks very familiar (just replace the minus sign of gradient descent with a plus sign):

$$\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta})\big|_{\theta_k}$$
  • $\theta_k$: our parameters at the $k$-th iteration.
  • $\alpha$: learning rate (step size).
  • $\nabla_{\theta} J(\pi_{\theta})\big|_{\theta_k}$: the gradient of the expected return $J$ with respect to the parameters $\theta$, evaluated at the current parameters $\theta_k$. It tells us which direction in parameter space increases $J$ the most.

Screenshot 2025-04-24 at 13.51.52

Important

PG Derivation
Here I will be deliberately verbose and reintroduce all the indices; an ADHD-friendly derivation...

Step 1: restate that the object we want the gradient of is the expected return $J(\pi_{\theta})$:

$$\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} E_{\tau \sim \pi_{\theta}} [R(\tau)]$$

Here:

  • $J(\pi_{\theta})$ is the expected return.
  • $E_{\tau \sim \pi_{\theta}}[\cdot]$ denotes the expected value, calculated over all possible trajectories ($\tau$). A trajectory $\tau$ is a series of states and actions generated by the agent interacting with the environment: $(s_0, a_0, s_1, a_1, \dots)$.
  • $\tau \sim \pi_{\theta}$ indicates that these trajectories are generated according to our current policy $\pi_{\theta}$.
  • $R(\tau)$ is the total return obtained from a complete trajectory $\tau$ (usually the discounted return).
  • $\nabla_{\theta}$ is the gradient operator, indicating that we take partial derivatives with respect to the parameters $\theta$.

Step 2: expand the expression for the expectation.
What is the definition of expected value? For a random variable $X$, its expectation $E[X]$ can be calculated from its probability distribution $p(x)$:

  • If it is a continuous variable: $E[X] = \int p(x)\, x\, dx$
  • If it is a discrete variable: $E[X] = \sum_x p(x)\, x$

In our case, the random variable is the return of the trajectory, $R(\tau)$, and the probability distribution is the probability of the trajectory occurring, $P(\tau|\pi_{\theta})$ (the probability of trajectory $\tau$ under policy $\pi_{\theta}$). Thus, the expectation can be expressed in integral (or summation) form:

$$E_{\tau \sim \pi_{\theta}} [R(\tau)] = \int P(\tau|\pi_{\theta}) R(\tau)\, d\tau$$

(Here, the integral symbol $\int$ stands for summing or integrating over all possible trajectories, whichever is more general.)
Substituting this into the formula from Step 1:

$$\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \int P(\tau|\pi_{\theta}) R(\tau)\, d\tau$$


Step 3: Move the gradient operator inside the integral

$$\nabla_{\theta} \int P(\tau|\pi_{\theta}) R(\tau)\, d\tau = \int \nabla_{\theta} [P(\tau|\pi_{\theta}) R(\tau)]\, d\tau$$

This requires some knowledge of calculus: under certain conditions (which we usually assume are satisfied in reinforcement learning), we can exchange the order of differentiation and integration, just like $\frac{d}{dx} \sum_i f_i(x) = \sum_i \frac{d}{dx} f_i(x)$.
Next, note that $R(\tau)$ is the total return once a trajectory is fixed; its value does not directly depend on the policy parameters $\theta$. (It is the policy $\pi_{\theta}$ that affects which trajectory occurs, not the return of that trajectory once it has occurred.) Therefore, the gradient $\nabla_{\theta}$ only needs to act on $P(\tau|\pi_{\theta})$:

$$= \int [\nabla_{\theta} P(\tau|\pi_{\theta})]\, R(\tau)\, d\tau$$

This step tells us that the change in expected return is due to the change in each trajectory's probability $P(\tau|\pi_{\theta})$ caused by changing the parameters $\theta$, multiplied by that trajectory's return $R(\tau)$, summed over all trajectories.


Step 4: Log-derivative trick
This is the most central and clever step in the entire derivation! We need to introduce an identity.

  • Calculus review (chain rule and log derivative): recall that the derivative of the natural logarithm $\log(x)$ (meaning $\ln(x)$) satisfies $\frac{d}{dx} \log(f(x)) = \frac{1}{f(x)} \frac{d f(x)}{dx} = \frac{f'(x)}{f(x)}$.
  • Rearranging slightly, we get: $f'(x) = f(x) \frac{d}{dx} \log(f(x))$.
    Now we apply this trick to the gradient. Let $f(x)$ correspond to $P(\tau|\pi_{\theta})$, and let the variable $x$ correspond to the parameters $\theta$. Then:

$$\nabla_{\theta} P(\tau|\pi_{\theta}) = P(\tau|\pi_{\theta})\, \nabla_{\theta} \log P(\tau|\pi_{\theta})$$

Substituting this result into the integral from Step 3:

$$\int [\nabla_{\theta} P(\tau|\pi_{\theta})]\, R(\tau)\, d\tau = \int [P(\tau|\pi_{\theta})\, \nabla_{\theta} \log P(\tau|\pi_{\theta})]\, R(\tau)\, d\tau$$


Step 5: Return to expectation form
Observe the result from Step 4:

$$\int P(\tau|\pi_{\theta})\, [\nabla_{\theta} \log P(\tau|\pi_{\theta})\, R(\tau)]\, d\tau$$

This fits the definition of an expectation: $E[f(\tau)] = \int P(\tau|\pi_{\theta}) f(\tau)\, d\tau$.
Here, $f(\tau)$ corresponds to the entire bracketed quantity $[\nabla_{\theta} \log P(\tau|\pi_{\theta})\, R(\tau)]$.
Therefore, the entire integral can be written back in expectation form:

$$= E_{\tau \sim \pi_{\theta}} [\nabla_{\theta} \log P(\tau|\pi_{\theta})\, R(\tau)]$$

Significant implication! We have transformed the gradient of an expectation, $\nabla_{\theta} E[\cdot]$, into the expectation of some quantity (a gradient multiplied by a return), $E[\nabla(\cdot) \times R]$. This form is very important because it can be approximated through sampling! We do not need to actually compute the integral over all trajectories: we just sample many trajectories $\tau$, compute the bracketed value $\nabla_{\theta} \log P(\tau|\pi_{\theta})\, R(\tau)$ for each trajectory, and then average to obtain an approximate gradient.


Step 6: Expand the gradient of log probability (Expression for grad-log-prob)
Now we need to handle the term $\nabla_{\theta} \log P(\tau|\pi_{\theta})$ inside the expectation.
Recall that the trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ (assuming the trajectory consists of $T+1$ states and $T+1$ actions, i.e., $T$ time steps). The probability of a trajectory occurring is:

$$P(\tau|\pi_{\theta}) = p_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_{\theta}(a_t|s_t)$$

  • $p_0(s_0)$: the probability of the initial state $s_0$.

  • $P(s_{t+1}|s_t, a_t)$: the environment dynamics, the probability of transitioning to state $s_{t+1}$ after executing action $a_t$ in state $s_t$.

  • $\pi_{\theta}(a_t|s_t)$: the policy, the probability of selecting action $a_t$ in state $s_t$ (this is the part that depends on $\theta$).

  • Mathematical review (log properties): $\log(a \times b) = \log a + \log b$ and $\log\left(\prod_{i} x_i\right) = \sum_{i} \log x_i$.
    Taking the logarithm of $P(\tau|\pi_{\theta})$:

$$\log P(\tau|\pi_{\theta}) = \log p_0(s_0) + \sum_{t=0}^{T} [\log P(s_{t+1}|s_t, a_t) + \log \pi_{\theta}(a_t|s_t)]$$

Now take the gradient $\nabla_{\theta}$ of the above expression:

  • Mathematical review (gradient properties): the sum rule $\nabla(f+g) = \nabla f + \nabla g$; the gradient $\nabla_{\theta}$ only acts on terms that depend on $\theta$.

$$\nabla_{\theta} \log P(\tau|\pi_{\theta}) = \nabla_{\theta} \log p_0(s_0) + \sum_{t=0}^{T} [\nabla_{\theta} \log P(s_{t+1}|s_t, a_t) + \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)]$$

  • Key points:
    • The initial state probability $\log p_0(s_0)$ typically does not depend on the policy parameters $\theta$, so $\nabla_{\theta} \log p_0(s_0) = 0$.
    • The environment dynamics $\log P(s_{t+1}|s_t, a_t)$ describe properties of the environment itself and also do not depend on the policy parameters $\theta$, so $\nabla_{\theta} \log P(s_{t+1}|s_t, a_t) = 0$.
    • Only the policy term $\log \pi_{\theta}(a_t|s_t)$ depends on $\theta$.
      Therefore, the expression simplifies to:

$$\nabla_{\theta} \log P(\tau|\pi_{\theta}) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)$$

The gradient of the log probability of the entire trajectory equals the sum of the log-probability gradients of each action in that trajectory! This greatly simplifies the computation.


Step 7: The final policy gradient theorem
Substitute the simplified result from Step 6 back into the expectation formula from Step 5:

$$\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}} \left[\left(\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\right) R(\tau)\right]$$

This is the final form of the Policy Gradient Theorem (or one of its common forms).
The gradient of the expected return $J(\pi_{\theta})$ with respect to the parameters $\theta$ equals: "sample a trajectory $\tau$, compute its total return $R(\tau)$, multiply it by the sum of the log-probability gradients of the policy over all (state, action) pairs in that trajectory, $\sum_t \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)$, and then take the expectation (average) over all possible trajectories."

Clearly, enumerating all trajectories is prohibitively expensive (for example, we would need to sample every possible generation of max_token_length=100), so we approximate the expectation with the sample mean:

Note

Monte Carlo approximation:
  • Run the current policy $\pi_{\theta}$ and collect $N$ trajectories, forming the dataset $D = \{\tau_1, ..., \tau_N\}$ (let $N = |D|$).

  • Use the average of these samples to approximate the expected value:

$$\nabla_{\theta} J(\pi_{\theta}) \approx \hat{g} = \frac{1}{|D|} \sum_{\tau \in D} \left(\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\right) R(\tau)$$
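In an autodiff framework we usually do not assemble $\hat{g}$ by hand; instead we build a surrogate loss whose gradient is $-\hat{g}$ and let backpropagation do the work. A minimal PyTorch sketch (not from the course), assuming `trajectories` is a list of `(log_probs, R)` pairs already collected with the current policy:

```python
import torch

def reinforce_loss(trajectories):
    """Surrogate loss whose gradient is -g_hat, so minimizing it performs gradient ascent on J."""
    losses = []
    for log_probs, R in trajectories:        # log_probs: tensor of log pi(a_t|s_t); R: scalar return R(tau)
        losses.append(-log_probs.sum() * R)  # -(sum_t log pi(a_t|s_t)) * R(tau)
    return torch.stack(losses).mean()        # average over the |D| sampled trajectories
```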

Application to LM Policy#

By obtaining the log probabilities of each state-action pair in the sampled trajectory through the generation process shown in the figure, we can now backpropagate to calculate the gradient.

Screenshot 2025-04-24 at 15.17.06

Then, multiply each gradient by the reward from the RM and input it into the expression to perform gradient ascent optimization:

Screenshot 2025-04-24 at 15.23.14
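A sketch of how these per-token log probabilities can be obtained from a HuggingFace-style causal LM; `model`, `full_ids`, and `prompt_len` are assumed inputs (a loaded model, the concatenated prompt+response token ids, and the number of prompt tokens):

```python
import torch.nn.functional as F

def response_log_probs(model, full_ids, prompt_len):
    """Log-probability of each generated (response) token, given everything before it."""
    logits = model(full_ids).logits                       # [batch, seq_len, vocab]
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # position t predicts token t+1
    targets = full_ids[:, 1:]                             # the tokens actually produced
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs[:, prompt_len - 1:]            # keep only the response (action) tokens
```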

High Variance#

PG algorithms perform well on small problems, but there are some issues when used for language modeling.

Note

The central limit theorem tells us that, with a large enough sample size, the sample mean approximately follows a normal distribution, which makes estimates predictable. When the sample size is small, however, the sample mean fluctuates a lot, and a single estimate can be far off. Since sampling many trajectories from the LM is very expensive, we are stuck with small samples, which leads to high variance in the gradient estimates.

How can we reduce variance without increasing the sample size?

  1. Remove historical rewards: reward-to-go
    The current action cannot affect rewards already obtained in the past, and those past rewards only add unnecessary noise (this relates to the credit assignment problem in RL). Removing the past terms avoids that noise and brings the estimated gradient closer to the true gradient. So, instead of summing rewards from the very start, we only count the rewards of actions from the current time step onward (see the sketch after this list).

Screenshot 2025-04-24 at 15.48.52

  2. Introduce a baseline
    RL research has confirmed that subtracting a term that depends only on the state (it can even be a constant) reduces variance without biasing the gradient. Here we choose the value function $V^\pi(s)$.
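Here is the sketch referenced in item 1: a backward pass that turns per-step rewards into rewards-to-go; the baseline from item 2 then becomes a simple subtraction of $V(s_t)$.

```python
def rewards_to_go(rewards, gamma=0.99):
    """R_t = r_t + gamma*r_{t+1} + ... : each action is only credited with what comes after it."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(rewards_to_go([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```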

Value Function#

$V^\pi(s)$ tells you the expected reward of the remaining trajectory when acting according to the current policy.

Examples of value definitions in classic RL scenarios and LM scenarios:

Screenshot 2025-04-24 at 16.01.05

In practice, we initialize from the LM we are trying to optimize and add a linear layer on top to estimate the value, so that the Transformer layers are shared between language modeling (projecting tokens onto the vocabulary) and value estimation.

Screenshot 2025-04-24 at 16.45.48

The reward-to-go mentioned earlier is called the Q function in RL: the expected return obtained by taking this action from the current state and then completing the subsequent actions according to the policy:

Screenshot 2025-04-24 at 16.54.58

By introducing the value function, we obtain the difference between Q and V, which is referred to as the Advantage function.

Screenshot 2025-04-24 at 16.56.48

This advantage term $A^\pi(s_t, a_t)$ indicates how much better this specific action is compared to the average action that can be taken in state $s$.

Screenshot 2025-04-24 at 17.04.33

In the state pointed to by the red arrow in the figure, the advantage function for moving down will be higher than that of other actions.

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t})\, A^{\pi}(s_{i,t}, a_{i,t})$$

Multiplying the gradient by the advantage function increases the log probability of actions with high advantage and decreases the log probability of actions that yield below-average returns.

$$\begin{aligned} \hat{A}^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) \\ &= [r(s_t, a_t) + \gamma V^\pi(s_{t+1})] - V^\pi(s_t) \end{aligned}$$

Note

In traditional reinforcement learning methods, the Q network and the V network are usually independent: the Q function estimates the expected total return of executing action $a$ in state $s$, while the V function simply estimates the value of state $s$. This requires two different neural networks to compute the two values separately.

However, we have now introduced the advantage function $A_{\theta}(s, a)$, which is the difference between the Q value and the V value:

$$A_{\theta}(s, a) = Q_{\theta}(s, a) - V_{\theta}(s)$$

By expressing $A_{\theta}(s, a)$ as the difference between $Q_{\theta}(s, a)$ and $V_{\theta}(s)$, we find that we only need to train one network to output $V_{\theta}(s)$; the Q value can then be computed from the reward $r_t$ and the discount factor $\gamma$.

Thus, only one neural network is needed, which primarily predicts $V_{\theta}(s)$. The Q value is computed as:

$$Q_{\theta}(s_t, a) = r_t + \gamma V_{\theta}(s_{t+1})$$

The advantage function is then:

$$A_{\theta}(s_t, a) = r_t + \gamma V_{\theta}(s_{t+1}) - V_{\theta}(s_t)$$

Advantage Sampling#

Short-step advantage estimators have high bias but low variance, while long-step advantage estimators have low bias but high variance. This trade-off is a part of reinforcement learning that requires careful selection and adjustment, depending on the stability requirements of the model and training efficiency.
Screenshot 2025-05-08 at 21.53.04
An example: "A short-term memory person only remembers what happened yesterday; although not comprehensive, it is very stable; a long-term memory person can see the whole picture for the next few days but may be disturbed by more unknown factors."

GAE#

To address this bias-variance problem, we can use GAE (Generalized Advantage Estimation), which is essentially a weighted sum of the advantage estimates at all horizons, each multiplied by a decay factor.

Note

Now let's talk about TD error.
Online learning has a wonderful aspect: you do not need to wait until the end of an episode to update the policy. This is where the Temporal Difference error (TD error) comes into play:

$$\delta = r + \gamma V(s') - V(s)$$

The key here is that the TD error is actually an online estimate of the advantage function. It tells you, at this moment, whether your action makes the future state better than you expected. The error $\delta$ directly reflects the concept of advantage:

  • If $\delta > 0$: "Hey, this action is better than I imagined!" (the advantage is positive).
  • If $\delta < 0$: "Hmm, I thought it would be better..." (the advantage is negative).

This allows you to gradually adjust your policy without waiting for an entire episode to end. This is an excellent strategy for improving efficiency.

The purpose of GAE is to provide a better estimate of the advantage function $A^\pi(s,a)$ for policy gradient algorithms than the raw return $R(\tau)$ or the simple TD error $\delta_t$, in order to reduce the variance of gradient estimates and improve learning stability and efficiency.

$$\begin{aligned} \delta_t &= r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \\ \hat{A}_t &= \delta_t + \gamma \lambda \hat{A}_{t+1} \end{aligned}$$

This formula defines the generalized advantage estimate $\hat{A}_t$ recursively. It does not only look at the one-step TD error $\delta_t$ but integrates TD-error information from multiple future steps.

This recursion is computed backwards from the end of the trajectory (episode), with $T$ the last step and $\hat{A}_{T+1} = 0$:
* $\hat{A}_T = \delta_T + 0 = \delta_T$
* $\hat{A}_{T-1} = \delta_{T-1} + \gamma \lambda \hat{A}_T = \delta_{T-1} + (\gamma \lambda) \delta_T$
* $\hat{A}_{T-2} = \delta_{T-2} + \gamma \lambda \hat{A}_{T-1} = \delta_{T-2} + (\gamma \lambda) \delta_{T-1} + (\gamma \lambda)^2 \delta_T$
* ...
* General form: $\hat{A}_t = \sum_{k=0}^{\infty} (\gamma \lambda)^k \delta_{t+k}$ (assuming an infinite horizon, or that $\delta$ is 0 after the terminal state).

The parameter $\lambda$ ($0 \le \lambda \le 1$) is the key to GAE; it controls the bias and variance of the estimate $\hat{A}_t$:

  • When $\lambda = 0$:
    • $\hat{A}_t = \delta_t$: GAE degenerates into the simple one-step TD error. This estimate has lower variance (it relies only on the next step's information) but may have higher bias (it depends heavily on the potentially inaccurate estimate of $V^{\pi}(s_{t+1})$).
  • When $\lambda = 1$:
    • $\hat{A}_t = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$. It can be shown that this equals $\hat{A}_t = \left(\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right) - V^{\pi}(s_t)$, i.e., the Monte Carlo (MC) return minus the baseline. This estimate has lower bias (it uses the complete actual return starting from time $t$) but usually high variance (it accumulates randomness over many time steps).
  • When $0 < \lambda < 1$:
    • GAE interpolates between the two extremes above. The closer $\lambda$ is to 0, the more it leans towards the low-variance, high-bias TD estimate; the closer $\lambda$ is to 1, the more it leans towards the high-variance, low-bias MC estimate.
    • By choosing an appropriate $\lambda$ (e.g., 0.97), GAE attempts to strike a good balance between bias and variance, yielding an advantage estimate that is both relatively accurate (controlled bias) and relatively stable (lower variance).
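The recursion above translates directly into a single backward loop over the trajectory. A minimal sketch with plain Python lists; `values` is assumed to hold $V(s_0), \dots, V(s_T)$ plus a bootstrap value for the state after the last step (0 if terminal):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory; len(values) == len(rewards) + 1."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                         # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages
```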

Advantage in Language Models#

As shown in the figure, the goal is to increase the log probability of the token "Shanghai" in the current state while decreasing the log probability of "chocolate," as the advantage of choosing "Shanghai" is higher than that of choosing "chocolate" (a random token).

Screenshot 2025-04-24 at 23.13.57

Importance Sampling and Offline Learning#

In many cases, we may want to compute $E_{x \sim p(x)}[f(x)]$, but:

  1. It is difficult or impossible to sample $x$ directly from the target distribution $p(x)$.
  2. Or sampling from $p(x)$ is inefficient, which is our situation: sampling from the LM is too costly.

However, we may easily sample from another alternative, or proposal, distribution $q(x)$.

Importance Sampling (IS) is a technique for estimating the expectation under a target distribution by sampling from a different distribution, transforming an expectation under the probability distribution $p(x)$ into the expectation of a related function under another distribution $q(x)$:

$$\begin{aligned} E_{x \sim p(x)}[f(x)] &= \int p(x) f(x) \, dx \\ &= \int \frac{q(x)}{q(x)} p(x) f(x) \, dx \quad \text{(assuming } q(x) \neq 0 \text{)} \\ &= \int q(x) \frac{p(x)}{q(x)} f(x) \, dx \\ &= E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right] \end{aligned}$$

The key here is the introduction of the importance weight $w(x) = \frac{p(x)}{q(x)}$. This weight corrects the bias: for a sample $x_i$ drawn from $q(x)$, if its probability under the target distribution $p(x)$ is higher ($p(x_i) > q(x_i)$), it receives a weight greater than 1; conversely, if its probability under $p(x)$ is lower ($p(x_i) < q(x_i)$), it receives a weight less than 1. By averaging with these weights, we obtain a (usually unbiased or consistent) estimate of the original expectation $E_{x \sim p(x)}[f(x)]$.

Importance sampling allows us to:

  1. Draw samples $x_1, x_2, ..., x_N$ from an easy-to-sample distribution $q(x)$.
  2. Estimate the original expected value by computing the weighted average:
$$E_{x \sim p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i)$$
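A tiny self-contained numeric sketch of this estimator, with an arbitrarily chosen target $p = \mathcal{N}(0,1)$ and proposal $q = \mathcal{N}(1,1)$:

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Estimate E_{x~p}[x^2] (true value 1.0) using only samples from q, reweighted by p(x)/q(x).
random.seed(0)
xs = [random.gauss(1.0, 1.0) for _ in range(100_000)]        # x_i ~ q
ws = [normal_pdf(x, 0.0) / normal_pdf(x, 1.0) for x in xs]   # importance weights w(x_i)
print(sum(w * x**2 for w, x in zip(ws, xs)) / len(xs))       # approximately 1.0
```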

Returning to our scenario: previously, we obtained the on-policy policy gradient estimate:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t})\, A^{\pi}(s_{i,t}, a_{i,t})$$

Note

On-policy means that the policy used to collect data is the same as the one being trained. Since the computation requires trajectories generated from the current policy $\pi_{\theta}$, old data cannot be directly reused after each policy update, which leads to low sample efficiency. As for mini_batch_num > 1: is the data used for those updates still on-policy? Strictly speaking it probably isn't, so it could be understood as semi-on-policy (the terminology here may not be rigorous).

On-policy emphasizes whether the current policy model can interact with the environment. [2]

We hope to utilize data generated by the old policy $\pi_{\theta_{OFFLINE}}$ to estimate the gradient of the current new policy $\pi_{\theta_{ONLINE}}$. This way, we can reuse data and improve sample efficiency.

Recall the principle of importance sampling (IS): $E_{x \sim p(x)}[f(x)] = E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]$.
Mapping this onto our PG (simplifying to single-step decisions):
* The target distribution $p(x)$ corresponds to the new policy $\pi_{\theta_{ONLINE}}(a|s)$.
* The sampling distribution $q(x)$ corresponds to the old policy $\pi_{\theta_{OFFLINE}}(a|s)$.
* The importance weight is $w_t = \frac{\pi_{\theta_{ONLINE}}(a_t|s_t)}{\pi_{\theta_{OFFLINE}}(a_t|s_t)}$.

Applying the importance weight to each term of the on-policy gradient (at each time step $t$), we obtain the standard off-policy estimate:

$$\nabla_{\theta_{ONLINE}} J(\theta_{ONLINE},\theta_{OFFLINE}) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \left[ \frac{\pi_{\theta_{ONLINE}}(a_{i,t}|s_{i,t})}{\pi_{\theta_{OFFLINE}}(a_{i,t}|s_{i,t})} \nabla_{\theta_{ONLINE}} \log \pi_{\theta_{ONLINE}}(a_{i,t} \mid s_{i,t})\, A^{\pi}(s_{i,t}, a_{i,t}) \right]$$

Now we have a way to perform many gradient-ascent updates without sampling from the policy being optimized (the model being trained) at every step: sample once, save the trajectories to memory or a database, optimize the policy over mini-batches, and then re-initialize the offline (sampling) policy with the new policy.

PPO Loss#

The PPO loss mainly consists of three parts: the policy loss ($L_{POLICY}$), the value function loss ($L_{VF}$), and the entropy reward ($L_{ENTROPY}$).

1. Policy Loss ($L_{POLICY}$)#

Clipped Surrogate Objective

$$L_{POLICY} = \min \left( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t,\ \text{clip} \left( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right)$$

This is the core of PPO. You will notice that it resembles the off-policy policy gradient objective derived using importance sampling, but with a key modification.

  • $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$: the importance sampling ratio, denoted $r_t(\theta)$. It is the probability of taking action $a_t$ in state $s_t$ under the current (online) policy $\pi_{\theta}$, divided by the probability of taking that action under the old (offline) policy $\pi_{\theta_{old}}$ that collected the trajectory data. This ratio corrects for the fact that the data comes from a policy slightly different from the one we are currently improving.

  • $\hat{A}_t$: the estimated advantage function, computed with GAE to balance bias and variance. It tells us how much better or worse taking action $a_t$ in state $s_t$ is compared to the average action in that state (as judged by the current value function).

  • The clip function: this is the key point of PPO.
    $\text{clip}\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right)$
    It essentially says: if the probability ratio $r_t(\theta)$ deviates too far from 1 (either too high or too low), we "clip" it. So if $r_t(\theta)$ tries to become 1.5 and $\epsilon$ is 0.2, it is clipped to 1.2; if it tries to become 0.5, it is clipped to 0.8.
    The parameter $\epsilon$ (epsilon) is a small hyperparameter (e.g., 0.1 or 0.2) that defines the clipping range $[1-\epsilon, 1+\epsilon]$.

  • The min function: the objective takes the smaller of the two terms below:

    1. Unclipped objective: $r_t(\theta) \hat{A}_t$
    2. Clipped objective: $\text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t$

    Why do it this way? The goal of policy gradients is to increase the probability of actions with positive advantage and decrease the probability of actions with negative advantage.
    However, with importance sampling, a very large $r_t(\theta)$ can cause huge, unstable updates. PPO keeps the new policy close to the old one by clipping this ratio.

    • If $\hat{A}_t > 0$ (good action): we want to increase $\pi_{\theta}(a_t|s_t)$. The min means that once $r_t(\theta)$ grows beyond $1+\epsilon$, the objective is capped at $(1+\epsilon)\hat{A}_t$. This prevents the policy from changing too much in a single update, even if the unclipped objective suggests a larger increment.
    • If $\hat{A}_t < 0$ (bad action): we want to decrease $\pi_{\theta}(a_t|s_t)$. Once $r_t(\theta)$ has already shrunk below $1-\epsilon$, the min selects the clipped term $(1-\epsilon)\hat{A}_t$, whose gradient with respect to $\theta$ is zero, so there is no incentive to push that action's probability down any further.
      More precisely, when $\hat{A}_t < 0$ the min keeps the more pessimistic (more negative) of the two values, so once the ratio leaves the $[1-\epsilon, 1+\epsilon]$ interval in the direction the advantage favors, the update gains nothing from moving further: the probability of that action is not reduced too aggressively in a single step.
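A minimal PyTorch sketch of this clipped surrogate, written as a loss to be minimized (the negated objective); all tensors are assumed to be per-token values gathered from the rollout:

```python
import torch

def ppo_policy_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective, negated so that gradient descent maximizes it."""
    ratio = torch.exp(log_probs_new - log_probs_old)                 # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```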

2. Value Function Loss ($L_{VF}$)#

$$L_{VF} = \frac{1}{2} \left\| V_{\theta}(s) - \left( \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\Big|\, s_0 = s \right) \right\|^2_2$$

This is the value-function regression we have already described:

  • $V_{\theta}(s)$ is the output of the value network (i.e., a linear layer added on top of the LLM to predict the expected cumulative reward starting from state $s$).
  • $\sum \gamma^{t'} r_{t'}$ (referred to as $G_s$, the target value) is the actual total discounted reward observed starting from state $s$ and following the current policy until the episode ends. This is the empirical target we set for $V_{\theta}(s)$.
  • This loss function is the mean squared error (MSE) between the predicted value $V_{\theta}(s)$ and the observed target value $G_s$. We want the value function to accurately predict future rewards; this value function is crucial for calculating the advantage $\hat{A}_t$.

3. Entropy Reward ($L_{ENTROPY}$)#

$$L_{ENTROPY} = \sum_x p(x) \log p(x) = -H(p)$$
  • Here, $p(x)$ (more precisely $\pi_{\theta}(a|s)$, the probability the current policy assigns to each possible action $a$ in state $s$) is the action probability distribution output by the current policy in the given state.
  • The entropy of this distribution is $H(p) = -\sum_x p(x) \log p(x)$. Entropy measures the randomness or uncertainty of the distribution: a uniform distribution (very random) has high entropy, while a peaked distribution (very certain about one specific action) has low entropy.
  • The loss term is the negative entropy. When we minimize this $L_{ENTROPY}$ within the total loss $L_{PPO}$ (assuming $c_2$ is positive), we are actually maximizing the entropy of the policy.

Encouraging higher entropy promotes exploration, making the policy a bit more random, trying different actions (in the case of LLM, trying different tokens), rather than quickly converging to a potentially suboptimal deterministic policy. This helps the agent discover better strategies.

Final Form $L_{PPO}$#

The final PPO loss is the weighted sum of these three parts:

$$L_{PPO} = L_{POLICY} + c_1 L_{VF} + c_2 L_{ENTROPY}$$
  • $c_1 L_{VF}$: the value function loss, weighted by $c_1$. A common value for $c_1$ is around $0.5$.
  • $c_2 L_{ENTROPY}$: the entropy reward (with $c_2 > 0$ it is effectively a penalty on low entropy), weighted by $c_2$. $c_2$ is usually a small positive number (e.g., $0.01$), enough to encourage exploration without overwhelming the main policy objective.

The agent's parameters (i.e., the weights of the LLM) are updated by computing the gradient of this combined loss $L_{PPO}$ and performing gradient descent.
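Putting the three parts together under the sign conventions above (the policy term is negated so everything can be minimized), here is a sketch of the combined loss; it reuses `ppo_policy_loss` from the earlier sketch, and `values`, `returns`, and `logits` are assumed to come from the value head and the rollout:

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(log_probs_new, log_probs_old, advantages,
                   values, returns, logits, c1=0.5, c2=0.01, eps=0.2):
    policy_loss = ppo_policy_loss(log_probs_new, log_probs_old, advantages, eps)  # negated L_POLICY
    value_loss = 0.5 * F.mse_loss(values, returns)                                # L_VF
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()               # H(pi_theta(.|s))
    return policy_loss + c1 * value_loss - c2 * entropy  # -c2*H == +c2*L_ENTROPY (negative entropy)
```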

Reference Model#

Reward Hacking#

A major issue in RL is reward hacking: the model may learn to always output tokens or sequences that earn high rewards but are meaningless to humans, such as repeatedly saying "thank you" to boost a politeness score. We therefore want the outputs of the aligned model (after RL post-training) to stay as close as possible to the original model's outputs.

To achieve this, another model with frozen weights (the reference model) is kept. When the reward model produces a reward at each step of the trajectory, that reward is penalized by the KL divergence between the reference model's and the optimized model's log probabilities. This prevents the model from generating answers that drift too far from the original model, avoiding the cheating behavior described above.

Screenshot 2025-05-08 at 00.43.14
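A common way to implement this in RLHF pipelines is to fold the penalty into the per-token rewards: every token is penalized by the per-token log-probability difference (a sample-based KL estimate), and the scalar reward-model score is added to the last token. A hedged sketch under that assumption; `rm_score` is the scalar from the RM and the other inputs are per-token tensors for one response:

```python
def kl_penalized_rewards(rm_score, logprobs_actor, logprobs_ref, beta=0.1):
    """Per-token rewards: KL penalty at every token, RM score added to the final token."""
    kl = logprobs_actor - logprobs_ref    # per-token log-ratio between actor and frozen reference
    rewards = -beta * kl                  # penalize drifting away from the reference model
    rewards[-1] = rewards[-1] + rm_score  # the scalar reward lands on the last generated token
    return rewards
```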

Code Walkthrough#

trl#

class AutoModelForCausalLMWithValueHead(PreTrainedModelWrapper):
    # ... (class attributes like transformers_parent_class) ...

The core purpose of this class is to bundle a standard causal language model (our Actor model, the text-generating policy $\pi_\theta$) with a Value Head (the Critic model, which estimates the state value $V(s)$). PPO / actor-critic algorithms need both the policy and the value function at the same time, and this class provides a unified model structure that outputs both.

    def __init__(self, pretrained_model, **kwargs):
        super().__init__(pretrained_model, **kwargs) # Basic setup
        v_head_kwargs, _, _ = self._split_kwargs(kwargs) # Separate parameters for ValueHead

        # Ensure the passed model has language model output capability
        if not any(hasattr(self.pretrained_model, attribute) for attribute in self.lm_head_namings):
            raise ValueError("The model does not have a language model head...")

        # Create an instance of ValueHead, which will learn to predict the value of state V(s)
        self.v_head = ValueHead(self.pretrained_model.config, **v_head_kwargs)

        # Initialize the weights of ValueHead
        self._init_weights(**v_head_kwargs) # Default random initialization, can also specify normal distribution initialization
  1. Acting as Actor: this is our language model pretrained_model, which generates responses (actions $a$, i.e., a series of tokens) based on the current prompt (state $s$).
  2. Critic: evaluates how "good" the Actor's situation is in a given state $s$, outputting $V(s)$. This is the job of the linear layer self.v_head.
    def forward(
        self,
        input_ids=None, # Input token IDs (state s)
        attention_mask=None,
        past_key_values=None, # Used to accelerate generation
        **kwargs,
    ):
        # Force the underlying model to output hidden_states, which ValueHead needs as input
        kwargs["output_hidden_states"] = True
        # ... (handle some details of past_key_values and PEFT, can be ignored for core understanding of PPO)

        # 1. Actor (base language model) performs computation
        base_model_output = self.pretrained_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **kwargs,
        )

        # 2. Extract Actor's output (for policy update) and Critic's input
        lm_logits = base_model_output.logits # Actor's output: predicted probability distribution of the next token
        # This forms the basis for calculating L_POLICY and L_ENTROPY in PPO.

        last_hidden_state = base_model_output.hidden_states[-1] # Critic's input: the last hidden state of the LM,
        # representing the representation of the current state s.

        # (Optional) The loss of the language model itself, usually not directly used in the RL phase
        loss = base_model_output.loss

        # (Ensure data and model are on the same device)
        if last_hidden_state.device != self.v_head.summary.weight.device:
            last_hidden_state = last_hidden_state.to(self.v_head.summary.weight.device)

        # 3. Critic (ValueHead) performs computation
        # ValueHead receives the state representation and outputs the value estimate for that state V(s)
        value = self.v_head(last_hidden_state).squeeze(-1) # This forms the basis for calculating the value loss L_VF and advantage A_hat in PPO.

        # (Ensure logits are float32 for numerical stability)
        if lm_logits.dtype != torch.float32:
            lm_logits = lm_logits.float()

        # Return Actor's logits, LM loss (which may be None), and Critic's value
        return (lm_logits, loss, value)

For each step of PPO-RLHF training:

  1. We input the current batch of prompts (sequence input_ids) into the model.
  2. self.pretrained_model (the Actor) computes (rolls out) lm_logits. These logits represent the probability distribution over the next tokens the model believes should be generated given the current prompt. Both the policy loss $L_{POLICY}$ and the entropy reward $L_{ENTROPY}$ in PPO are calculated from this distribution $\pi_{\theta}(a_t \mid s_t)$.
  3. Simultaneously, we extract last_hidden_state from base_model_output. This can be seen as a vector representation of the current prompt (state $s$).
  4. This last_hidden_state is fed into self.v_head (the Critic), which outputs a scalar value. This value is the model's estimate of the value of the current state $s$, $V_{\theta}(s)$. The value function loss $L_{VF}$ in PPO optimizes $V_{\theta}(s)$ to be as close as possible to the true return, and $V_{\theta}(s)$ is a key component in computing the advantage $\hat{A}_t$, which in turn guides $L_{POLICY}$.
  5. The same prompt + response sequence is also fed to the Reward and Reference models for inference, yielding rewards and log probabilities (for computing the KL penalty).

Thus, a single forward call provides us with the core information needed to update both the Actor (policy) and the Critic (value function).
The training process can be understood with the help of the following diagram:

rlhf-pipeline

Tip

In RLHF, only the Actor performs Prefill + Decode (full autoregressive generation) during experience collection (rollout); the other models only process existing responses to obtain log probabilities and values, so they run Prefill only.

Additionally, the Actor is involved in both training and inference (i.e., rollout), so it needs both a training engine (such as Megatron, DeepSpeed, or FSDP) and a rollout engine (such as SGLang or vLLM) to handle the respective tasks; the Critic reuses the internal representations from the training forward pass to output new value predictions, so it runs inside the same training engine; the Reference and Reward models only need inference engines to produce log probabilities and rewards. [3]

verl#

Like OpenRLHF, verl is an excellent RLHF framework; a good introductory read is 【AI Infra】VeRL Framework Introduction & Code Walkthrough.

Reference#
