
May 8, 2025 · 50 min read
Deep Learning · RLHF · LLM

Human-Crafted

Written directly by the author with no AI-generated sections.

From RL to RLHF

This article is primarily based on Umar Jamil's course [1], written for learning and note-taking purposes. Our goal is to align LLM behavior with the outputs we want, and RLHF is one of the most prominent techniques for achieving this. Its standard pipeline involves four models (which sounds very VRAM-intensive, which is why many methods optimize by dropping some of them), but for now just remember that there are four in total: the Reward, Actor, Critic, and Reference models. The model we ultimately optimize is the Actor.

LLM to RL

In classical RL, a policy tells you the probability of each action you could take in the current state. In that sense, a language model itself can be viewed as a policy: it receives a prompt (state), outputs a probability distribution over the next token (action), and after sampling arrives at a new state (the sampled token appended to the prompt). It is effectively a policy with an action space of size vocab_size, which makes the LLM an RL agent.
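The loop below is a minimal sketch of this "LLM as policy" view, one state transition only; it assumes the Hugging Face transformers library and uses the public gpt2 checkpoint purely as a stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

state = tokenizer("Shall we start with", return_tensors="pt").input_ids  # state s_t: the prompt
logits = model(state).logits[:, -1, :]            # policy output over the action space (the vocab)
probs = torch.softmax(logits, dim=-1)             # pi(a | s_t)
action = torch.multinomial(probs, num_samples=1)  # sample an action: the next token
state = torch.cat([state, action], dim=-1)        # new state s_{t+1}: token appended to the prompt
```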

So, we are still missing something to provide a Reward (in traditional RL, this is usually a reward function built into the environment).

Creating a "Q-A-Reward" dataset could provide this, but humans are bad at agreeing on absolute scores, while they are very good at comparing quality. So we change direction: sample multiple answers from the model at high temperature, then ask domain experts (humans or AI models) to select the chosen / preferred answer, producing a labeled preference dataset. We use it to train a Reward Model that outputs numerical rewards.

Reward Model

This RM is implemented using a pre-trained LLM like Llama.

[!note] In text generation, we feed the prompt into the Transformer, take the hidden state of the last token, send it through a linear layer that projects it onto the vocabulary to get logits, and then use softmax plus a sampling strategy to select the next token.

When we don't want to generate text but rather a numerical reward, we can replace the linear layer that projects onto the vocabulary with a linear layer that has a single output feature (a scalar), used to produce one score for the entire text sequence.
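A minimal sketch of that architecture, assuming the Hugging Face transformers library; AutoModelForSequenceClassification with num_labels=1 gives exactly a "one scalar per sequence" head on top of a pretrained LM (gpt2 here is just a small placeholder, in practice a model like Llama is used):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("Question: ... Answer: ...", return_tensors="pt")
reward = reward_model(**inputs).logits  # shape (batch, 1): one scalar reward per sequence
```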

(Figure: a pretrained LM whose vocabulary projection is replaced by a scalar-output linear layer, i.e. the Reward Model.)

Reward Model Loss

[!tip] During training, we want this model to generate high rewards for chosen answers and low rewards for rejected answers.

Similar to the Bradley-Terry parameterization form:

$$ Loss = -\log \sigma(r(x, y_w) - r(x, y_l)) $$

$y_w$ denotes the chosen answer and $y_l$ the rejected one. Therefore, when the model gives the chosen answer a higher reward,

$$r(x, y_w) - r(x, y_l) > 0$$

$$\sigma(r(x, y_w) - r(x, y_l)) \in (0.5, 1)$$

$$-\log \sigma(r(x, y_w) - r(x, y_l)) \in (0, \log 2)$$

so the loss is low, while if the model gives the chosen answer a low reward, the loss becomes very high.
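In PyTorch this pairwise loss is a one-liner; a sketch, assuming r_chosen and r_rejected are the scalar scores the reward model assigns to a batch of chosen and rejected answers:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```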

(Figures: worked examples of the reward model loss for high and low chosen-answer rewards.)

The RewardTrainer class in Hugging Face's TRL library accepts an AutoModelForSequenceClassification model (which is exactly the structure described above).

(Figure: RewardTrainer usage example.)

Actor & Critic Model

Trajectories

As mentioned earlier, the core goal of reinforcement learning (RL) is to find a policy $\pi$ that guides the agent's actions so as to obtain the maximum possible expected return.

Mathematically, we express this as finding the optimal policy $\pi^*$ that maximizes the objective function $J(\pi)$:

$$\pi^* = \arg\max_{\pi} J(\pi)$$

The expected return $J(\pi)$ is the average total return the agent is expected to accumulate over many possible episodes when following policy $\pi$.

It is computed by considering all possible trajectories $\tau$ and taking the weighted average (or integral) of each trajectory's total return $R(\tau)$, weighted by the probability $P(\tau|\pi)$ of that trajectory occurring under policy $\pi$.

$$J(\pi) = \int P(\tau|\pi) R(\tau) = E_{\tau \sim \pi}[R(\tau)]$$
  • $E_{\tau \sim \pi}[\cdot]$ denotes the expected value when the trajectory $\tau$ is generated according to policy $\pi$.
  • $R(\tau)$ is the total return (reward) obtained along a single trajectory $\tau$.
  • $P(\tau|\pi)$ is the probability that a specific trajectory $\tau$ occurs when the agent follows policy $\pi$.

A trajectory $\tau$ is a sequence of states and actions experienced by the agent, starting from an initial state. It is one possible "story" or "path" of the agent's interaction with the environment.

$$\tau = (s_0, a_0, s_1, a_1, s_2, a_2, \dots)$$
  • $s_t$: state at time step $t$.
  • $a_t$: action taken at time step $t$ (usually based on state $s_t$ and policy $\pi$).

We typically model the environment as stochastic: executing the same action $a_t$ in the same state $s_t$ does not always lead to exactly the same next state $s_{t+1}$. Randomness is involved.

The next state $s_{t+1}$ is drawn from a probability distribution conditioned on the current state $s_t$ and the action taken $a_t$:

$$s_{t+1} \sim P(\cdot \mid s_t, a_t)$$

Taking stochastic state transitions and the agent's policy into account, we can compute the probability of an entire trajectory. It is the product of the following terms:

  1. The probability of the agent starting in the initial state $s_0$: $p_0(s_0)$.
  2. For each time step $t$ in the trajectory:
    • the probability of the environment transitioning to state $s_{t+1}$ given $s_t$ and $a_t$: $P(s_{t+1}|s_t, a_t)$;
    • the probability of the agent selecting action $a_t$ in state $s_t$ according to its policy: $\pi(a_t|s_t)$.

$$P(\tau|\pi) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t)$$

(where $T$ is the length of the trajectory).

When calculating the total return $R(\tau)$ of a trajectory, we almost always use discounted rewards. This means that rewards received earlier are worth more than rewards received later.

Why?

  • Reflects real-world scenarios (a dollar today is worth more than a dollar tomorrow).
  • Avoids infinite return problems in continuous tasks (tasks without a fixed endpoint).
  • Provides mathematical convenience.

We introduce a discount factor $\gamma$, where $0 \le \gamma < 1$. The closer $\gamma$ is to 0, the more "short-sighted" the agent is (it cares mostly about immediate rewards); the closer $\gamma$ is to 1, the more "far-sighted" the agent is (it cares more about long-term return).

The total discounted return of a trajectory is calculated as follows:

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$$
  • $r_t$ is the immediate reward received at time step $t$.
  • $\gamma^t$ is the discount coefficient applied to the reward at time step $t$.
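As a tiny illustration (not from the course), the discounted return of a finite reward list can be computed directly from this definition:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for a finite list of rewards."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# discounted_return([1.0, 0.0, 2.0], gamma=0.9) == 1.0 + 0.9*0.0 + 0.81*2.0 == 2.62
```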

So what is a trajectory for an LLM? As mentioned earlier, the model is the policy, the prompt is the state, and the next token is the action, so the sequence of (state, action) pairs produced during autoregressive generation makes up the trajectory.

(Figure: autoregressive generation viewed as a trajectory of prompt states and token actions.)

Policy Gradient

We have defined the goal of reinforcement learning: find an optimal policy $\pi^*$ that maximizes the expected return $J(\pi)$. Great. But how do we actually represent and find this policy?

Usually, especially for complex problems, we don't search over all possible policies. We define a parameterized policy, denoted $\pi_\theta$. You can think of $\theta$ as a set of "knobs" or parameters: if our policy is a neural network, then $\theta$ is the network's weights and biases.

Our goal now becomes: how do we adjust these knobs $\theta$ to maximize our expected return?

[!note] Under the policy $\pi_\theta$ with parameters $\theta$, the expected return over all possible trajectories is:

$$J(\pi_\theta) = E_{\tau \sim \pi_\theta}[R(\tau)]$$

The expected return depends on the trajectories $\tau$, and the distribution of trajectories depends on the actions selected by our policy $\pi_\theta$. Changing $\theta$ changes the policy, which changes the trajectories, which changes the expected return.

[!note] We want to maximize $J(\pi_\theta)$ by changing $\theta$. In deep learning, gradient descent is generally used to minimize a loss function; here we want to maximize a function $J$, so we use gradient ascent instead. It's like climbing a hill: find the steepest upward direction (the gradient) and take a step in that direction.

Our policy $\pi_\theta$ is a neural network, and we iteratively adjust its parameters $\theta$ to increase $J(\pi_\theta)$. The update rule will look very familiar (just replace the minus sign of gradient descent with a plus sign):

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$$
  • $\theta_k$: parameters at the $k$-th iteration.
  • $\alpha$: learning rate (step size).
  • $\nabla_\theta J(\pi_\theta)\big|_{\theta_k}$: the gradient of the expected return $J$ with respect to the parameters $\theta$, evaluated at the current parameters $\theta_k$. It tells us which direction in parameter space increases $J$ the fastest.

(Figure: gradient ascent on the expected return.)

[!important] PG Derivation: I'll be a bit verbose at the start and re-introduce all the notation; an ADHD-friendly derivation…

Step 1: restate that the object we are taking the gradient of is the expected return $J(\pi_\theta)$:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta E_{\tau \sim \pi_\theta}[R(\tau)]$$

Here:

  • $J(\pi_\theta)$ is the expected return.
  • $E_{\tau \sim \pi_\theta}[\cdot]$ denotes the expected value, computed over all possible trajectories $\tau$. A trajectory $\tau$ is a series of states and actions generated by the interaction between the agent and the environment, $(s_0, a_0, s_1, a_1, \dots)$.
  • $\tau \sim \pi_\theta$ indicates that these trajectories are generated according to our current policy $\pi_\theta$.
  • $R(\tau)$ is the total return (usually the discounted return) obtained along a complete trajectory $\tau$.
  • $\nabla_\theta$ is the gradient operator, indicating that we take partial derivatives with respect to the parameters $\theta$.

Step 2: expand the expectation. What is the definition of an expected value? For a random variable $X$, its expectation $E[X]$ can be computed from its probability distribution $p(x)$:

  • continuous variable: $E[X] = \int p(x)\, x\, dx$
  • discrete variable: $E[X] = \sum p(x)\, x$

In our case, the random variable is the trajectory return $R(\tau)$, and the probability distribution is the probability of the trajectory occurring, $P(\tau|\pi_\theta)$ (the probability of trajectory $\tau$ given policy $\pi_\theta$). So the expectation can be written as an integral (or a sum):

$$E_{\tau \sim \pi_\theta}[R(\tau)] = \int P(\tau|\pi_\theta)\, R(\tau)\, d\tau$$

(The integral sign $\int$ here stands for summing or integrating over all possible trajectories, which is the more general notation.) Substituting this into the formula from Step 1:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \int P(\tau|\pi_\theta)\, R(\tau)\, d\tau$$


Step 3: move the gradient operator inside the integral.

$$\nabla_\theta \int P(\tau|\pi_\theta)\, R(\tau)\, d\tau = \int \nabla_\theta \left[ P(\tau|\pi_\theta)\, R(\tau) \right] d\tau$$

This needs a bit of calculus: under certain regularity conditions (usually assumed to hold in reinforcement learning), we can exchange the order of differentiation and integration, just as $\frac{d}{dx} \sum f_i(x) = \sum \frac{d}{dx} f_i(x)$. Next, notice that $R(\tau)$ is the total return of a fixed trajectory; its value does not directly depend on the policy parameters $\theta$. (The policy $\pi_\theta$ influences which trajectory happens, not what its return is once that trajectory has happened.) So the gradient $\nabla_\theta$ only acts on $P(\tau|\pi_\theta)$:

$$= \int \left[ \nabla_\theta P(\tau|\pi_\theta) \right] R(\tau)\, d\tau$$

This step tells us that the change in expected return is the effect of the parameters $\theta$ changing each trajectory's probability $P(\tau|\pi_\theta)$, multiplied by that trajectory's return $R(\tau)$, accumulated over all trajectories.

Step 4: the log-derivative trick. This is the most central and ingenious step in the entire derivation. We need an identity.

  • Calculus review (chain rule and logarithmic differentiation): recall that for the natural logarithm $\log(x)$ (i.e. $\ln(x)$), $\frac{d}{dx} \log(f(x)) = \frac{1}{f(x)} \frac{d f(x)}{dx} = \frac{f'(x)}{f(x)}$.
  • With a slight rearrangement, we get $f'(x) = f(x) \frac{d}{dx} \log(f(x))$. Now apply this trick to the gradient: let $f(x)$ correspond to $P(\tau|\pi_\theta)$ and the independent variable $x$ correspond to the parameters $\theta$. Then:

$$\nabla_\theta P(\tau|\pi_\theta) = P(\tau|\pi_\theta)\, \nabla_\theta \log P(\tau|\pi_\theta)$$

Substituting this result into the integral from Step 3:

$$\int \left[ \nabla_\theta P(\tau|\pi_\theta) \right] R(\tau)\, d\tau = \int \left[ P(\tau|\pi_\theta)\, \nabla_\theta \log P(\tau|\pi_\theta) \right] R(\tau)\, d\tau$$


Step 5: transform back into an expectation. Observe the result from Step 4:

$$\int P(\tau|\pi_\theta) \left[ \nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau) \right] d\tau$$

This matches the definition of an expectation again: $E[f(\tau)] = \int P(\tau|\pi_\theta) f(\tau)\, d\tau$, where $f(\tau)$ corresponds to everything inside the brackets, $\nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau)$. So the whole integral can be written back as an expectation:

$$= E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau) \right]$$

This is significant: we have converted the gradient of an expectation, $\nabla_\theta E[\cdot]$, into the expectation of a certain quantity (a gradient times a return), $E[\nabla(\cdot) \times R]$. This form is very important because it can be approximated by sampling. We don't need to compute the integral over all trajectories; we just sample many trajectories $\tau$, compute the bracketed quantity $\nabla_\theta \log P(\tau|\pi_\theta)\, R(\tau)$ for each, and average them to get an approximation of the gradient.


Step 6: an expression for the grad-log-prob term. Now we handle the term $\nabla_\theta \log P(\tau|\pi_\theta)$ inside the expectation. Recall the trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ (with $T+1$ states and $T+1$ actions). The probability of a trajectory occurring is:

$$P(\tau|\pi_\theta) = p_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_\theta(a_t|s_t)$$

  • $p_0(s_0)$: probability of the initial state $s_0$.

  • $P(s_{t+1}|s_t, a_t)$: the environment dynamics, i.e. the probability of transitioning to state $s_{t+1}$ after executing action $a_t$ in state $s_t$.

  • $\pi_\theta(a_t|s_t)$: the policy, i.e. the probability of selecting action $a_t$ in state $s_t$ (this is the part that depends on $\theta$).

  • Math review (logarithm properties): $\log(a \times b) = \log a + \log b$ and $\log(\prod_i x_i) = \sum_i \log x_i$. Taking the logarithm of $P(\tau|\pi_\theta)$:

$$\log P(\tau|\pi_\theta) = \log p_0(s_0) + \sum_{t=0}^{T} \left[ \log P(s_{t+1}|s_t, a_t) + \log \pi_\theta(a_t|s_t) \right]$$

Now take the gradient $\nabla_\theta$ of the above with respect to $\theta$:

  • Math review (gradient properties): the sum rule $\nabla(f+g) = \nabla f + \nabla g$; the gradient $\nabla_\theta$ only affects terms that depend on $\theta$.

  • $\nabla_\theta \log P(\tau|\pi_\theta) = \nabla_\theta \log p_0(s_0) + \sum_{t=0}^{T} \left[ \nabla_\theta \log P(s_{t+1}|s_t, a_t) + \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$

  • Key points:

    • The initial-state probability $\log p_0(s_0)$ typically does not depend on the policy parameters $\theta$, so $\nabla_\theta \log p_0(s_0) = 0$.
    • The environment dynamics $\log P(s_{t+1}|s_t, a_t)$ describe properties of the environment itself and also do not depend on $\theta$, so $\nabla_\theta \log P(s_{t+1}|s_t, a_t) = 0$.
    • Only the policy term $\log \pi_\theta(a_t|s_t)$ depends on $\theta$. So the expression simplifies to:

$$\nabla_\theta \log P(\tau|\pi_\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

The gradient of the log-probability of the entire trajectory equals the sum of the log-probability gradients of each action step in that trajectory. This greatly simplifies the calculation.


Step 7: the final Policy Gradient Theorem. Substitute the simplified result from Step 6 back into the expectation formula from Step 5:

$$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) R(\tau) \right]$$

This is the final form (or one common form) of the Policy Gradient Theorem. The gradient of the expected return $J(\pi_\theta)$ with respect to the parameters $\theta$ equals the expectation, over all possible trajectories, of "sample a trajectory $\tau$, compute its total return $R(\tau)$, and multiply it by the sum of the policy log-probability gradients $\nabla_\theta \log \pi_\theta(a_t|s_t)$ over all (state, action) pairs in that trajectory".

Obviously, obtaining all trajectories is extremely costly (imagine sampling every possible generation of max_token_length=100), so we use the sample mean to approximate the expectation:

[!note] Monte Carlo Approximation:
  • Run the current policy $\pi_\theta$ and collect $N$ trajectories, forming a dataset $D = \{\tau_1, \dots, \tau_N\}$ (with $N = |D|$).
  • Approximate the expectation with the average over these samples:
  • $\nabla_\theta J(\pi_\theta) \approx \hat{g} = \frac{1}{|D|} \sum_{\tau \in D} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) R(\tau)$
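In code, this Monte Carlo estimate is what the classic REINFORCE surrogate loss produces; a sketch (illustrative only, not the course's code) where log_probs holds $\log \pi_\theta(a_t|s_t)$ for one sampled trajectory:

```python
def reinforce_loss(log_probs, trajectory_return):
    """Negative of (sum_t log pi_theta(a_t|s_t)) * R(tau): minimizing this
    with gradient descent performs gradient ascent on J(pi_theta)."""
    return -(log_probs.sum() * trajectory_return)

# Averaging this loss over N sampled trajectories and calling .backward()
# yields (minus) the Monte Carlo estimate g_hat of the policy gradient.
```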

Application to LM Policy

Through the generation process shown in the figure, we obtain the log-probability of each state-action pair in the sampled trajectory, and we can now backpropagate to compute the gradient.

(Figure: collecting per-token log-probabilities during generation.)

Then each log-probability gradient is multiplied by the reward from the Reward Model and plugged into the expression above to run gradient-ascent optimization:

(Figure: policy-gradient update weighted by the reward from the Reward Model.)

High Variance

PG algorithms work well for small problems but run into issues when applied to language modeling.

[!note] The central limit theorem tells us that, as long as the sample is large enough, the sample mean is approximately normally distributed, which lets us predict and analyze the data. When the sample size is small, the sample mean fluctuates a lot: even though the mean tends toward a normal distribution, the result of a single sampling run may vary greatly. And since sampling many trajectories from an LM is very expensive, the estimator suffers from a high-variance problem.

How to reduce variance without increasing sample size?

  1. Remove historical rewards (reward-to-go). First, we must admit that the current action cannot affect rewards already received in the past, and those past rewards only add unnecessary noise (this is related to the credit-assignment problem in RL). Removing the past terms therefore avoids extra noise and brings the estimated gradient closer to the true gradient. So instead of weighting by the whole trajectory return, we only consider the rewards from the current time step onward.

(Figure: replacing the full-trajectory return with the reward-to-go from the current time step.)

  2. Introduce a baseline. RL research has confirmed that subtracting a state-dependent term (such as a function of the state, or a constant) reduces variance. Here we choose the value function $V^\pi(s)$.

Value Function

$V^\pi(s)$ tells you the expected return of the remaining trajectory when following the current policy from state $s$.

Examples of value definitions in classic RL scenarios and LM scenarios:

(Figure: value definitions in a classic RL scenario vs. an LM scenario.)

In practice, we initialize from the LM we are trying to optimize and add a linear layer on top of it to estimate the value, so that the Transformer layers' parameters are shared between language modeling (via the layer projecting hidden states to the vocabulary) and value estimation.

(Figure: the LM backbone with an extra linear value head.)

The reward-to-go mentioned earlier is called the Q function in RL: the expected return of starting from the current state, taking this action, receiving the immediate reward, and then following the policy for the remaining actions:

(Figure: definition of the Q function.)

Then, by introducing the value function, we take the difference between Q and V; this difference is called the advantage function.

(Figure: definition of the advantage function.)

This advantage term $A^\pi(s_t, a_t)$ represents how much better this specific action is relative to the average action that could be taken in state $s_t$.

(Figure: a grid-world example where one action has a clearly higher advantage than the others.)

In the figure, the advantage function for moving downward from the state pointed by the red arrow will be higher than the advantage functions of other actions.

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$

After multiplying the gradient by the advantage function, the effect is to increase the log-prob of actions with high advantage and decrease the log-prob of actions that bring below-average return.

$$\begin{aligned} \hat{A}^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) \\ &= \left[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \right] - V^\pi(s_t) \end{aligned}$$

[!note] In traditional reinforcement-learning methods, the Q network and the V network are usually separate: the Q function estimates the total expected return of executing action $a$ in state $s$, while the V function only estimates the value of state $s$. This would require two different neural networks to compute the two values.

However, now that we have introduced the advantage function $A_\theta(s, a)$, computed as the difference between the Q value and the V value:

$$A_\theta(s, a) = Q_\theta(s, a) - V_\theta(s)$$

we find that we only need to train one network that outputs $V_\theta(s)$; the Q value can then be computed from the reward $r_t$ and the discount factor $\gamma$.

Therefore only one neural network is needed, primarily to predict $V_\theta(s)$. The Q value is obtained from:

$$Q_\theta(s_t, a) = r_t + \gamma \cdot V_\theta(s_{t+1})$$

and the advantage function is then:

$$A_\theta(s_t, a) = r_t + \gamma \cdot V_\theta(s_{t+1}) - V_\theta(s_t)$$

Advantage Sampling

Short-horizon advantage estimators have large bias but small variance, while long-horizon advantage estimators have small bias but large variance. This trade-off is a part of reinforcement learning that needs careful selection and tuning, depending on the required model stability and training efficiency.

(Figure: the bias-variance trade-off across advantage estimators with different horizon lengths.)

An analogy: "a person with short-term memory only remembers what happened yesterday; it is not comprehensive, but it is stable. A person with long-term memory can see the full picture of the next few days, but may be disturbed by more unknown factors."

GAE

To solve this bias-variance problem, GAE (Generalized Advantage Estimation) can be used, which is essentially a weighted sum of all advantage terms, with each term multiplied by a decay factor.

[!note] Now let's talk about the TD error. Online learning has one beauty: you don't need to wait until the end of an episode to update the policy. This is where the Temporal Difference error (TD error) comes in:

$$\delta = r + \gamma V(s') - V(s)$$

The key point: the TD error is actually an online estimate of the advantage function. It tells you whether your action at this moment makes the future state better than you expected. The error $\delta$ directly reflects the notion of advantage:

  • If $\delta > 0$: "Hey, this action is better than I thought!" (positive advantage).
  • If $\delta < 0$: "Well, I thought it would be better…" (negative advantage).

This lets you adjust your policy step by step without waiting for a whole episode to end before making changes, which is simply an excellent way to improve efficiency.

The purpose of GAE is to provide an estimate $\hat{A}_t$ of the advantage function $A^\pi(s, a)$ that is better, for policy-gradient algorithms, than the raw return $R(\tau)$ or the simple TD error $\delta_t$, reducing the variance of the gradient estimate and improving learning stability and efficiency.

$$\begin{aligned} \delta_t &= r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \\ \hat{A}_t &= \delta_t + \gamma \lambda \hat{A}_{t+1} \end{aligned}$$

This formula recursively defines the generalized advantage estimate $\hat{A}_t$. It doesn't just look at the one-step TD error $\delta_t$; it combines TD-error information from multiple future steps.

The recursion is computed backwards from the end of the trajectory (episode), assuming $T$ is the last step and $\hat{A}_{T+1} = 0$:

  • $\hat{A}_T = \delta_T + 0 = \delta_T$
  • $\hat{A}_{T-1} = \delta_{T-1} + \gamma\lambda \hat{A}_T = \delta_{T-1} + (\gamma\lambda)\delta_T$
  • $\hat{A}_{T-2} = \delta_{T-2} + \gamma\lambda \hat{A}_{T-1} = \delta_{T-2} + (\gamma\lambda)\delta_{T-1} + (\gamma\lambda)^2 \delta_T$
  • …
  • General form: $\hat{A}_t = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}$ (assuming an infinite horizon, or $\delta = 0$ after the terminal state).

The parameter $\lambda$ ($0 \le \lambda \le 1$) is the key to GAE, controlling the bias and variance of the estimate $\hat{A}_t$:

  • When $\lambda = 0$:
    • $\hat{A}_t = \delta_t$: GAE degenerates into the simple one-step TD error. This estimate has lower variance (it only depends on the next step) but may have higher bias (it relies heavily on the possibly inaccurate estimate $V^\pi(s_{t+1})$).
  • When $\lambda = 1$:
    • $\hat{A}_t = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k}$: it can be shown that this is equivalent to $\hat{A}_t = \left( \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right) - V^\pi(s_t)$, i.e. the Monte Carlo return minus a baseline. This estimate has lower bias (it uses the complete actual return from time $t$ onward), but its variance is usually very high (it accumulates randomness from many time steps).
  • When $0 < \lambda < 1$:
    • GAE interpolates between the two extremes. The closer $\lambda$ is to 0, the more it leans toward the low-variance, high-bias TD estimate; the closer $\lambda$ is to 1, the more it leans toward the high-variance, low-bias MC estimate.
    • By choosing an appropriate $\lambda$ (e.g. 0.97), GAE tries to strike a good balance between bias and variance, yielding an advantage estimate that is reasonably accurate (controlled bias) and reasonably stable (small variance).
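The backward recursion above is short enough to write out directly; a minimal sketch (illustrative, not the TRL implementation), assuming values holds $V(s_0), \dots, V(s_T)$ plus a bootstrap value for $s_{T+1}$:

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: tensor [T]; values: tensor [T+1] including the bootstrap V(s_{T+1}).
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = delta_t + gamma * lam * A_{t+1}."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    return advantages
```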

Advantage in Language Models

As shown in the figure, the goal is to increase the logprob of token “Shanghai” in the current state and decrease the logprob of “chocolate”, because the advantage of choosing “Shanghai” is higher than the advantage of choosing “chocolate” (gibberish token).

(Figure: increasing the log-prob of a high-advantage token ("Shanghai") and decreasing that of a low-advantage token ("chocolate").)

Importance Sampling and Offline Learning

In many cases, we may want to compute $E_{x \sim p(x)}[f(x)]$, but:

  1. it is difficult or impossible to sample $x$ directly from the target distribution $p(x)$, or
  2. sampling from $p(x)$ is inefficient. This is exactly the problem with language models, where LM sampling is too expensive.

However, we might be able to easily sample from another, alternative Proposal distribution $q(x)$.

Importance Sampling (IS) is a technique for estimating expectations under a target distribution by sampling from a different distribution: it converts an expectation $E_{x \sim p(x)}[f(x)]$ computed under the probability distribution $p(x)$ into the expectation of a related function, $E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]$, computed under a different distribution $q(x)$.

$$\begin{aligned} E_{x \sim p(x)}[f(x)] &= \int p(x) f(x)\, dx \\ &= \int \frac{q(x)}{q(x)} p(x) f(x)\, dx \quad \text{(assuming } q(x) \neq 0\text{)} \\ &= \int q(x) \frac{p(x)}{q(x)} f(x)\, dx \\ &= E_{x \sim q(x)}\left[ \frac{p(x)}{q(x)} f(x) \right] \end{aligned}$$

The key here is the importance weight $w(x) = \frac{p(x)}{q(x)}$. Its role is bias correction: for a sample $x_i$ drawn from $q(x)$, if it is more likely under the target distribution $p(x)$ (i.e. $p(x_i) > q(x_i)$), give it a weight greater than 1; conversely, if it is less likely under $p(x)$ (i.e. $p(x_i) < q(x_i)$), give it a weight smaller than 1. Averaging with these weights yields a (usually unbiased or consistent) estimate of the original expectation $E_{x \sim p(x)}[f(x)]$.

Importance sampling allows us to:

  1. Draw samples $x_1, x_2, \dots, x_N$ from an easy-to-sample distribution $q(x)$.
  2. Estimate the original expectation via a weighted average:
    $E_{x \sim p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i)$
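As a toy sanity check (not from the course): estimate $E_{x \sim p}[x^2]$ for a standard normal $p$ using samples drawn only from a wider normal $q$:

```python
import torch

torch.manual_seed(0)
p = torch.distributions.Normal(0.0, 1.0)   # target distribution p(x)
q = torch.distributions.Normal(0.0, 2.0)   # proposal distribution q(x), easy to sample from

x = q.sample((100_000,))                   # samples come from q, not p
w = (p.log_prob(x) - q.log_prob(x)).exp()  # importance weights p(x) / q(x)
estimate = (w * x**2).mean()               # approximates E_{x~p}[x^2] = 1
print(estimate)                            # close to 1.0
```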

Back to our scenario. Previously we obtained the on-policy policy-gradient estimate:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$

[!info] Meaning of on-policy: the policy used to collect data and the policy being trained are the same, since the computation requires trajectories sampled from the current policy $\pi_\theta$. This means that after every policy update the old data can no longer be used directly, which gives low sample efficiency. As for whether the data used for later updates is still on-policy when mini_batch_num > 1: strictly speaking it isn't, so it could also be understood as "semi-on-policy" (an informal, not necessarily rigorous way of putting it).

On-policy also emphasizes whether it is the current policy model that interacts with the environment. [2]

We hope to use data generated in the past by an old policy $\pi_{\theta_{OFFLINE}}$ (such data may exist in large quantities) to estimate the gradient of the current new policy $\pi_{\theta_{ONLINE}}$. This allows data to be reused and improves sample efficiency.

Recall the importance-sampling principle: $E_{x \sim p(x)}[f(x)] = E_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right]$. Mapping it onto our policy gradient (simplified to a single-step decision):

  • The target distribution $p(x)$ corresponds to the new policy $\pi_{\theta_{ONLINE}}(a|s)$.
  • The sampling distribution $q(x)$ corresponds to the old policy $\pi_{\theta_{OFFLINE}}(a|s)$.
  • The importance weight is $w_t = \frac{\pi_{\theta_{ONLINE}}(a_t|s_t)}{\pi_{\theta_{OFFLINE}}(a_t|s_t)}$.

Applying the importance weight to each term (each time step $t$) of the on-policy gradient yields the standard off-policy estimate:

$$\nabla_{\theta_{ONLINE}} J(\theta_{ONLINE}, \theta_{OFFLINE}) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \left[ \frac{\pi_{\theta_{ONLINE}}(a_{i,t}|s_{i,t})}{\pi_{\theta_{OFFLINE}}(a_{i,t}|s_{i,t})}\, \nabla_{\theta_{ONLINE}} \log \pi_{\theta_{ONLINE}}(a_{i,t}|s_{i,t})\, A^\pi(s_{i,t}, a_{i,t}) \right]$$

We now have a way to perform a full gradient-ascent optimization without sampling from the policy we are optimizing (the model being trained) at every step: sample once, store the trajectories in memory or a database, optimize the policy with mini-batches, and then re-initialize the offline (sampling) policy with the new policy.

PPO Loss

The PPO loss mainly consists of three parts: the policy loss ($L_{POLICY}$), the value-function loss ($L_{VF}$), and the entropy bonus ($L_{ENTROPY}$).

1. Policy Loss ($L_{POLICY}$)

Clipped Surrogate Objective:

$$L_{POLICY} = \min\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t,\ \text{clip}\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right)$$

This is the core of PPO. You will notice that it looks a lot like the off-policy policy-gradient objective we derived using importance sampling, but with one key modification.

  • $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$: this is the importance-sampling ratio; call it $r_t(\theta)$. It is the probability of taking action $a_t$ in state $s_t$ under the current (online) policy $\pi_\theta$, divided by the probability of taking that action under the old (offline) policy $\pi_{\theta_{old}}$ that was used to collect the trajectory data. This ratio corrects for the fact that the data comes from a policy slightly different from the one we are currently trying to improve.

  • $\hat{A}_t$: the advantage estimate, computed using GAE, which helps balance bias and variance. It tells us how much better or worse taking action $a_t$ in state $s_t$ is compared to the average action in that state (as judged by the current value function).

  • clip function: this is where the key point of PPO lies. $\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)$ basically says: if the probability ratio $r_t(\theta)$ deviates too far from 1 (too high or too low), we "clip" it. So if $r_t(\theta)$ tries to become 1.5 and $\epsilon$ is 0.2, it is clipped to 1.2; if it tries to become 0.5, it is clipped to 0.8. The parameter $\epsilon$ (epsilon) is a small hyperparameter (e.g. 0.1 or 0.2) that defines the clipping range $[1-\epsilon, 1+\epsilon]$.

  • min function: the objective takes the smaller of the following two terms:

    1. Unclipped objective: $r_t(\theta)\hat{A}_t$

    2. Clipped objective: $\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t$

      Why do this? The policy gradient wants to increase the probability of actions with positive advantage and decrease the probability of actions with negative advantage. But with importance sampling, if $r_t(\theta)$ becomes very large, updates can be huge and unstable. PPO keeps the new policy close to the old one by clipping this ratio. A sketch of the resulting loss in code is shown after this list.

    • If $\hat{A}_t > 0$ (good action): we want to increase $\pi_\theta(a_t|s_t)$. The min means that once $r_t(\theta)$ grows beyond $1+\epsilon$, the objective is capped at $(1+\epsilon)\hat{A}_t$. This prevents the policy from changing too much in a single update, even if the unclipped objective would suggest a larger increase.
    • If $\hat{A}_t < 0$ (bad action): we want to decrease $\pi_\theta(a_t|s_t)$. If $r_t(\theta)$ shrinks below $1-\epsilon$, the clipped term $(1-\epsilon)\hat{A}_t$ is the smaller one and it is constant in $\theta$, so the gradient vanishes: once the probability of a bad action has already been lowered enough, we stop pushing it down further. If instead $r_t(\theta)$ grows above $1+\epsilon$ (the new policy makes a bad action more likely), the min selects the unclipped, more negative term $r_t(\theta)\hat{A}_t$, keeping the objective a pessimistic lower bound that strongly penalizes that move.
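Here is a minimal PyTorch sketch of the clipped surrogate objective, written per token and negated into a loss to minimize; logprob, logprob_old, and advantages are assumed to be precomputed tensors of the same shape:

```python
import torch

def ppo_policy_loss(logprob, logprob_old, advantages, eps=0.2):
    ratio = torch.exp(logprob - logprob_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # maximize objective = minimize its negative
```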

2. Value Function Loss ($L_{VF}$)

$$L_{VF} = \frac{1}{2} \left\| V_\theta(s) - \left( \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\Big|\, s_0 = s \right) \right\|_2^2$$

This is exactly the same as before:

  • $V_\theta(s)$ is the output of the value network (i.e. a linear layer added on top of the LLM to predict the expected cumulative reward starting from state $s$).
  • The term $\sum \gamma^{t'} r_{t'}$ (call it $G_s$, the target value) is the actual sum of discounted rewards observed starting from state $s$ and following the current policy until the end of the episode. This is the empirical target we set for $V_\theta(s)$.
  • This loss is the mean squared error (MSE) between the predicted value $V_\theta(s)$ and the observed target value $G_s$. We want the value function to predict future rewards accurately; this value function is crucial for computing the advantage $\hat{A}_t$.

3. Entropy Bonus ($L_{ENTROPY}$)

$$L_{ENTROPY} = -\sum_x p(x) \log p(x)$$
  • Here $p(x)$ (more precisely $\pi_\theta(a|s)$, over all possible actions $a$ given the state $s$) is the action probability distribution output by the current policy in the given state.
  • The term $\sum_x p(x) \log p(x)$ is (the negative of) the entropy of this distribution. Entropy measures the randomness or uncertainty of a distribution: a uniform (very random) distribution has high entropy, while a peaked distribution (very certain about one action) has low entropy.
  • In the total loss $L_{PPO}$, this term is weighted (with $c_2$ positive) so that minimizing the loss effectively maximizes the entropy of the policy, i.e. low entropy is penalized.

Encouraging higher entropy promotes exploration: the policy stays a bit more random and tries different actions (different tokens, in the LLM case) instead of converging too quickly to a possibly suboptimal deterministic policy. This helps the agent discover better strategies.

Final Form: $L_{PPO}$

The final PPO loss is the weighted sum of these three parts:

$$L_{PPO} = L_{POLICY} + c_1 L_{VF} + c_2 L_{ENTROPY}$$
  • $c_1 L_{VF}$: the value-function loss, weighted by $c_1$. A common value for $c_1$ is around 0.5.
  • $c_2 L_{ENTROPY}$: the entropy bonus (for $c_2 > 0$, effectively a penalty on low entropy), weighted by $c_2$. $c_2$ is usually a small positive constant (e.g. 0.01), enough to encourage exploration without overwhelming the main policy objective.

The agent's parameters (i.e. the LLM weights) are updated by computing the gradient of this combined loss $L_{PPO}$ and performing gradient descent.
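Putting the three terms together, a sketch only (sign conventions vary between implementations; here everything is written as a loss to minimize, and the inputs are assumed to be precomputed tensors):

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(logprob, logprob_old, advantages, values, returns, logits,
                   c1=0.5, c2=0.01, eps=0.2):
    # Policy term: clipped surrogate objective, negated so minimizing it maximizes the objective.
    ratio = torch.exp(logprob - logprob_old)
    policy_loss = -torch.min(ratio * advantages,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    # Value term: MSE between predicted values V_theta(s) and empirical returns.
    value_loss = 0.5 * F.mse_loss(values, returns)
    # Entropy term: negated entropy, so minimizing it keeps the policy's entropy high.
    entropy_loss = -torch.distributions.Categorical(logits=logits).entropy().mean()
    return policy_loss + c1 * value_loss + c2 * entropy_loss
```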

Reference Model

Reward Hacking

A major problem in RL is reward hacking: the model may learn to always output tokens or sequences that earn a good reward but make no sense to humans, such as saying "thank you" ten times in a row to boost a politeness score. We therefore want the aligned model's output (after RL post-training) to stay as close as possible to the original model's output.

Therefore there is another model with frozen weights (the reference model). When the model we are optimizing receives a reward from the reward model at each step of each trajectory, a KL-divergence penalty between the log-probs of the reference model and the optimized model is subtracted from that reward, preventing the model from generating answers that drift too far from the original model and thus preventing the cheating phenomenon described above.
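A per-token sketch of this penalty (illustrative only; real implementations differ in details such as the KL estimator used): the per-token reward is minus beta times the log-prob gap to the reference model, with the reward-model score added on the final token:

```python
def penalized_rewards(reward_score, actor_logprobs, ref_logprobs, beta=0.1):
    """reward_score: scalar reward from the RM for the whole response.
    actor_logprobs, ref_logprobs: per-token log-prob tensors [T] of the generated response."""
    kl = actor_logprobs - ref_logprobs   # per-token log-ratio, a simple KL estimate
    rewards = -beta * kl                 # penalize drifting away from the reference model
    rewards[-1] += reward_score          # the sequence-level reward arrives at the final token
    return rewards
```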

(Figure: per-token rewards with the KL penalty against the frozen reference model.)

Code walk through

trl

class AutoModelForCausalLMWithValueHead(PreTrainedModelWrapper):
    # ... (class attributes like transformers_parent_class) ...

The core purpose of this class is to bundle a standard causal language model (our Actor model, the policy $\pi_\theta$ responsible for generating text) with a Value Head (our Critic model, responsible for estimating the state value $V(s)$). In PPO / actor-critic algorithms we need both the policy and the value function, and this class provides a unified model structure that outputs both simultaneously.

    def __init__(self, pretrained_model, **kwargs):
        super().__init__(pretrained_model, **kwargs) # Basic settings
        v_head_kwargs, _, _ = self._split_kwargs(kwargs) # Separate args for ValueHead
 
        # Ensure input uses a model with language model output capability
        if not any(hasattr(self.pretrained_model, attribute) for attribute in self.lm_head_namings):
            raise ValueError("The model does not have a language model head...")
 
        # Create ValueHead instance, which will learn to predict state value V(s)
        self.v_head = ValueHead(self.pretrained_model.config, **v_head_kwargs)
 
        # Initialize ValueHead weights
        self._init_weights(**v_head_kwargs) # Default random init, can also specify normal distribution init
  1. Actor: this is our language model pretrained_model, which generates responses (action $a$, i.e. a sequence of tokens) based on the current prompt (state $s$).
  2. Critic: evaluates how good the Actor's situation is in a given state $s$, i.e. outputs $V(s)$. This is the job of the linear layer self.v_head.
    def forward(
        self,
        input_ids=None, # Input token IDs (state s)
        attention_mask=None,
        past_key_values=None, # For speeding up generation
        **kwargs,
    ):
        # Force underlying model to output hidden_states, ValueHead needs them as input
        kwargs["output_hidden_states"] = True
        # ... (some details of processing past_key_values and PEFT, can be ignored in core PPO understanding)
 
        # 1. Actor (Base Language Model) calculation
        base_model_output = self.pretrained_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **kwargs,
        )
 
        # 2. Extract Actor output (for policy update) and Critic input
        lm_logits = base_model_output.logits # Actor output: probability distribution predicting next token
        # This is the basis for calculating L_POLICY and L_ENTROPY in PPO
 
        last_hidden_state = base_model_output.hidden_states[-1] # Critic input: hidden state of last layer of LM,
        # Represents the representation of current state s
 
        # (Optional) Language model's own loss, usually not directly used in RL stage
        loss = base_model_output.loss
 
        # (Ensure data and model are on same device)
        if last_hidden_state.device != self.v_head.summary.weight.device:
            last_hidden_state = last_hidden_state.to(self.v_head.summary.weight.device)
 
        # 3. Critic (ValueHead) calculation
        # ValueHead receives state representation, outputs value estimation V(s) for that state
        value = self.v_head(last_hidden_state).squeeze(-1) # This is the basis for calculating value loss L_VF and advantage A_hat in PPO
 
        # (Ensure logits are float32 for numerical stability)
        if lm_logits.dtype != torch.float32:
            lm_logits = lm_logits.float()
 
        # Return Actor's logits, LM loss (may be None), and Critic's value
        return (lm_logits, loss, value)

For every step of PPO-RLHF training:

  1. We input a batch of current prompts (sequence input_ids) into the model.
  2. self.pretrained_model (the Actor) computes lm_logits during the rollout. These logits represent the probability distribution over which token the model thinks should be generated next given the current prompt. PPO's policy loss $L_{POLICY}$ and entropy bonus $L_{ENTROPY}$ are both computed from this distribution $\pi_\theta(a_t|s_t)$.
  3. At the same time, we take last_hidden_state from base_model_output. This can be seen as a vector representation of the current prompt (state $s$).
  4. This last_hidden_state is fed into self.v_head (the Critic), which outputs a scalar value. This value is the model's value estimate $V_\theta(s)$ for the current state $s$. PPO's value-function loss $L_{VF}$ optimizes this $V_\theta(s)$ to be as close as possible to the true return, and $V_\theta(s)$ is also a key ingredient for computing the advantage $\hat{A}_t$, which in turn guides the calculation of $L_{POLICY}$.
  5. The same prompt + response sequence is fed to the Reward and Reference models for inference, to obtain the reward and the log-probs (used for the KL penalty).
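Schematically, one PPO-RLHF iteration then looks like the pseudocode below (placeholder helpers such as generate, penalize_with_kl, and compute_gae_and_returns, not a real framework API):

```python
# One PPO-RLHF iteration, schematically (placeholder helpers, not TRL's API).
for prompts in dataloader:
    # 1) Rollout: the Actor autoregressively generates responses (prefill + decode).
    responses = generate(actor_critic, prompts)
    sequences = concat(prompts, responses)

    # 2) One forward pass yields policy logits and values for the full sequences.
    logits, _, values = actor_critic(sequences)
    old_logprobs = logprobs_of(logits, responses)

    # 3) Frozen models score the same sequences (prefill only).
    scores = reward_model(sequences)                              # one scalar per sequence
    ref_logprobs = logprobs_of(ref_model(sequences), responses)

    # 4) KL-penalized per-token rewards, then GAE advantages and returns.
    token_rewards = penalize_with_kl(scores, old_logprobs, ref_logprobs)
    advantages, returns = compute_gae_and_returns(token_rewards, values)

    # 5) Several mini-batch epochs of PPO updates on the stored rollout.
    for _ in range(ppo_epochs):
        loss = ppo_total_loss(...)   # policy + value + entropy terms
        loss.backward(); optimizer.step(); optimizer.zero_grad()
```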

So with one forward call, we simultaneously obtain core information needed to update Actor (Strategy) and Critic (Value Function). The training flow can be understood with the help of the following diagram:

(Figure: the RLHF training pipeline.)

[!tip] In RLHF, only the Actor needs prefill + decode (full autoregressive generation) during experience collection (rollout); the other models only process already-generated responses to obtain log-probs, values, etc., so they only do prefill.

In addition, the Actor is involved in both training and inference (i.e. rollout), so it needs a training engine (such as Megatron, DeepSpeed, or FSDP) plus a rollout engine (such as SGLang or vLLM) to handle the two tasks; the Critic reuses the internal representations from the training forward pass for its value predictions at inference time, so it runs in the same training engine; and the Reference and Reward models only need an inference engine to produce log-probs and rewards. [3]

verl

verl, along with OpenRLHF and others, is an excellent RLHF framework; a good introductory guide is [AI Infra] VeRL Framework Introduction & Code Walkthrough [2].

Reference

  • [1] Umar Jamil: Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code. - YouTube
  • [2] [AI Infra] VeRL Framework Introduction & Code Walkthrough
  • [3] Chayenne Zhao: HybridFlow / veRL Paper Analysis

For commercial reuse, contact the site owner for authorization. For non-commercial use, please credit the source and link to this article.

You may copy, distribute, and adapt this work as long as derivatives share the same license. Licensed under CC BY-NC-SA 4.0.
