Does each policy gradient update only contain O(1) bits of information about the model we are trying to train while each SFT update has O(T) where T is the number of tokens in the example?
I think the answer is not a clean binary. Let’s examine this in a simple setup.
Let’s consider a simple language model. It is one that could only be prompted with "Tell me a joke" (and won't work with anything else), and it can generate exactly N jokes, each exactly T tokens, and each with probability 1/N. We will see that T appears in none of our arguments.
Let’s assume that we have a binary reward that determines whether the joke is desirable (say safe and good). And let’s assume that each joke is good with probability α, and bad with probability (1 − α).
So, the ideal solution is clear. Filter out the bad jokes and generate each of the K good jokes with probability 1/K each. Here K is a random variable whose mean is αN. This is what we’d get from solving population level RL with no KL regularization. Let’s call this solution p*.
Now, let’s consider a policy gradient update on trajectory τ. Before doing that, we need to define a model family p_θ; and for that let’s just define a simple model family where we have θ is an element of the N-dimensional simplex, i.e., θ_i ∈ [0, 1], and ∑_i θ_i = 1. Then, p_θ(τ_i) = θ_i, where τ_i is the i’th joke (or trajectory).
With this notation let τ be a random trajectory drawn from p_θ. We have G = S · Adv, where S = ∇ log p_θ(τ).
Let’s consider the very first step of training where θ = (1/N, …, 1/N) and hence S is independent of p*, and history = []. Thus,
I(G; p* | history) ≤ I(Adv; p* | S) ≤ h(α).
This bound is intended only for the very first step of training when θ is uniform and S is independent of p*. After the first step, S can depend on history, so this simple bound need not apply as written but overall information is probably still upper bounded by 1 bit.
Let’s also consider supervised learning in this case, and consider training on data that keeps getting randomly generated through positive examples. The end state of training is also going to be the same p* in this case but let’s see how much mutual information does one example contain about p*?
The update is going to be G = ∇ log p_θ(τ) where τ is a positive example.
I(G; p* | history) = I(τ; p* | history) ≤ log(1/α).
In the RL case, each example resolves at most 1 bit of information about the model by determining whether or not a random rollout was good. In the SFT case, the information depends on how rare such an outcome is. If α > 1/2, then SFT also provides at most 1 bit of information about the ideal model. But if this is a rare outcome, then SFT provides significantly more information about the outcome.
Think about it this way, if there is only one good joke out of the millions that the model may generate, then RL needs millions of rollouts to stumble upon that good joke, whereas if SFT points directly to that joke, then we can just learn it and move on. But if there is just one bad joke in the millions of jokes that the model can generate, neither SFT nor RL is sample efficient because we need to see millions of rollout to determine which joke was bad and filter it out.
Notes and clarifications.
• T does not appear in these arguments. The contrast here is not O(1) vs O(T) in this toy setup. It is O(1) for a binary sequence-level reward in the first PG step, versus up to log(1/α) for a positive SFT example.
• Data processing. You can view p* as filtering each outcome independently with probability (1 − α), so any K-sized good set where E{K} = αN is possible. Under this process the log(1/α) bound follows from how a positive example shrinks the consistent good-set space.
• Richer rewards. The ≤ 1 bit conclusion for RL depends on the reward being binary at the sequence level. If the reward is real-valued and informative, a single update can carry more than 1 bit. The discussion here is about the binary case. Cases like GRPO/RLOO where the reward is calibrated against several rollouts potentially contain more information.
• Exploration. The “rare good joke” gap assumes naive on-policy sampling. Guided exploration or data selection (for example using a reward model or replay that favors promising trajectories or some other form of supervision) can reduce the number of rollouts needed to find rare good outcomes.
• KL regularization. In practice RLHF is usually KL-regularized around a reference model. The KL shapes token-level behavior and can significantly change these information-theoretic arguments.
• In practice, it is not uncommon to do poor man's RL through SFT by rolling out the model many times and keep only the examples that are good and filter out the bad ones and do SFT on the good examples. If we do that, then we will end up needing 1/α rollouts per each SFT example so even though we get log(1/α) bits of information per SFT example update, we will still get O(1) information per original model rollout.
Happy to hear your thoughts on these arguments!