PPO vs GRPO vs REINFORCE – a workflow breakdown of the most talked-about reinforcement learning algorithms
➡️ Proximal Policy Optimization (PPO): The Stable Learner
PPO is widely used, from dialogue agents to instruction tuning, because it balances learning fast against staying stable.
▪️ How PPO works step by step:
1. A query is fed into the Policy Model (which is trainable), and it produces an output.
2. That output gets sent to 2 frozen models for scoring:
🔹 The Reference Model (a frozen copy of the original policy) is the yardstick for a KL divergence term that measures how far the new output strays from the original behavior.
🔹 The Reward Model gives the output a score r, evaluating its helpfulness, coherence, or alignment.
3. The critic’s take:
🔹 The Value Model (also trainable) tries to predict how good that output should have been, producing v, the expected reward.
4. Calculating advantage:
PPO uses Generalized Advantage Estimation (GAE) to figure out the advantage, meaning how much better or worse the action was compared to expectations.
5. Gentle updates only:
This is where PPO earns its name.
• It uses a clipped objective to prevent wild updates to the policy, limiting how much the new version can diverge from the old one.
• It may also watch the KL divergence to double-check the policy isn't drifting too far.
6. Joint optimization:
PPO updates the policy and the value function together, and sometimes adds an entropy bonus to keep the model exploring new ideas.
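
Steps 4 and 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the default values for gamma, lam and clip_eps, and the choice to treat the value after the last step as 0 are all assumptions made for the example.

```python
import math

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    rewards[t] is the reward at step t and values[t] is the critic's
    estimate v at step t. Assumption: the value after the last step is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate, averaged over steps and negated so it can be minimized."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a) / pi_old(a)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the smaller term removes any gain from pushing the ratio
        # outside the [1 - clip_eps, 1 + clip_eps] band.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)

# Toy numbers, purely illustrative.
adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.6, 0.7])
print(adv)
print(clipped_policy_loss([-0.9, -1.1, -0.4], [-1.0, -1.0, -1.0], adv))
```

In a real training loop the log-probabilities would be autodiff tensors, and this policy loss would be combined with a value loss and, optionally, an entropy bonus as described in step 6.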
✅ Why PPO is good:
▪️ Clipping and KL control keep each update small, so training stays stable.
▪️ The value function lowers the variance of the advantage estimates, which helps squeeze more learning from fewer samples.
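
One common way to wire in the KL control, sketched here as an assumption rather than something this post specifies: subtract a per-token KL penalty from the reward and add the Reward Model's score r on the final token. The function name, the kl_coef value and the log-ratio estimate of the KL term are all illustrative choices.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, score, kl_coef=0.1):
    """Per-token rewards: a KL penalty on every token, plus the score r at the end.

    policy_logprobs[t] and ref_logprobs[t] are the log-probabilities that the
    Policy Model and the Reference Model assign to the token sampled at step t.
    Assumes the output has at least one token.
    """
    rewards = [
        -kl_coef * (policy_lp - ref_lp)  # penalize drifting from the reference
        for policy_lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += score  # the Reward Model scores the whole output once
    return rewards

# Toy numbers, purely illustrative.
print(kl_shaped_rewards([-1.0, -2.0, -0.5], [-1.5, -2.0, -0.7], score=1.0))
```

These per-token rewards can then be fed into the advantage calculation in place of the raw score.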
---
➡️ Group Relative Policy Optimization (GRPO): Learning by Comparison
GRPO skips the value model, and is tailored for reasoning-heavy tasks where relative quality matters more than absolute scores.
▪️ GRPO in action:
1. The policy model takes a query and generates a group of answers, which gives us a playground for comparison.
2. Each answer gets scored:
🔹 The Reward Model evaluates all outputs with rewards r1, r2, ...
🔹 GRPO normalizes these scores by subtracting the group’s mean and dividing by the group’s standard deviation.
🔹 Now each output knows where it stands relative to its peers.
3. No critic model:
That relative score becomes the advantage. No need for a separate value model.
4. Smart advantage propagation:
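For chain-of-thought reasoning, GRPO can use a reward for each individual step instead of one score per answer. The step rewards are normalized across the group, and each token's advantage is the sum of the normalized rewards of its own step and every later step.
Early tokens therefore receive credit for everything that follows them, which guides the model onto a productive reasoning path.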
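
Steps 2 to 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names and the eps term are made up for the example, the population standard deviation is one possible choice, and the step-level variant assumes each step's advantage is the sum of normalized step rewards from that step onward.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """One score per answer: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def step_level_advantages(group_step_rewards, eps=1e-8):
    """Step-level variant: group_step_rewards[i][k] is the reward of step k in answer i.

    Rewards are normalized over every step in the group. Each step's advantage
    is the sum of normalized rewards from that step to the end of its answer,
    and every token inside a step would share that step's advantage.
    """
    flat = [r for steps in group_step_rewards for r in steps]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    advantages = []
    for steps in group_step_rewards:
        normalized = [(r - mean) / (std + eps) for r in steps]
        suffix_sums, running = [], 0.0
        for value in reversed(normalized):
            running += value
            suffix_sums.append(running)
        advantages.append(suffix_sums[::-1])
    return advantages

# Toy numbers, purely illustrative: four answers to the same query.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(step_level_advantages([[1.0, 1.0], [0.0, 0.0]]))
```

If every answer in a group gets the same score, all advantages come out as zero and that group contributes no learning signal.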
🔄 Iterative GRPO
In the iterative variant, the Reward Model is retrained on fresh outputs sampled from the policy, mixed with a small share of older data (~10%) to stabilize training and avoid forgetting. The Reference Model is then refreshed from the current policy to keep the KL penalty meaningful.
✅ Why GRPO can be a better choice:
• No value model = no extra weight
• Relative rewards = each answer is judged against other answers to the same query
• Well suited to tasks with multiple steps or structured thinking
• Dropping the value model frees memory for longer sequences and bigger batches
---
➡️ REINFORCE: The Monte Carlo Policy Gradient
REINFORCE is the “vanilla” policy gradient method: it updates the policy directly based on full-episode returns, without needing a value model.
▪️ How REINFORCE works:
1. The trainable policy model interacts with the environment, producing actions until an episode ends.
2. For each action taken at time t, REINFORCE computes the return G_t, the sum of discounted rewards from t to the end of the episode.
3. Policy update rule:
Each action is reinforced in proportion to its return: actions that led to high returns become more probable, and actions that led to low returns become less probable.
4. Optionally add a baseline:
To reduce variance, subtract a baseline, for example a value estimate or average reward.
5. Run new episodes with the updated policy, gather new returns, and update again.
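
Steps 2 to 4 can be sketched in plain Python. This is a minimal illustration: the function names, the gamma default and the constant baseline are assumptions for the example. In a real setup the log-probabilities would be autodiff tensors, so that minimizing this loss performs the policy update.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t for every step: the discounted sum of rewards from t to the episode end."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_loss(logprobs, returns, baseline=0.0):
    """Surrogate loss: each action's log-probability weighted by (G_t - baseline).

    Minimizing it raises the probability of actions whose return beat the baseline.
    """
    return -sum(lp * (g - baseline) for lp, g in zip(logprobs, returns))

# Toy episode, purely illustrative.
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.9)
baseline = sum(returns) / len(returns)  # average return as a simple baseline
print(returns)
print(reinforce_loss([-0.7, -1.2, -0.3], returns, baseline))
```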
✅ Why REINFORCE matters:
• Simple and unbiased: A pure Monte Carlo estimator of the policy gradient.
• No critic needed
• Conceptual foundation for many modern algorithms like GRPO.