GRPO-Guard
Last updated: 05/08/2026.
GRPO-Guard (paper) is an extension of Flow-GRPO that stabilizes the importance-ratio estimate used in the policy loss. The standard Flow-GRPO ratio \(\rho = \exp(\log p_\theta - \log p_{\text{old}})\) can become numerically unbalanced when only a single Monte-Carlo noise sample \(z\) is used per denoising step, causing high-variance gradients and aggressive clipping.
GRPO-Guard adds a ratio-mean bias correction that explicitly penalises drift in the reverse-SDE proposal mean of the current policy relative to the rollout policy, and rescales the per-step loss by \(1 / (\sqrt{-dt})^2\) so the gradient magnitude is consistent across denoising steps.
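The drift that the bias correction targets can be seen from a short calculation. This is a sketch under one assumption that this page does not state: each denoising transition is Gaussian, with mean \(\mu\) and a standard deviation \(s_t = \sigma_t \sqrt{-dt}\) shared by both policies, where \(\sigma_t\) is the SDE noise scale. With \(\Delta\mu = \mu_\theta - \mu_{\text{old}}\) and a rollout sample \(x_{t-1} = \mu_{\text{old}} + s_t z\), \(z \sim \mathcal{N}(0, I)\):

\[
\log \rho
= -\frac{\lVert x_{t-1} - \mu_\theta \rVert^2 - \lVert x_{t-1} - \mu_{\text{old}} \rVert^2}{2 s_t^2}
= \frac{\langle \Delta\mu,\, z \rangle}{s_t} - \frac{\lVert \Delta\mu \rVert^2}{2 s_t^2},
\qquad
\mathbb{E}_z[\log \rho] = -\frac{\lVert \Delta\mu \rVert^2}{2 s_t^2} \le 0 .
\]

Under this assumption the log-ratio is centred below zero whenever the two means differ, and both the offset and the noise term grow as \(s_t\) shrinks, so a single sample of \(z\) gives a ratio whose scale varies from step to step. Whether the \(b_t\) term implemented in the code matches this expression exactly is not confirmed here.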
Algorithm
For step \(t\), let \(\mu_\theta(x_t)\) be the proposal mean from the current policy, \(\mu_{\text{old}}(x_t)\) the proposal mean from the rollout policy, \(\sigma_t = \mathrm{std\_dev\_t}\) the SDE noise scale, and \(\sqrt{-dt}\) the square root of the negated time step.
The squared norm in the bias term \(b_t\) is averaged over the channel and spatial dimensions
of the latent (see GRPOGuardLoss in
verl_omni/trainer/diffusion/diffusion_algos.py).
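The effect of a mean-drift bias term can be illustrated with a small pure-Python sketch. It is not the GRPOGuardLoss implementation. The function names are hypothetical. The Gaussian transition with standard deviation \(\sigma_t \sqrt{-dt}\), the element-wise averaging of the log-ratio, and the specific form of the bias are all assumptions made for illustration.

```python
import math


def mean_sq(values):
    """Mean of squared entries (stands in for averaging over channel/spatial dims)."""
    return sum(v * v for v in values) / len(values)


def log_ratio(x_prev, mu_new, mu_old, sigma_t, dt):
    """Log importance ratio for two Gaussians sharing std sigma_t * sqrt(-dt).

    Assumption: squared errors are averaged over elements rather than summed.
    """
    var = (sigma_t ** 2) * (-dt)
    err_new = mean_sq([x - m for x, m in zip(x_prev, mu_new)])
    err_old = mean_sq([x - m for x, m in zip(x_prev, mu_old)])
    return -(err_new - err_old) / (2.0 * var)


def mean_bias(mu_new, mu_old, sigma_t, dt):
    """Hypothetical bias term: averaged squared mean drift over 2 * variance."""
    var = (sigma_t ** 2) * (-dt)
    return mean_sq([a - b for a, b in zip(mu_new, mu_old)]) / (2.0 * var)


def corrected_log_ratio(x_prev, mu_new, mu_old, sigma_t, dt):
    """Log ratio with the drift offset added back, so it is centred at zero."""
    return log_ratio(x_prev, mu_new, mu_old, sigma_t, dt) + mean_bias(
        mu_new, mu_old, sigma_t, dt
    )


# Toy flattened latent with four elements; dt is negative, as in reverse-time sampling.
mu_old = [0.0, 1.0, -1.0, 0.5]
mu_new = [0.1, 0.9, -1.2, 0.5]
sigma_t, dt = 0.7, -0.05
step_std = sigma_t * math.sqrt(-dt)

# An antithetic noise pair (+z, -z) cancels the noise term and exposes the offset.
z = [0.3, -1.1, 0.4, 2.0]
x_plus = [m + step_std * n for m, n in zip(mu_old, z)]
x_minus = [m - step_std * n for m, n in zip(mu_old, z)]

raw_avg = 0.5 * (
    log_ratio(x_plus, mu_new, mu_old, sigma_t, dt)
    + log_ratio(x_minus, mu_new, mu_old, sigma_t, dt)
)
corrected_avg = 0.5 * (
    corrected_log_ratio(x_plus, mu_new, mu_old, sigma_t, dt)
    + corrected_log_ratio(x_minus, mu_new, mu_old, sigma_t, dt)
)
print("raw mean log-ratio:", raw_avg)
print("corrected mean log-ratio:", corrected_avg)
```

In this toy setting the uncorrected log-ratio averages to the negative of the bias term, while the corrected one averages to zero, which is the sense in which the ratio becomes balanced around 1.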
Configuration
GRPO-Guard reuses the entire Flow-GRPO training stack; only the actor loss mode changes. Refer to Flow-GRPO for advantage estimator, rollout, sampling, batch-size, and reward configuration.
To enable GRPO-Guard:
actor_rollout_ref.actor.diffusion_loss.loss_mode=grpo_guard
actor_rollout_ref.rollout.algo.sde_type=sde
A small clip ratio typically works well with the additional bias term:
actor_rollout_ref.actor.diffusion_loss.clip_ratio=2e-6
KL regularisation against a frozen reference policy works the same way
as in Flow-GRPO (actor_rollout_ref.actor.use_kl_loss=True,
actor_rollout_ref.actor.kl_loss_coef=...).
Example script
A 4-card collocated training script is provided:
bash examples/grpoguard_trainer/run_qwen_image_ocr_lora.sh
It reuses the Flow-GRPO Qwen-Image OCR setup and changes only the actor loss mode, the clip ratio, and the experiment name. Dataset and model preparation follow the same instructions as the Flow-GRPO quick-start.