GRPO-Guard

Last updated: 05/08/2026.

GRPO-Guard (paper) is an extension of Flow-GRPO that stabilizes the importance-ratio estimate used in the policy loss. The standard Flow-GRPO ratio \(\rho = \exp(\log p_\theta - \log p_{\text{old}})\) can become numerically unbalanced when only a single Monte-Carlo noise sample \(z\) is used per denoising step, causing high-variance gradients and aggressive clipping.

GRPO-Guard adds a ratio-mean bias correction that explicitly penalises drift in the reverse-SDE proposal mean of the current policy relative to the rollout policy, and rescales the per-step loss by \(1 / (\sqrt{-dt})^2\) so the gradient magnitude is consistent across denoising steps.

Algorithm

For step \(t\) with proposal mean \(\mu_\theta(x_t)\) from the current policy and \(\mu_{\text{old}}(x_t)\) from the rollout policy, SDE noise scale \(\sigma_t = \mathrm{std\\_dev\\_t}\), and \(\sqrt{-dt}\):

\[ b_t = \frac{\lVert \mu_\theta - \mu_{\text{old}} \rVert_{\text{mean}}^2} {2 (\sqrt{-dt}\, \sigma_t)^2} \]
\[ \rho_t = \exp\big((\log p_\theta - \log p_{\text{old}} + b_t) \cdot (\sqrt{-dt}\, \sigma_t)\big) \]
\[ \mathcal{L}^{\text{guard}}_t = \frac{1}{(\sqrt{-dt})^2}\; \mathbb{E}\big[\max(-A_t \rho_t,\ -A_t \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon))\big] \]

The squared-norm in \(b_t\) is averaged over the channel and spatial dimensions of the latent (see GRPOGuardLoss in verl_omni/trainer/diffusion/diffusion_algos.py).

Configuration

GRPO-Guard reuses the entire Flow-GRPO training stack — only the actor loss mode changes. Refer to Flow-GRPO for advantage estimator, rollout, sampling, batch-size, and reward configuration.

To enable GRPO-Guard:

  • actor_rollout_ref.actor.diffusion_loss.loss_mode=grpo_guard

  • actor_rollout_ref.rollout.algo.sde_type=sde

A typical small clip ratio works well with the additional bias term:

  • actor_rollout_ref.actor.diffusion_loss.clip_ratio=2e-6

KL regularisation against a frozen reference policy still works the same way as Flow-GRPO (actor_rollout_ref.actor.use_kl_loss=True, actor_rollout_ref.actor.kl_loss_coef=...).

Example script

A 4-card collocated training script is provided:

bash examples/grpoguard_trainer/run_qwen_image_ocr_lora.sh

It reuses the Flow-GRPO Qwen-Image OCR setup and only flips the actor loss mode, the clip ratio, and the experiment name. Dataset and model preparation follow the same instructions as the Flow-GRPO quick-start.

References