# GRPO-Guard Last updated: 05/08/2026. GRPO-Guard ([paper](https://arxiv.org/abs/2510.22319)) is an extension of [Flow-GRPO](flowgrpo.md) that stabilizes the importance-ratio estimate used in the policy loss. The standard Flow-GRPO ratio $\rho = \exp(\log p_\theta - \log p_{\text{old}})$ can become numerically unbalanced when only a single Monte-Carlo noise sample $z$ is used per denoising step, causing high-variance gradients and aggressive clipping. GRPO-Guard adds a **ratio-mean bias** correction that explicitly penalises drift in the reverse-SDE proposal mean of the current policy relative to the rollout policy, and rescales the per-step loss by $1 / (\sqrt{-dt})^2$ so the gradient magnitude is consistent across denoising steps. ## Algorithm For step $t$ with proposal mean $\mu_\theta(x_t)$ from the current policy and $\mu_{\text{old}}(x_t)$ from the rollout policy, SDE noise scale $\sigma_t = \mathrm{std\\_dev\\_t}$, and $\sqrt{-dt}$: $$ b_t = \frac{\lVert \mu_\theta - \mu_{\text{old}} \rVert_{\text{mean}}^2} {2 (\sqrt{-dt}\, \sigma_t)^2} $$ $$ \rho_t = \exp\big((\log p_\theta - \log p_{\text{old}} + b_t) \cdot (\sqrt{-dt}\, \sigma_t)\big) $$ $$ \mathcal{L}^{\text{guard}}_t = \frac{1}{(\sqrt{-dt})^2}\; \mathbb{E}\big[\max(-A_t \rho_t,\ -A_t \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon))\big] $$ The squared-norm in $b_t$ is averaged over the channel and spatial dimensions of the latent (see `GRPOGuardLoss` in [`verl_omni/trainer/diffusion/diffusion_algos.py`](../../verl_omni/trainer/diffusion/diffusion_algos.py)). ## Configuration GRPO-Guard reuses the entire Flow-GRPO training stack — only the actor loss mode changes. Refer to [Flow-GRPO](flowgrpo.md) for advantage estimator, rollout, sampling, batch-size, and reward configuration. To enable GRPO-Guard: - `actor_rollout_ref.actor.diffusion_loss.loss_mode=grpo_guard` - `actor_rollout_ref.rollout.algo.sde_type=sde` A typical small clip ratio works well with the additional bias term: - `actor_rollout_ref.actor.diffusion_loss.clip_ratio=2e-6` KL regularisation against a frozen reference policy still works the same way as Flow-GRPO (`actor_rollout_ref.actor.use_kl_loss=True`, `actor_rollout_ref.actor.kl_loss_coef=...`). ## Example script A 4-card collocated training script is provided: ```bash bash examples/grpoguard_trainer/run_qwen_image_ocr_lora.sh ``` It reuses the Flow-GRPO Qwen-Image OCR setup and only flips the actor loss mode, the clip ratio, and the experiment name. Dataset and model preparation follow the same instructions as the [Flow-GRPO quick-start](../start/flowgrpo_quickstart.md). ## References - [Flow-GRPO: Online policy gradient RL for flow matching models](https://arxiv.org/abs/2505.05470) - [GRPO-Guard: ratio-bias regularisation for diffusion-model RL](https://arxiv.org/abs/2510.22319)