# Flow-DPPO Last updated: 06/12/2026. Flow-DPPO ([paper](https://arxiv.org/abs/2606.11025)) is an extension of [Flow-GRPO](flowgrpo.md) that replaces PPO-style ratio clipping with an asymmetric divergence mask. It keeps Flow-GRPO's stochastic reverse-SDE rollout, group-relative advantages, and per-step log-prob ratios. ## Algorithm Flow-GRPO uses PPO ratio clipping to approximate a trust region, but the ratio at each denoising step is a single noisy sample from a high-dimensional Gaussian transition. Flow-DPPO uses the flow-model structure directly: the old rollout policy and the current replayed policy have the same SDE variance, so their per-step divergence is the Gaussian KL between transition means. For step $t$, let $\mu_{\text{old}}(x_t)$ be the frozen rollout transition mean, $\mu_\theta(x_t)$ be the current replayed transition mean, and $\sigma_t = \mathrm{std\\_dev\\_t}\sqrt{-dt}$ be the SDE transition noise scale. Flow-DPPO measures the mean-reduced policy drift: $$ D_t = \frac{\lVert \mu_\theta - \mu_{\text{old}} \rVert_{\text{mean}}^2} {2 \sigma_t^2} $$ Here $D_t$ is the exact per-step KL-like divergence between the current policy and the rollout policy for the Gaussian SDE transition. It replaces Flow-GRPO's ratio-clipping proxy with a direct trust-region check. The update is masked only when it is outside the divergence threshold and still moving farther from the rollout policy: - positive advantage, $\rho_t > 1$, and $D_t > \epsilon_D$ - negative advantage, $\rho_t < 1$, and $D_t > \epsilon_D$ Corrective updates stay active. See `FlowDPPOLoss` in [`verl_omni/trainer/diffusion/diffusion_algos.py`](../../verl_omni/trainer/diffusion/diffusion_algos.py). ## Configuration Flow-DPPO reuses the entire Flow-GRPO training stack — only the actor loss mode and divergence threshold change. Refer to [Flow-GRPO](flowgrpo.md) for advantage estimator, rollout, sampling, batch-size, and reward configuration. To enable Flow-DPPO: - `algorithm.adv_estimator=flow_grpo` - `actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_dppo` - `actor_rollout_ref.actor.diffusion_loss.kl_mask_threshold=1e-5` - `actor_rollout_ref.rollout.algo.sde_type=sde` `actor_rollout_ref.actor.diffusion_loss.add_kl_coefficient=True` normalizes the mean drift by the scheduler's SDE noise scale `std_dev_t * sqrt_dt`, matching the Flow-SDE log-prob variance used during Qwen-Image training. ## Example script A 4-card collocated training script is provided: ```bash bash examples/flowdppo_trainer/run_qwen_image_ocr_lora.sh ``` It reuses the Flow-GRPO Qwen-Image OCR setup and only flips the actor loss mode, the divergence threshold, and the experiment name. Dataset and model preparation follow the same instructions as the [Flow-GRPO quick-start](../start/flowgrpo_quickstart.md). ## References - [Flow-GRPO: Online policy gradient RL for flow matching models](https://arxiv.org/abs/2505.05470) - [Flow-DPPO: Divergence proximal policy optimization for diffusion models](https://arxiv.org/abs/2606.11025) - [UniRL Flow-DPPO implementation](https://github.com/Tencent-Hunyuan/UniRL/blob/main/unirl/algorithms/flowdppo.py) - [GRPO-Guard: ratio-bias regularisation for diffusion-model RL](https://arxiv.org/abs/2510.22319)