Flow-DPPO

Last updated: 06/12/2026.

Flow-DPPO (paper) is an extension of Flow-GRPO that replaces PPO-style ratio clipping with an asymmetric divergence mask. It keeps Flow-GRPO’s stochastic reverse-SDE rollout, group-relative advantages, and per-step log-prob ratios.

Algorithm

Flow-GRPO uses PPO ratio clipping to approximate a trust region, but the ratio at each denoising step is a single noisy sample from a high-dimensional Gaussian transition. Flow-DPPO uses the flow-model structure directly: the old rollout policy and the current replayed policy have the same SDE variance, so their per-step divergence is the Gaussian KL between transition means.

For step \(t\), let \(\mu_{\text{old}}(x_t)\) be the frozen rollout transition mean, \(\mu_\theta(x_t)\) be the current replayed transition mean, and \(\sigma_t = \mathrm{std\\_dev\\_t}\sqrt{-dt}\) be the SDE transition noise scale. Flow-DPPO measures the mean-reduced policy drift:

\[ D_t = \frac{\lVert \mu_\theta - \mu_{\text{old}} \rVert_{\text{mean}}^2} {2 \sigma_t^2} \]

Here \(D_t\) is the exact per-step KL-like divergence between the current policy and the rollout policy for the Gaussian SDE transition. It replaces Flow-GRPO’s ratio-clipping proxy with a direct trust-region check. The update is masked only when it is outside the divergence threshold and still moving farther from the rollout policy:

positive advantage, \(\rho_t > 1\), and \(D_t > \epsilon_D\)
negative advantage, \(\rho_t < 1\), and \(D_t > \epsilon_D\)

Corrective updates stay active. See FlowDPPOLoss in verl_omni/trainer/diffusion/diffusion_algos.py.

Configuration

Flow-DPPO reuses the entire Flow-GRPO training stack — only the actor loss mode and divergence threshold change. Refer to Flow-GRPO for advantage estimator, rollout, sampling, batch-size, and reward configuration.

To enable Flow-DPPO:

algorithm.adv_estimator=flow_grpo
actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_dppo
actor_rollout_ref.actor.diffusion_loss.kl_mask_threshold=1e-5
actor_rollout_ref.rollout.algo.sde_type=sde

actor_rollout_ref.actor.diffusion_loss.add_kl_coefficient=True normalizes the mean drift by the scheduler’s SDE noise scale std_dev_t * sqrt_dt, matching the Flow-SDE log-prob variance used during Qwen-Image training.

Example script

A 4-card collocated training script is provided:

bash examples/flowdppo_trainer/run_qwen_image_ocr_lora.sh

It reuses the Flow-GRPO Qwen-Image OCR setup and only flips the actor loss mode, the divergence threshold, and the experiment name. Dataset and model preparation follow the same instructions as the Flow-GRPO quick-start.

Flow-DPPO

Algorithm

Configuration

Example script

References