Flow-GRPO
Last updated: 05/13/2026.
Flow-GRPO (paper, code) is the first method to integrate online policy gradient reinforcement learning into flow matching generative models (e.g., Stable Diffusion 3, FLUX). It enables direct reward optimization for tasks such as compositional text-to-image generation, visual text rendering, and human preference alignment, without modifying the standard inference pipeline.
Two core technical contributions make this possible:
ODE-to-SDE Conversion: Flow matching models natively use a deterministic ODE sampler. Flow-GRPO converts this ODE into an equivalent SDE that preserves the model’s marginal distribution at every timestep. This introduces the stochasticity required for group sampling and RL exploration.
Denoising Reduction: Training on all denoising steps is expensive. Flow-GRPO uses fewer denoising steps during training while keeping the original number of steps at inference, which significantly improves sampling efficiency without sacrificing reward performance.
Empirically, RL-tuned SD3.5-M with Flow-GRPO raises GenEval accuracy from 63% to 95% and visual text rendering accuracy from 59% to 92%.
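The ODE-to-SDE conversion can be sketched as a short derivation. It assumes the rectified-flow convention $x_t = (1-t)\,x_0 + t\,\epsilon$ (so $t = 1$ is pure noise and sampling runs from $t = 1$ toward $t = 0$) and a velocity field $v_t$. The symbols $\sigma_t$ (a noise schedule) and $w$ (a Wiener process) are notation chosen here for illustration; consult the paper for the exact form and schedule used in training.

```latex
% Deterministic ODE used by the standard sampler
\mathrm{d}x_t = v_t(x_t)\,\mathrm{d}t

% A reverse-time SDE with the same marginals p_t at every t:
% inject noise and compensate with a score-based drift correction
\mathrm{d}x_t = \Bigl[v_t(x_t) - \tfrac{\sigma_t^2}{2}\,\nabla \log p_t(x_t)\Bigr]\mathrm{d}t + \sigma_t\,\mathrm{d}w

% Under the convention above, the score can be expressed through the velocity
\nabla \log p_t(x_t) = -\frac{x_t + (1-t)\,v_t(x_t)}{t}

% Substituting gives a drift that depends only on x_t and v_t
\mathrm{d}x_t = \Bigl[v_t(x_t) + \frac{\sigma_t^2}{2t}\bigl(x_t + (1-t)\,v_t(x_t)\bigr)\Bigr]\mathrm{d}t + \sigma_t\,\mathrm{d}w
```

Setting $\sigma_t = 0$ recovers the original ODE, which is why the standard inference pipeline remains usable after training.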
Key Components
Flow Matching Backbone: operates on continuous-time flow matching models (e.g., SD3.5, FLUX) rather than discrete-token LLMs.
ODE-to-SDE Rollout: generates a group of diverse image trajectories by injecting controlled noise via SDE sampling at selected denoising steps.
Denoising Reduction: trains on a reduced subset of denoising steps (configurable via sde_window_size and sde_window_range) while inference uses the full step count.
Image Reward Models: rewards are assigned by external reward models (e.g., GenEval, OCR, PickScore, aesthetic score) rather than rule-based verifiers.
No Critic: like GRPO for LLMs, no separate value network is trained; advantages are computed from group-relative rewards.
Key Differences: GRPO vs. Flow-GRPO
| Dimension | GRPO (LLM) | Flow-GRPO (Diffusion) |
|---|---|---|
| Model type | Autoregressive language model | Flow matching / diffusion model |
| Action space | Discrete token sequences | Continuous denoising trajectories (SDE paths) |
| Rollout mechanism | Sample | Convert ODE to SDE; sample |
| Log-probability | Standard next-token log-prob | Log-prob of the SDE noise prediction at each selected denoising step |
| Training steps | All decoding steps are trivially identical in cost | Denoising Reduction: train on a small window of steps, infer with full steps |
| Reward signal | Rule-based verifiers or LLM judges on text | Image reward models (GenEval, OCR, PickScore, aesthetic, etc.) |
| KL regularization | KL penalty added to reward or directly to loss | KL-style regularization is available, but the exact setup depends on the training config |
| CFG (guidance) | Not applicable | CFG distillation occurs naturally; CFG can be disabled at both train and test time |
| Advantage estimator | | flow_grpo |
| Loss mode | | flow_grpo |
Configuration
Diffusion training now uses dedicated diffusion config blocks. In verl_omni/trainer/config/diffusion_trainer.yaml,
the main sections are:
algorithm: diffusion-specific advantage computation and normalization
actor_rollout_ref.actor: optimization and diffusion loss settings
actor_rollout_ref.rollout: rollout backend, sampling, and SDE controls
actor_rollout_ref.model: model path plus diffusion-model / LoRA settings
reward: reward manager, reward model, and custom reward function
The default diffusion model YAML mirrors rollout fields (pipeline and algo) into actor_rollout_ref.model.*, so in practice
the rollout section is the main place to override sampling behavior.
Core parameters
Algorithm
algorithm.adv_estimator: Set to flow_grpo.
Actor / loss
actor_rollout_ref.actor.diffusion_loss.loss_mode: Set to flow_grpo.
actor_rollout_ref.actor.diffusion_loss.clip_ratio: Clipping factor used in the diffusion loss.
actor_rollout_ref.actor.diffusion_loss.adv_clip_max: Maximum absolute advantage used before computing the policy loss.
actor_rollout_ref.actor.use_kl_loss: Enables KL loss against the reference policy.
actor_rollout_ref.actor.kl_loss_coef: Coefficient for the KL term when KL loss is enabled.
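To make the roles of clip_ratio and adv_clip_max concrete, here is a minimal sketch of a PPO-style clipped loss for one sample at one denoising step. It assumes the usual GRPO form of the objective; the function name and the numeric values are hypothetical, and the actual loss in the trainer may differ in detail.

```python
import math

def flow_grpo_step_loss(logp_new, logp_old, advantage, clip_ratio, adv_clip_max):
    """Clipped policy loss for one sample at one denoising step (illustrative)."""
    # Bound the advantage magnitude before it enters the loss (adv_clip_max).
    adv = max(-adv_clip_max, min(adv_clip_max, advantage))
    # Importance ratio between the current policy and the rollout-time policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_ratio, 1 + clip_ratio].
    clipped_ratio = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    # Pessimistic (PPO-style) objective, written as a loss to minimize.
    return max(-adv * ratio, -adv * clipped_ratio)

# Hypothetical values: the ratio is 2.0, so clipping caps its effect at 1.2.
loss = flow_grpo_step_loss(math.log(2.0), 0.0, advantage=1.0,
                           clip_ratio=0.2, adv_clip_max=5.0)
```

When use_kl_loss is enabled, a KL term weighted by kl_loss_coef would be added on top of this quantity.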
Rollout / sampling
actor_rollout_ref.rollout.name: Selects the rollout backend. Currently supports vllm_omni.
actor_rollout_ref.rollout.n: Number of sampled image trajectories per prompt. This is the FlowGRPO group size and should be greater than 1.
actor_rollout_ref.rollout.algo.noise_level: Magnitude of SDE noise injected during rollout. Larger values increase diversity but can hurt image quality.
actor_rollout_ref.rollout.algo.sde_type: SDE variant for rollout. The current example uses sde.
actor_rollout_ref.rollout.algo.sde_window_size: Number of denoising steps included in the active training window. Smaller values reduce training cost.
actor_rollout_ref.rollout.algo.sde_window_range: Range used to sample the start of that active denoising window.
actor_rollout_ref.rollout.pipeline.num_inference_steps: Number of denoising steps used for rollout generation during training.
actor_rollout_ref.rollout.val_kwargs.pipeline.num_inference_steps: Number of denoising steps used during validation / evaluation.
actor_rollout_ref.rollout.pipeline.true_cfg_scale: True classifier-free guidance scale used during rollout. Used in Qwen-Image.
actor_rollout_ref.rollout.pipeline.guidance_scale: Distilled guidance scale for models that expose a guidance embedding; keep null to disable it.
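The interaction between sde_window_size and sde_window_range can be pictured with the sketch below. It assumes that the range is a half-open interval of denoising-step indices and that the window start is drawn uniformly so the whole window fits inside it. These semantics and the helper name sample_sde_window are assumptions for illustration, not a description of the rollout engine.

```python
import random

def sample_sde_window(sde_window_size, sde_window_range, rng):
    """Pick the denoising-step indices that use SDE sampling (assumed semantics)."""
    lo, hi = sde_window_range
    if sde_window_size > hi - lo:
        raise ValueError("window does not fit inside the range")
    # Assumption: the start is uniform over positions where the
    # whole window fits inside the half-open interval [lo, hi).
    start = rng.randint(lo, hi - sde_window_size)
    return list(range(start, start + sde_window_size))

rng = random.Random(0)  # seeded only to make the sketch reproducible
window = sample_sde_window(sde_window_size=2, sde_window_range=(0, 5), rng=rng)
```

Steps outside the returned window would follow the deterministic sampler and contribute no training signal, which is where the cost reduction comes from.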
Model
actor_rollout_ref.model.path: Base diffusion model path.
actor_rollout_ref.model.tokenizer_path: Optional tokenizer path if it is not located under the model path.
actor_rollout_ref.model.lora_rank: LoRA rank. Set to a positive integer to enable LoRA fine-tuning (e.g., 64).
actor_rollout_ref.model.lora_alpha: LoRA scaling factor (default 64).
actor_rollout_ref.model.lora_init_weights: LoRA initialization method (default "gaussian").
actor_rollout_ref.model.target_modules: Target modules for LoRA (default "all-linear").
actor_rollout_ref.model.lora_dtype: Optional dtype to convert LoRA parameters to for numerical stability during training (e.g., "fp32", "bf16"). Default null means no conversion.
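Because these parameters live in nested config sections but are usually overridden as dotted key=value pairs, a small helper can show how the two views line up. The helper to_overrides is hypothetical, and the values are taken from examples elsewhere on this page rather than from any required setup.

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted prefix.
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides

# Illustrative values only; pick settings that match your own run.
config = {
    "algorithm": {"adv_estimator": "flow_grpo"},
    "actor_rollout_ref": {
        "actor": {"diffusion_loss": {"loss_mode": "flow_grpo"}},
        "rollout": {"name": "vllm_omni", "n": 16, "algo": {"sde_window_size": 2}},
        "model": {"lora_rank": 64, "lora_alpha": 64},
    },
}
overrides = to_overrides(config)
```

Joining the resulting strings with spaces gives an argument list in the same style as the overrides shown in the example scripts.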
Batch size
FlowGRPO uses three nested batch-size parameters that operate at different stages of the training loop. They address different concerns (RL sample diversity, multi-epoch reuse, and GPU memory) and must be understood together.
Step 1 — Rollout (data.train_batch_size)
data.train_batch_size is the number of unique prompts drawn from the
dataset per training step. Before rollout, each prompt is replicated
actor_rollout_ref.rollout.n times so that the rollout engine generates n
independent image trajectories per prompt. The in-memory batch after rollout
therefore holds train_batch_size × n image samples. GRPO advantage
normalization runs over this full batch — it needs all n trajectories
for every prompt to compute group-relative rewards before any splitting occurs.
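The replication and group-relative normalization described in Step 1 can be sketched as follows. This is a minimal illustration that assumes mean/standard-deviation normalization within each prompt group; the function name, the use of the population standard deviation, the eps value, and the reward numbers are all assumptions rather than the trainer's actual choices.

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_relative_advantages(prompt_ids, rewards, eps=1e-6):
    """Normalize each reward against the other samples of the same prompt."""
    groups = defaultdict(list)
    for pid, reward in zip(prompt_ids, rewards):
        groups[pid].append(reward)
    # Per-prompt mean and (population) standard deviation.
    stats = {pid: (mean(rs), pstdev(rs)) for pid, rs in groups.items()}
    return [(r - stats[pid][0]) / (stats[pid][1] + eps)
            for pid, r in zip(prompt_ids, rewards)]

train_batch_size, n = 2, 4  # small illustrative values
# Each prompt index is repeated n times, giving train_batch_size * n samples.
prompt_ids = [p for p in range(train_batch_size) for _ in range(n)]
rewards = [0.0, 1.0, 1.0, 0.0, 0.5, 0.5, 0.5, 0.5]  # made-up reward values
advantages = group_relative_advantages(prompt_ids, rewards)
```

The second prompt shows why group diversity matters: when all n samples receive the same reward, every advantage in that group is zero and the prompt contributes no gradient.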
Step 2 — Actor update (actor_rollout_ref.actor.ppo_mini_batch_size)
ppo_mini_batch_size controls how the full post-rollout batch is sliced for
actor gradient updates. Important: this value is specified in prompts,
not image samples. The trainer internally scales it by rollout.n to get
the actual mini-batch size in samples:
effective mini-batch = ppo_mini_batch_size × rollout.n (image samples)
number of mini-batches per epoch = train_batch_size / ppo_mini_batch_size
All n trajectories belonging to the same prompt are kept in the same
mini-batch. This is not optional: although advantages are already computed
globally before this split, the gradient update for each image depends on its
advantage relative to the other images in its group. Scattering a prompt’s
trajectories across different mini-batches would break that correspondence.
ppo_mini_batch_size must divide train_batch_size evenly.
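A short sketch of the Step 2 slicing, under the assumption that samples are stored prompt-major (all n trajectories of a prompt are adjacent). The helper name and the storage layout are illustrative and not taken from the trainer.

```python
def split_into_mini_batches(train_batch_size, n, ppo_mini_batch_size):
    """Return lists of sample indices, one list per mini-batch (illustrative)."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("ppo_mini_batch_size must divide train_batch_size")
    samples_per_mini_batch = ppo_mini_batch_size * n
    total_samples = train_batch_size * n
    # Assumption: prompt p owns the contiguous sample indices [p*n, (p+1)*n),
    # so cutting at multiples of ppo_mini_batch_size * n never splits a group.
    return [
        list(range(start, start + samples_per_mini_batch))
        for start in range(0, total_samples, samples_per_mini_batch)
    ]

mini_batches = split_into_mini_batches(train_batch_size=32, n=16,
                                       ppo_mini_batch_size=16)
```

With these values the call yields 2 mini-batches of 256 samples each, and every prompt's 16 trajectories stay together.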
Step 3 — FSDP sharding and gradient accumulation
(actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu)
Each mini-batch is distributed across GPUs by FSDP data parallelism, so each
GPU receives (ppo_mini_batch_size × n) / n_gpus image samples. That
per-GPU shard is then chunked into micro-batches of
ppo_micro_batch_size_per_gpu for the actual forward/backward passes, with
gradients accumulated across chunks before the optimizer step. This is pure
gradient accumulation: the effective gradient is identical to running the full
per-GPU shard in one shot; only peak activation memory changes.
For diffusion models the accumulation is two-dimensional: the engine also loops over each active denoising timestep inside every micro-batch, so the total gradient accumulation steps per GPU per mini-batch is:
gradient_accumulation_steps = (per_gpu_samples / ppo_micro_batch_size_per_gpu)
× sde_window_size
ppo_micro_batch_size_per_gpu must satisfy:
(ppo_mini_batch_size × n) / n_gpus is divisible by
ppo_micro_batch_size_per_gpu.
Concrete walkthrough (reference OCR script, 4 GPUs, sde_window_size=2):
data.train_batch_size = 32 # 32 prompts loaded
actor_rollout_ref.rollout.n = 16 # 16 images generated per prompt
→ post-rollout batch = 512 # advantage computed over all 512
ppo_mini_batch_size (config) = 16 # in prompts
→ effective mini-batch = 16 × 16 = 256 samples
→ mini-batches per epoch = 512 / 256 = 2 actor gradient steps
FSDP shards 256 samples across 4 GPUs:
→ per-GPU samples = 256 / 4 = 64
ppo_micro_batch_size_per_gpu = 16
→ micro-batches per GPU = 64 / 16 = 4
→ gradient_accumulation_steps = 4 × 2 (sde_window_size) = 8
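The same arithmetic can be checked programmatically before launching a run. The sketch below is a hypothetical helper (batch_plan is not part of the trainer) that applies the formulas and divisibility constraints from Steps 1 to 3, assuming sequence parallelism is disabled so that every GPU is a data-parallel rank.

```python
def batch_plan(train_batch_size, n, ppo_mini_batch_size, n_gpus,
               ppo_micro_batch_size_per_gpu, sde_window_size):
    """Derive the batch quantities described above and check divisibility."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("ppo_mini_batch_size must divide train_batch_size")
    effective_mini_batch = ppo_mini_batch_size * n  # in image samples
    if effective_mini_batch % n_gpus != 0:
        raise ValueError("mini-batch samples must split evenly across GPUs")
    per_gpu_samples = effective_mini_batch // n_gpus
    if per_gpu_samples % ppo_micro_batch_size_per_gpu != 0:
        raise ValueError("per-GPU shard must be divisible by the micro-batch size")
    micro_batches = per_gpu_samples // ppo_micro_batch_size_per_gpu
    return {
        "post_rollout_batch": train_batch_size * n,
        "effective_mini_batch": effective_mini_batch,
        "mini_batches_per_epoch": train_batch_size // ppo_mini_batch_size,
        "per_gpu_samples": per_gpu_samples,
        "micro_batches_per_gpu": micro_batches,
        # Each micro-batch is also looped over every active denoising step.
        "gradient_accumulation_steps": micro_batches * sde_window_size,
    }

plan = batch_plan(train_batch_size=32, n=16, ppo_mini_batch_size=16, n_gpus=4,
                  ppo_micro_batch_size_per_gpu=16, sde_window_size=2)
```

For the reference values this reproduces the walkthrough: 512 post-rollout samples, 2 mini-batches of 256, 64 samples per GPU, 4 micro-batches, and 8 accumulation steps.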
Reward
reward.reward_manager.name: Selects the reward manager.
reward.custom_reward_function.path and reward.custom_reward_function.name: Register the task-specific reward post-processing function, such as compute_score_ocr.
For an end-to-end OCR training walkthrough, including dataset preparation and
the full runnable command, see docs/start/flowgrpo_quickstart.md.
Reference Example
Standard LoRA training with OCR reward (Qwen-Image, 4 GPUs) using the current
vllm_omni rollout example:
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh
Variants
Rule-Based Reward Training: JPEG incompressibility
FlowGRPO also supports rule-based rewards that score images directly without a
VLM reward model, reusing the default VisualRewardManager from
verl_omni/trainer/config/reward/reward.yaml.
verl_omni/utils/reward_score/jpeg_compressibility.py rewards images that are
harder to JPEG-compress (richer texture, more complex content). No extra
dependencies or reward model process are required.
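The idea behind an incompressibility reward can be illustrated with a standard-library stand-in. The sketch below deliberately uses zlib instead of JPEG so that it runs without an imaging library, and it is not the implementation in jpeg_compressibility.py; the function name and the scoring formula are assumptions.

```python
import os
import zlib

def incompressibility_score(pixel_bytes):
    """Compressed size divided by raw size; higher means harder to compress."""
    if not pixel_bytes:
        raise ValueError("empty image buffer")
    # zlib is a stand-in for JPEG here, chosen only to avoid extra dependencies.
    return len(zlib.compress(pixel_bytes)) / len(pixel_bytes)

flat_image = bytes([128]) * (64 * 64 * 3)  # uniform gray, highly compressible
noisy_image = os.urandom(64 * 64 * 3)      # random pixels, nearly incompressible

flat_score = incompressibility_score(flat_image)
noisy_score = incompressibility_score(noisy_image)
```

A uniform image scores close to zero while a noisy one scores near one, mirroring how richer texture leads to a higher reward.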
Minimal dataset row:
{
"data_source": "jpeg_compressibility",
"prompt": [{"role": "user", "content": "<your prompt>"}],
"reward_model": {"ground_truth": ""}, # required by schema, ignored by scorer
}
Config changes relative to the OCR example — remove these lines:
reward.reward_model.enable=True
reward.reward_model.model_path=...
reward.reward_model.rollout.name=...
reward.reward_model.rollout.tensor_model_parallel_size=...
reward.custom_reward_function.path=...
reward.custom_reward_function.name=...
Keep all actor/rollout settings unchanged; the visual reward manager is loaded from the default reward config.
Async Reward
For reward models that are expensive to evaluate (e.g., a VLM judge), the reward model can be allocated its own dedicated GPU resource pool and run asynchronously alongside the policy. This avoids blocking policy training on reward computation.
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_async_reward.sh
Full Model Training
A script is provided for full-weight Qwen-Image OCR training without CFG. The example runs on 4 NVIDIA H200 GPUs; enabling CFG requires more GPU resources.
bash examples/flowgrpo_trainer/run_qwen_image_ocr.sh
Sequence parallelism (Ulysses SP)
Ulysses SP is supported for diffusion model training and requires diffusers >= 0.38.0.
It shards the sequence dimension across GPUs within a SP group,
reducing per-GPU memory for long-sequence and high-resolution training.
actor_rollout_ref.actor.fsdp_config.ulysses_sequence_parallel_size: Number of GPUs in the SP group. Must be a divisor of the total GPU count. Set to 1 (default) to disable SP. Common values: 2, 4, 8.
When SP is enabled, FSDP data parallelism is automatically reduced:
dp_size = total_gpus / ulysses_sequence_parallel_size
For SP training, num_attention_heads must be divisible by
ulysses_sequence_parallel_size.
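The two SP constraints and the resulting data-parallel size can be checked with a few lines of Python. The helper sp_layout is hypothetical, and num_attention_heads=24 is an illustrative value, not a statement about any particular model.

```python
def sp_layout(total_gpus, ulysses_sequence_parallel_size, num_attention_heads):
    """Validate the SP constraints and derive the data-parallel size."""
    sp = ulysses_sequence_parallel_size
    if total_gpus % sp != 0:
        raise ValueError("SP size must divide the total GPU count")
    if num_attention_heads % sp != 0:
        raise ValueError("num_attention_heads must be divisible by the SP size")
    return {
        # FSDP data parallelism shrinks by the SP factor.
        "dp_size": total_gpus // sp,
        # Ulysses-style SP splits attention heads across the SP group.
        "heads_per_sp_rank": num_attention_heads // sp,
    }

layout = sp_layout(total_gpus=4, ulysses_sequence_parallel_size=2,
                   num_attention_heads=24)
```

Since dp_size replaces the GPU count in the batch-size arithmetic, the per-rank sample count and the micro-batch divisibility check are worth re-deriving whenever SP is turned on.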
A ready-to-use 4-GPU SP=2 example is provided:
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_sp2.sh
Citation
@article{liu2025flow,
title={Flow-GRPO: Training Flow Matching Models via Online RL},
author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
journal={arXiv preprint arXiv:2505.05470},
year={2025}
}