Flow-GRPO

Last updated: 05/13/2026.

Flow-GRPO (paper, code) is the first method to integrate online policy gradient reinforcement learning into flow matching generative models (e.g., Stable Diffusion 3, FLUX). It enables direct reward optimization for tasks such as compositional text-to-image generation, visual text rendering, and human preference alignment, without modifying the standard inference pipeline.

Two core technical contributions make this possible:

ODE-to-SDE Conversion: Flow matching models natively use a deterministic ODE sampler. Flow-GRPO converts this ODE into an equivalent SDE that preserves the model’s marginal distribution at every timestep. This introduces the stochasticity required for group sampling and RL exploration.
Denoising Reduction: Training on all denoising steps is expensive. Flow-GRPO reduces the number of training steps while keeping the original number of inference steps, significantly improving sampling efficiency without sacrificing reward performance.

Empirically, RL-tuned SD3.5-M with Flow-GRPO raises GenEval accuracy from 63% to 95% and visual text rendering accuracy from 59% to 92%.

Key Components

Flow Matching Backbone: operates on continuous-time flow matching models (e.g., SD3.5, FLUX) rather than discrete-token LLMs.
ODE-to-SDE Rollout: generates a group of diverse image trajectories by injecting controlled noise via SDE sampling at selected denoising steps.
Denoising Reduction: trains on a reduced subset of denoising steps (configurable via sde_window_size and sde_window_range) while inference uses the full step count.
Image Reward Models: rewards are assigned by external reward models (e.g., GenEval, OCR, PickScore, aesthetic score) rather than rule-based verifiers.
No Critic: like GRPO for LLMs, no separate value network is trained; advantages are computed from group-relative rewards.

Key Differences: GRPO vs. Flow-GRPO

Dimension	GRPO (LLM)	Flow-GRPO (Diffusion)
Model type	Autoregressive language model	Flow matching / diffusion model
Action space	Discrete token sequences	Continuous denoising trajectories (SDE paths)
Rollout mechanism	Sample `n` token sequences per prompt	Convert ODE to SDE; sample `n` image trajectories per prompt via stochastic denoising
Log-probability	Standard next-token log-prob	Log-prob of the SDE noise prediction at each selected denoising step
Training steps	All decoding steps are trivially identical in cost	Denoising Reduction: train on a small window of steps, infer with full steps
Reward signal	Rule-based verifiers or LLM judges on text	Image reward models (GenEval, OCR, PickScore, aesthetic, etc.)
KL regularization	KL penalty added to reward or directly to loss	KL-style regularization is available, but the exact setup depends on the training config
CFG (guidance)	Not applicable	CFG distillation occurs naturally; CFG can be disabled at both train and test time
Advantage estimator	`algorithm.adv_estimator=grpo`	`algorithm.adv_estimator=flow_grpo`
Loss mode	`actor_rollout_ref.actor.policy_loss.loss_mode` not diffusion-specific	`actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_grpo`

Configuration

Diffusion training now uses dedicated diffusion config blocks. In verl_omni/trainer/config/diffusion_trainer.yaml, the main sections are:

algorithm: diffusion-specific advantage computation and normalization
actor_rollout_ref.actor: optimization and diffusion loss settings
actor_rollout_ref.rollout: rollout backend, sampling, and SDE controls
actor_rollout_ref.model: model path plus diffusion-model / LoRA settings
reward: reward manager, reward model, and custom reward function

The default diffusion model YAML mirrors rollout fields (pipeline and algo) into actor_rollout_ref.model.*, so in practice the rollout section is the main place to override sampling behavior.

Core parameters

Algorithm

algorithm.adv_estimator: Set to flow_grpo.

Actor / loss

actor_rollout_ref.actor.diffusion_loss.loss_mode: Set to flow_grpo.
actor_rollout_ref.actor.diffusion_loss.clip_ratio: clipping factor used in the diffusion loss.
actor_rollout_ref.actor.diffusion_loss.adv_clip_max: Maximum absolute advantage used before computing the policy loss.
actor_rollout_ref.actor.use_kl_loss: Enables KL loss against the reference policy.
actor_rollout_ref.actor.kl_loss_coef: Coefficient for the KL term when KL enabled.

Rollout / sampling

actor_rollout_ref.rollout.name: Selects the rollout backend. Currently supports vllm_omni.
actor_rollout_ref.rollout.n: Number of sampled image trajectories per prompt. This is the FlowGRPO group size and should be greater than 1.
actor_rollout_ref.rollout.algo.noise_level: Magnitude of SDE noise injected during rollout. Larger values increase diversity but can hurt image quality.
actor_rollout_ref.rollout.algo.sde_type: SDE variant for rollout. The current example uses sde.
actor_rollout_ref.rollout.algo.sde_window_size: Number of denoising steps included in the active training window. Smaller values reduce training cost.
actor_rollout_ref.rollout.algo.sde_window_range: Range used to sample the start of that active denoising window.
actor_rollout_ref.rollout.pipeline.num_inference_steps: Number of denoising steps used for rollout generation during training.
actor_rollout_ref.rollout.val_kwargs.pipeline.num_inference_steps: Number of denoising steps used during validation / evaluation.
actor_rollout_ref.rollout.pipeline.true_cfg_scale: True classifier-free guidance scale used during rollout. Used in Qwen-Image.
actor_rollout_ref.rollout.pipeline.guidance_scale: Distilled guidance scale for models that expose a guidance embedding; keep null to disable it.

Model

actor_rollout_ref.model.path: Base diffusion model path.
actor_rollout_ref.model.tokenizer_path: Optional tokenizer path if it is not located under the model path.
actor_rollout_ref.model.lora_rank: LoRA rank. Set to a positive integer to enable LoRA fine-tuning (e.g., 64).
actor_rollout_ref.model.lora_alpha: LoRA scaling factor (default 64).
actor_rollout_ref.model.lora_init_weights: LoRA initialization method (default "gaussian").
actor_rollout_ref.model.target_modules: Target modules for LoRA (default "all-linear").
actor_rollout_ref.model.lora_dtype: Optional dtype to convert LoRA parameters to for numerical stability during training (e.g., "fp32", "bf16"). Default null means no conversion.

Batch size

FlowGRPO uses three nested batch-size parameters that operate at different stages of the training loop. They address different concerns (RL sample diversity, multi-epoch reuse, and GPU memory) and must be understood together.

Step 1 — Rollout (data.train_batch_size)

data.train_batch_size is the number of unique prompts drawn from the dataset per training step. Before rollout, each prompt is replicated actor_rollout_ref.rollout.n times so that the rollout engine generates n independent image trajectories per prompt. The in-memory batch after rollout therefore holds train_batch_size × n image samples. GRPO advantage normalization runs over this full batch — it needs all n trajectories for every prompt to compute group-relative rewards before any splitting occurs.

Step 2 — Actor update (actor_rollout_ref.actor.ppo_mini_batch_size)

ppo_mini_batch_size controls how the full post-rollout batch is sliced for actor gradient updates. Important: this value is specified in prompts, not image samples. The trainer internally scales it by rollout.n to get the actual mini-batch size in samples:

effective mini-batch = ppo_mini_batch_size × rollout.n  (image samples)
number of mini-batches per epoch = train_batch_size / ppo_mini_batch_size

All n trajectories belonging to the same prompt are kept in the same mini-batch. This is not optional: although advantages are already computed globally before this split, the gradient update for each image depends on its advantage relative to the other images in its group. Scattering a prompt’s trajectories across different mini-batches would break that correspondence. ppo_mini_batch_size must divide train_batch_size evenly.

Step 3 — FSDP sharding and gradient accumulation (actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu)

Each mini-batch is distributed across GPUs by FSDP data parallelism, so each GPU receives (ppo_mini_batch_size × n) / n_gpus image samples. That per-GPU shard is then chunked into micro-batches of ppo_micro_batch_size_per_gpu for the actual forward/backward passes, with gradients accumulated across chunks before the optimizer step. This is pure gradient accumulation: the effective gradient is identical to running the full per-GPU shard in one shot; only peak activation memory changes.

For diffusion models the accumulation is two-dimensional: the engine also loops over each active denoising timestep inside every micro-batch, so the total gradient accumulation steps per GPU per mini-batch is:

gradient_accumulation_steps = (per_gpu_samples / ppo_micro_batch_size_per_gpu)
                              × sde_window_size

ppo_micro_batch_size_per_gpu must satisfy: (ppo_mini_batch_size × n) / n_gpus is divisible by ppo_micro_batch_size_per_gpu.

Concrete walkthrough (reference OCR script, 4 GPUs, sde_window_size=2):

data.train_batch_size              = 32    # 32 prompts loaded
actor_rollout_ref.rollout.n        = 16    # 16 images generated per prompt
  → post-rollout batch             = 512   # advantage computed over all 512

ppo_mini_batch_size (config)       = 16    # in prompts
  → effective mini-batch           = 16 × 16 = 256 samples
  → mini-batches per epoch         = 512 / 256 = 2 actor gradient steps

FSDP shards 256 samples across 4 GPUs:
  → per-GPU samples                = 256 / 4 = 64

ppo_micro_batch_size_per_gpu       = 16
  → micro-batches per GPU          = 64 / 16 = 4
  → gradient_accumulation_steps    = 4 × 2 (sde_window_size) = 8

Reward

reward.reward_manager.name: Selects the reward manager.
reward.custom_reward_function.path and reward.custom_reward_function.name: Register the task-specific reward post-processing function such as compute_score_ocr.

For an end-to-end OCR training walkthrough, including dataset preparation and the full runnable command, see docs/start/flowgrpo_quickstart.md.

Reference Example

Standard LoRA training with OCR reward (Qwen-Image, 4 GPUs) using the current vllm_omni rollout example:

bash examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora.sh

Variants

Rule-Based Reward Training: JPEG incompressibility

FlowGRPO also supports rule-based rewards that score images directly without a VLM reward model, reusing the default VisualRewardManager from verl_omni/trainer/config/reward/reward.yaml.

verl_omni/utils/reward_score/jpeg_compressibility.py rewards images that are harder to JPEG-compress (richer texture, more complex content). No extra dependencies or reward model process are required.

Minimal dataset row:

{
    "data_source": "jpeg_compressibility",
    "prompt": [{"role": "user", "content": "<your prompt>"}],
    "reward_model": {"ground_truth": ""},  # required by schema, ignored by scorer
}

Config changes relative to the OCR example — remove these lines:

reward.reward_model.enable=True
reward.reward_model.model_path=...
reward.reward_model.rollout.name=...
reward.reward_model.rollout.tensor_model_parallel_size=...
reward.custom_reward_function.path=...
reward.custom_reward_function.name=...

Keep all actor/rollout settings unchanged; the visual reward manager is loaded from the default reward config.

Async Reward

For reward models that are expensive to evaluate (e.g., a VLM judge), the reward model can be allocated its own dedicated GPU resource pool and run asynchronously alongside the policy. This avoids blocking policy training on reward computation.

bash examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_async_reward.sh

Full Model Training

We have provided a script to enable non-cfg full-weight Qwen-Image OCR training. The example is runnable on 4 NVIDIA H200 GPUs; enabling CFG requires more GPU resources.

bash examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr.sh

Sequence parallelism (Ulysses SP)

Ulysses SP is supported for diffusion model training and requires diffusers >= 0.38.0. It shards the sequence dimension across GPUs within a SP group, reducing per-GPU memory for long-sequence and high-resolution training.

actor_rollout_ref.actor.fsdp_config.ulysses_sequence_parallel_size: Number of GPUs in the SP group. Must be a divisor of the total GPU count. Set to 1 (default) to disable SP. Common values: 2, 4, 8.

When SP is enabled, FSDP data parallelism is automatically reduced:

dp_size = total_gpus / ulysses_sequence_parallel_size

For SP training, num_attention_heads must be divisible by ulysses_sequence_parallel_size.

A ready-to-use 4-GPU SP=2 example is provided:

bash examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_sp2.sh

Citation

@article{liu2025flow,
  title={Flow-GRPO: Training Flow Matching Models via Online RL},
  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
  journal={arXiv preprint arXiv:2505.05470},
  year={2025}
}