# Flow-GRPO

Last updated: 05/13/2026.

Flow-GRPO ([paper](https://arxiv.org/abs/2505.05470), [code](https://github.com/yifan123/flow_grpo)) is the first method to integrate online policy gradient reinforcement learning into **flow matching** generative models (e.g., Stable Diffusion 3, FLUX). It enables direct reward optimization for tasks such as compositional text-to-image generation, visual text rendering, and human preference alignment, without modifying the standard inference pipeline.

Two core technical contributions make this possible:

1. **ODE-to-SDE Conversion**: Flow matching models natively use a deterministic ODE sampler. Flow-GRPO converts this ODE into an equivalent SDE that preserves the model's marginal distribution at every timestep. This introduces the stochasticity required for group sampling and RL exploration.

2. **Denoising Reduction**: Training on all denoising steps is expensive. Flow-GRPO reduces the number of *training* steps while keeping the original number of *inference* steps, significantly improving sampling efficiency without sacrificing reward performance.
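To make the ODE-to-SDE conversion in item 1 concrete, here is a minimal scalar sketch of a single stochastic denoising step. It assumes the Euler-Maruyama update described in the Flow-GRPO paper, with noise scale `sigma_t = noise_level * sqrt(t / (1 - t))`. The function names `sde_step` and `gaussian_log_prob` are hypothetical and are not the verl_omni implementation, which operates on latent tensors.

```python
import math


def sde_step(x, v, t, dt, noise_level, eps):
    """One Euler-Maruyama step of the marginal-preserving SDE (scalar sketch).

    x: current sample, v: predicted velocity, t: current time in (0, 1),
    dt: signed step size (negative when denoising from t=1 toward t=0),
    eps: a standard normal draw supplied by the caller.
    Returns (next sample, Gaussian mean, Gaussian std).
    """
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    # Assumed noise schedule: sigma_t = a * sqrt(t / (1 - t)).
    sigma_t = noise_level * math.sqrt(t / (1.0 - t))
    # Drift = ODE velocity plus a correction that keeps the marginals unchanged.
    drift = v + (sigma_t ** 2) / (2.0 * t) * (x + (1.0 - t) * v)
    mean = x + drift * dt
    std = sigma_t * math.sqrt(abs(dt))
    return mean + std * eps, mean, std


def gaussian_log_prob(x_next, mean, std):
    """Log-density of x_next under N(mean, std^2); requires std > 0."""
    if std <= 0.0:
        raise ValueError("std must be positive (noise_level > 0)")
    return (
        -((x_next - mean) ** 2) / (2.0 * std ** 2)
        - math.log(std)
        - 0.5 * math.log(2.0 * math.pi)
    )
```

With `noise_level=0` the step reduces to the deterministic Euler update `x + v * dt`, which is the original ODE sampler. With `noise_level > 0` each step is a Gaussian transition, so its log-probability is available in closed form for the policy-gradient ratio.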

Empirically, tuning SD3.5-M with Flow-GRPO raises GenEval accuracy from 63% to 95% and visual text rendering accuracy from 59% to 92%.

## Key Components

- **Flow Matching Backbone**: operates on continuous-time flow matching models (e.g., SD3.5, FLUX) rather than discrete-token LLMs.
- **ODE-to-SDE Rollout**: generates a group of diverse image trajectories by injecting controlled noise via SDE sampling at selected denoising steps.
- **Denoising Reduction**: trains on a reduced subset of denoising steps (configurable via `sde_window_size` and `sde_window_range`) while inference uses the full step count.
- **Image Reward Models**: rewards are assigned by external reward models (e.g., GenEval, OCR, PickScore, aesthetic score) rather than rule-based verifiers.
- **No Critic**: like GRPO for LLMs, no separate value network is trained; advantages are computed from group-relative rewards.
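The critic-free advantage in the last item can be sketched as follows. This is a minimal illustration, assuming rewards are standardized within each prompt group using the population standard deviation plus a small `eps`. The exact normalization in `flow_grpo` may differ, and `group_relative_advantages` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards, prompt_ids, eps=1e-6):
    """Standardize each reward against the other samples of the same prompt."""
    groups = defaultdict(list)
    for reward, pid in zip(rewards, prompt_ids):
        groups[pid].append(reward)
    # Per-group mean and (population) standard deviation; an assumption.
    stats = {pid: (mean(rs), pstdev(rs)) for pid, rs in groups.items()}
    return [
        (reward - stats[pid][0]) / (stats[pid][1] + eps)
        for reward, pid in zip(rewards, prompt_ids)
    ]


# Two prompts, two images each: the better image in each group gets a
# positive advantage, the worse one a negative advantage.
advantages = group_relative_advantages(
    rewards=[1.0, 0.0, 0.8, 0.2],
    prompt_ids=["p0", "p0", "p1", "p1"],
)
```

Because the baseline is the group mean, a group whose images all receive the same reward contributes zero advantage, which is one reason the group size `n` should be greater than `1`.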

## Key Differences: GRPO vs. Flow-GRPO

| Dimension | GRPO (LLM) | Flow-GRPO (Diffusion) |
|---|---|---|
| **Model type** | Autoregressive language model | Flow matching / diffusion model |
| **Action space** | Discrete token sequences | Continuous denoising trajectories (SDE paths) |
| **Rollout mechanism** | Sample `n` token sequences per prompt | Convert ODE to SDE; sample `n` image trajectories per prompt via stochastic denoising |
| **Log-probability** | Standard next-token log-prob | Log-prob of the SDE noise prediction at each selected denoising step |
| **Training steps** | No step subsampling: every generated token is trained on | Denoising Reduction: train on a small window of steps, infer with full steps |
| **Reward signal** | Rule-based verifiers or LLM judges on text | Image reward models (GenEval, OCR, PickScore, aesthetic, etc.) |
| **KL regularization** | KL penalty added to reward or directly to loss | Optional KL loss against the reference policy (`actor_rollout_ref.actor.use_kl_loss`, `actor_rollout_ref.actor.kl_loss_coef`) |
| **CFG (guidance)** | Not applicable | CFG distillation occurs naturally; CFG can be disabled at both train and test time |
| **Advantage estimator** | `algorithm.adv_estimator=grpo` | `algorithm.adv_estimator=flow_grpo` |
| **Loss mode** | `actor_rollout_ref.actor.policy_loss.loss_mode` (not diffusion-specific) | `actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_grpo` |

## Configuration

Diffusion training now uses dedicated diffusion config blocks. In `verl_omni/trainer/config/diffusion_trainer.yaml`,
the main sections are:

- `algorithm`: diffusion-specific advantage computation and normalization
- `actor_rollout_ref.actor`: optimization and diffusion loss settings
- `actor_rollout_ref.rollout`: rollout backend, sampling, and SDE controls
- `actor_rollout_ref.model`: model path plus diffusion-model / LoRA settings
- `reward`: reward manager, reward model, and custom reward function

The default diffusion model YAML mirrors rollout fields (`pipeline` and `algo`) into `actor_rollout_ref.model.*`, so in practice
the rollout section is the main place to override sampling behavior.

### Core parameters

#### Algorithm

- `algorithm.adv_estimator`: Set to `flow_grpo`.

#### Actor / loss

- `actor_rollout_ref.actor.diffusion_loss.loss_mode`: Set to `flow_grpo`.

- `actor_rollout_ref.actor.diffusion_loss.clip_ratio`: Clipping
  factor used in the diffusion loss.

- `actor_rollout_ref.actor.diffusion_loss.adv_clip_max`: Maximum absolute
  advantage used before computing the policy loss.

- `actor_rollout_ref.actor.use_kl_loss`: Enables KL loss against the reference
  policy.

- `actor_rollout_ref.actor.kl_loss_coef`: Coefficient for the KL term when KL loss is enabled.
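How `clip_ratio` and `adv_clip_max` interact can be sketched per sample as below. This assumes a standard PPO-style clipped surrogate applied at one denoising step. The function name `clipped_policy_loss` is hypothetical, and the actual `flow_grpo` loss may differ in details such as reduction and the KL term.

```python
import math


def clipped_policy_loss(log_prob, old_log_prob, advantage, clip_ratio, adv_clip_max):
    """PPO-style clipped loss for one sample at one denoising step (sketch)."""
    # Bound the advantage magnitude first (role of adv_clip_max).
    adv = max(-adv_clip_max, min(adv_clip_max, advantage))
    # Importance ratio between the current and the rollout-time policy.
    ratio = math.exp(log_prob - old_log_prob)
    # Keep the ratio inside [1 - clip_ratio, 1 + clip_ratio].
    clipped_ratio = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    # Pessimistic choice: take the larger of the two losses.
    return max(-adv * ratio, -adv * clipped_ratio)
```

When the policy has not moved since rollout the ratio is `1` and the loss is simply the negated, clipped advantage. Once the ratio leaves the clipping interval in the direction favored by the advantage, the loss stops rewarding further movement.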

#### Rollout / sampling

- `actor_rollout_ref.rollout.name`: Selects the rollout backend. Currently supports `vllm_omni`.

- `actor_rollout_ref.rollout.n`: Number of sampled image trajectories per
  prompt. This is the Flow-GRPO group size and should be greater than `1`.

- `actor_rollout_ref.rollout.algo.noise_level`: Magnitude of SDE noise injected
  during rollout. Larger values increase diversity but can hurt image quality.

- `actor_rollout_ref.rollout.algo.sde_type`: SDE variant for rollout. The
  current example uses `sde`.

- `actor_rollout_ref.rollout.algo.sde_window_size`: Number of denoising steps
  included in the active training window. Smaller values reduce training cost.

- `actor_rollout_ref.rollout.algo.sde_window_range`: Range used to sample the
  start of that active denoising window.

- `actor_rollout_ref.rollout.pipeline.num_inference_steps`: Number of denoising steps
  used for rollout generation during training.

- `actor_rollout_ref.rollout.val_kwargs.pipeline.num_inference_steps`: Number of
  denoising steps used during validation / evaluation.

- `actor_rollout_ref.rollout.pipeline.true_cfg_scale`: True classifier-free guidance
  scale used during rollout. Used in `Qwen-Image`.

- `actor_rollout_ref.rollout.pipeline.guidance_scale`: Distilled guidance scale for
  models that expose a guidance embedding; keep `null` to disable it.
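The relationship between `sde_window_size` and `sde_window_range` can be illustrated with a small sketch. It assumes the range is a half-open interval `[lo, hi)` of denoising step indices and that the window start is drawn uniformly so the whole window fits inside it. These semantics and the helper name `sample_sde_window` are assumptions for illustration, so check the rollout implementation for the exact behavior.

```python
import random


def sample_sde_window(sde_window_size, sde_window_range, rng=random):
    """Pick the contiguous denoising steps that use SDE sampling (sketch)."""
    lo, hi = sde_window_range
    if sde_window_size < 1 or lo + sde_window_size > hi:
        raise ValueError("window does not fit inside sde_window_range")
    # Assumed: the start index is uniform over all positions where the window fits.
    start = rng.randint(lo, hi - sde_window_size)
    return list(range(start, start + sde_window_size))


# Example: a 2-step window placed somewhere in the first 5 denoising steps.
active_steps = sample_sde_window(2, (0, 5), rng=random.Random(0))
```

Steps outside the returned window would use the deterministic sampler and contribute no training signal, which is where the cost reduction comes from.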

#### Model

- `actor_rollout_ref.model.path`: Base diffusion model path.

- `actor_rollout_ref.model.tokenizer_path`: Optional tokenizer path if it is
  not located under the model path.

- `actor_rollout_ref.model.lora_rank`: LoRA rank. Set to a positive integer
  to enable LoRA fine-tuning (e.g., `64`).

- `actor_rollout_ref.model.lora_alpha`: LoRA scaling factor (default `64`).

- `actor_rollout_ref.model.lora_init_weights`: LoRA initialization method
  (default `"gaussian"`).

- `actor_rollout_ref.model.target_modules`: Target modules for LoRA (default
  `"all-linear"`).

- `actor_rollout_ref.model.lora_dtype`: Optional dtype to convert LoRA
  parameters to for numerical stability during training (e.g., `"fp32"`,
  `"bf16"`). Default `null` means no conversion.

#### Batch size

Flow-GRPO uses three nested batch-size parameters that operate at different
stages of the training loop. They address different concerns (RL sample
diversity, multi-epoch reuse, and GPU memory) and must be understood together.

**Step 1 — Rollout (`data.train_batch_size`)**

`data.train_batch_size` is the number of **unique prompts** drawn from the
dataset per training step. Before rollout, each prompt is replicated
`actor_rollout_ref.rollout.n` times so that the rollout engine generates `n`
independent image trajectories per prompt. The in-memory batch after rollout
therefore holds `train_batch_size × n` image samples. GRPO advantage
normalization runs over this **full** batch — it needs all `n` trajectories
for every prompt to compute group-relative rewards before any splitting occurs.

**Step 2 — Actor update (`actor_rollout_ref.actor.ppo_mini_batch_size`)**

`ppo_mini_batch_size` controls how the full post-rollout batch is sliced for
actor gradient updates. **Important:** this value is specified in **prompts**,
not image samples. The trainer internally scales it by `rollout.n` to get
the actual mini-batch size in samples:

```
effective mini-batch = ppo_mini_batch_size × rollout.n  (image samples)
number of mini-batches per epoch = train_batch_size / ppo_mini_batch_size
```

All `n` trajectories belonging to the same prompt are kept in the same
mini-batch. This is not optional: although advantages are already computed
globally before this split, the gradient update for each image depends on its
advantage relative to the other images in its group. Scattering a prompt's
trajectories across different mini-batches would break that correspondence.
`ppo_mini_batch_size` must divide `train_batch_size` evenly.

**Step 3 — FSDP sharding and gradient accumulation
(`actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`)**

Each mini-batch is distributed across GPUs by FSDP data parallelism, so each
GPU receives `(ppo_mini_batch_size × n) / n_gpus` image samples. That
per-GPU shard is then **chunked into micro-batches** of
`ppo_micro_batch_size_per_gpu` for the actual forward/backward passes, with
gradients accumulated across chunks before the optimizer step. This is pure
gradient accumulation: the effective gradient is identical to running the full
per-GPU shard in one shot; only peak activation memory changes.

For diffusion models the accumulation is two-dimensional: the engine also
loops over each active denoising timestep inside every micro-batch, so the
total gradient accumulation steps per GPU per mini-batch is:

```
gradient_accumulation_steps = (per_gpu_samples / ppo_micro_batch_size_per_gpu)
                              × sde_window_size
```

`ppo_micro_batch_size_per_gpu` must evenly divide the per-GPU shard size,
`(ppo_mini_batch_size × n) / n_gpus`.

**Concrete walkthrough** (reference OCR script, 4 GPUs, `sde_window_size=2`):

```
data.train_batch_size              = 32    # 32 prompts loaded
actor_rollout_ref.rollout.n        = 16    # 16 images generated per prompt
  → post-rollout batch             = 512   # advantage computed over all 512

ppo_mini_batch_size (config)       = 16    # in prompts
  → effective mini-batch           = 16 × 16 = 256 samples
  → mini-batches per epoch         = 512 / 256 = 2 actor gradient steps

FSDP shards 256 samples across 4 GPUs:
  → per-GPU samples                = 256 / 4 = 64

ppo_micro_batch_size_per_gpu       = 16
  → micro-batches per GPU          = 64 / 16 = 4
  → gradient_accumulation_steps    = 4 × 2 (sde_window_size) = 8
```
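The arithmetic in the walkthrough, together with the divisibility constraints from Steps 2 and 3, can be checked with a small helper before launching a run. This is a sketch only: `flow_grpo_batch_plan` is a hypothetical function, not part of the trainer, and it encodes only the rules stated in this section.

```python
def flow_grpo_batch_plan(
    train_batch_size,
    rollout_n,
    ppo_mini_batch_size,
    n_gpus,
    ppo_micro_batch_size_per_gpu,
    sde_window_size,
):
    """Derive batch quantities and validate the divisibility constraints."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("ppo_mini_batch_size must divide train_batch_size")
    effective_mini_batch = ppo_mini_batch_size * rollout_n  # image samples
    if effective_mini_batch % n_gpus != 0:
        raise ValueError("mini-batch samples must split evenly across GPUs")
    per_gpu_samples = effective_mini_batch // n_gpus
    if per_gpu_samples % ppo_micro_batch_size_per_gpu != 0:
        raise ValueError("ppo_micro_batch_size_per_gpu must divide the per-GPU shard")
    micro_batches_per_gpu = per_gpu_samples // ppo_micro_batch_size_per_gpu
    return {
        "post_rollout_batch": train_batch_size * rollout_n,
        "effective_mini_batch": effective_mini_batch,
        "mini_batches_per_epoch": train_batch_size // ppo_mini_batch_size,
        "per_gpu_samples": per_gpu_samples,
        "micro_batches_per_gpu": micro_batches_per_gpu,
        "gradient_accumulation_steps": micro_batches_per_gpu * sde_window_size,
    }


# Values from the walkthrough above.
plan = flow_grpo_batch_plan(
    train_batch_size=32,
    rollout_n=16,
    ppo_mini_batch_size=16,
    n_gpus=4,
    ppo_micro_batch_size_per_gpu=16,
    sde_window_size=2,
)
```

For the walkthrough values this reproduces the numbers shown: 512 post-rollout samples, 2 mini-batches per epoch, 64 samples per GPU, and 8 gradient accumulation steps.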

#### Reward

- `reward.reward_manager.name`: Selects the reward manager.

- `reward.custom_reward_function.path` and
  `reward.custom_reward_function.name`: Register the task-specific reward
  post-processing function such as `compute_score_ocr`.

For an end-to-end OCR training walkthrough, including dataset preparation and
the full runnable command, see `docs/start/flowgrpo_quickstart.md`.


## Reference Example

Standard LoRA training with OCR reward (Qwen-Image, 4 GPUs) using the current
`vllm_omni` rollout example:

```bash
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh
```

## Variants

### Rule-Based Reward Training: JPEG incompressibility

Flow-GRPO also supports rule-based rewards that score images directly without a
VLM reward model, reusing the default `VisualRewardManager` from
`verl_omni/trainer/config/reward/reward.yaml`.

`verl_omni/utils/reward_score/jpeg_compressibility.py` rewards images that are
harder to JPEG-compress (richer texture, more complex content). No extra
dependencies or reward model process are required.

Minimal dataset row:

```python
{
    "data_source": "jpeg_compressibility",
    "prompt": [{"role": "user", "content": "<your prompt>"}],
    "reward_model": {"ground_truth": ""},  # required by schema, ignored by scorer
}
```
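Rows of this shape can be generated from a plain list of prompts. The sketch below only builds the in-memory dictionaries. The helper name `make_jpeg_row` and the example prompts are hypothetical, and the on-disk format expected by the data loader is not covered here.

```python
def make_jpeg_row(prompt_text):
    """Build one dataset row for the JPEG-compressibility reward."""
    return {
        "data_source": "jpeg_compressibility",
        "prompt": [{"role": "user", "content": prompt_text}],
        # Required by the schema but ignored by the scorer.
        "reward_model": {"ground_truth": ""},
    }


# Illustrative prompts only.
prompts = ["a red fox in tall grass", "a city skyline at dusk"]
rows = [make_jpeg_row(p) for p in prompts]
```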

Config changes relative to the OCR example — **remove** these lines:

```bash
reward.reward_model.enable=True
reward.reward_model.model_path=...
reward.reward_model.rollout.name=...
reward.reward_model.rollout.tensor_model_parallel_size=...
reward.custom_reward_function.path=...
reward.custom_reward_function.name=...
```

Keep all actor/rollout settings unchanged; the visual reward manager is loaded
from the default reward config.

### Async Reward


For reward models that are expensive to evaluate (e.g., a VLM judge), the reward model can be allocated its own dedicated GPU resource pool and run asynchronously alongside the policy. This avoids blocking policy training on reward computation.

```bash
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_async_reward.sh
```

### Full Model Training

A script is provided for full-weight (non-LoRA) Qwen-Image OCR training with CFG disabled. The example runs on 4 NVIDIA H200 GPUs; enabling CFG requires more GPU resources.

```bash
bash examples/flowgrpo_trainer/run_qwen_image_ocr.sh
```


### Sequence parallelism (Ulysses SP)

Ulysses SP is supported for diffusion model training and requires `diffusers` >= 0.38.0.
It shards the sequence dimension across GPUs within a SP group,
reducing per-GPU memory for long-sequence and high-resolution training.

- `actor_rollout_ref.actor.fsdp_config.ulysses_sequence_parallel_size`: Number
  of GPUs in the SP group. Must be a divisor of the total GPU count. Set to `1`
  (default) to disable SP. Common values: `2`, `4`, `8`.

When SP is enabled, FSDP data parallelism is automatically reduced:
```
dp_size = total_gpus / ulysses_sequence_parallel_size
```

For SP training, `num_attention_heads` must be divisible by
`ulysses_sequence_parallel_size`.

A ready-to-use 4-GPU SP=2 example is provided:
```bash
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_sp2.sh
```
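The two SP constraints above (the SP size divides the GPU count and the number of attention heads) and the resulting data-parallel size can be checked up front. This is a minimal sketch: `check_ulysses_sp` is a hypothetical helper, and the head count in the usage line is an illustrative value rather than one read from a model config.

```python
def check_ulysses_sp(total_gpus, ulysses_sequence_parallel_size, num_attention_heads):
    """Validate the SP constraints and return the FSDP data-parallel size."""
    sp = ulysses_sequence_parallel_size
    if sp < 1:
        raise ValueError("ulysses_sequence_parallel_size must be >= 1")
    if total_gpus % sp != 0:
        raise ValueError("SP size must be a divisor of the total GPU count")
    if num_attention_heads % sp != 0:
        raise ValueError("num_attention_heads must be divisible by the SP size")
    # dp_size = total_gpus / ulysses_sequence_parallel_size
    return total_gpus // sp


# 4 GPUs with SP=2 leaves 2 data-parallel replicas (head count is illustrative).
dp_size = check_ulysses_sp(total_gpus=4, ulysses_sequence_parallel_size=2, num_attention_heads=24)
```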


## Citation

```bibtex
@article{liu2025flow,
  title={Flow-GRPO: Training Flow Matching Models via Online RL},
  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
  journal={arXiv preprint arXiv:2505.05470},
  year={2025}
}
```