Quickstart: FlowGRPO training on OCR dataset

Last updated: 06/24/2026

Post-train a diffusion image generation model with FlowGRPO.

Introduction

In this example, we post-train a Stable Diffusion 3.5 Medium policy with FlowGRPO for OCR-style image generation tasks. The rollout uses vllm-omni for multimodal generation, and the reward is computed by a visual generative reward model (Qwen2.5-VL-3B-Instruct in this example) that compares OCR text extracted from generated images against the dataset ground truth.

Prerequisite

Install VeRL-Omni and its dependencies following the installation guide. Also install the FlowGRPO-specific reward dependency:

pip install Levenshtein

Use a machine with 3 GPUs for the provided example script (2 for actor + rollout, 1 for the reward model in its own resource pool).
Run the commands below from the repository root.

Dataset Introduction

We use the OCR dataset from the original Flow-GRPO repository: dataset/ocr. Each sample asks the model to generate an image that contains specific text, and the reward model scores the generated image by reading the rendered text and comparing it with the reference OCR string.

The raw dataset is a plain-text file (train.txt / test.txt) where each line is one generation prompt. The OCR target — the text the model must render in the image — is enclosed in double quotes within the prompt. A few representative samples:

A close-up of a medicine bottle with a clear, red warning label that reads "Take With Food" prominently displayed, set against a neutral background.
A close-up of a robot's chest panel, with a digital display blinking "System Override Active" in red, set against a dimly lit industrial background.
A detailed textbook diagram labeled "Photosynthesis Process", viewed under a high-powered microscope, showcasing the intricate cellular structures and chemical reactions involved.
An ancient, leather-bound wizard's spellbook lies open, revealing a worn, yellowed page. A delicate bookmark rests precisely on "Page 666", casting a subtle glow that illuminates the arcane text.
An astronaut's boot print on the Martian surface, clearly reading "First Steps", surrounded by the red, dusty terrain under a pale, distant sky.

The preprocessing script converts the raw dataset into parquet files that contain:

the raw prompt text for image generation (SD3.5 uses its own CLIP-L/G + T5 text encoders, so no system prompt or chat template is applied),
OCR ground truth stored under reward_model.ground_truth,
auxiliary metadata such as split and sample index.

Step 1: Prepare the dataset

Set the WORKSPACE environment variable to any writable directory you prefer (defaults to $HOME if unset):

export WORKSPACE=${WORKSPACE:-$HOME}

Obtain the raw OCR dataset from the original Flow-GRPO repository and place it under $WORKSPACE/data/ocr. Then preprocess it into train.parquet and test.parquet:

python3 examples/flowgrpo_trainer/data_process/sd3_ocr.py \
  --input_dir $WORKSPACE/data/ocr \
  --output_dir $WORKSPACE/data/ocr/sd3

The command above writes:

$WORKSPACE/data/ocr/sd3/train.parquet
$WORKSPACE/data/ocr/sd3/test.parquet

These parquet files are the inputs consumed by the FlowGRPO training script.

Preparing a custom dataset

To train on your own OCR-style data, create train.txt and test.txt following the same one-prompt-per-line convention. Each prompt must contain the target OCR string enclosed in double quotes — the preprocessing script extracts the text between the first pair of quotes as the ground truth. For example:

A vintage storefront sign above the door reads "Open 24 Hours" in bold neon letters.
A handwritten sticky note on a refrigerator says "Buy milk" in blue ink.

Place the files in $WORKSPACE/data/ocr/ (or any directory you prefer) and run the same preprocessing command, adjusting --input_dir and --output_dir as needed:

python3 examples/flowgrpo_trainer/data_process/sd3_ocr.py \
  --input_dir $WORKSPACE/data/ocr \
  --output_dir $WORKSPACE/data/ocr/sd3

For datasets with a different ground-truth extraction scheme (e.g. a CSV with an explicit label column), modify extract_solution and the process_fn function in examples/flowgrpo_trainer/data_process/sd3_ocr.py to match your format, then re-run the script to regenerate the parquet files.

Step 2: Obtain models for RL training

In this example, we train stabilityai/stable-diffusion-3.5-medium with LoRA and use Qwen/Qwen2.5-VL-3B-Instruct as the OCR reward model.

Policy model (SD3.5 Medium): the script uses the Hugging Face Hub ID stabilityai/stable-diffusion-3.5-medium directly — no manual download is required. Hugging Face will cache the weights automatically on first run. To use a local copy instead, edit the model_name variable in the script directly.

Reward model (Qwen2.5-VL-3B-Instruct): the script defaults to the Hugging Face Hub ID Qwen/Qwen2.5-VL-3B-Instruct, so no manual download is required — Hugging Face will cache it automatically on first run. To use a local copy instead, edit the reward_model_name variable in the script directly.

Custom chat template: Since SD3.5 runs its own CLIP-L/G + T5 text encoders on the raw prompt text, the script sets a minimal chat template that extracts only the user message content:

custom_chat_template='{% for message in messages %}{% if message['\''role'\''] == '\''user'\'' %}{{ message['\''content'\''] }}{% endif %}{% endfor %}'

The run script exposes the following environment variables:

WORKSPACE              # base directory for data (default: $HOME)

Step 3: Perform FlowGRPO training

The provided example script launches python3 -m verl_omni.trainer.main_diffusion with the FlowGRPO-specific config needed for this OCR task:

algorithm.adv_estimator=flow_grpo
actor_rollout_ref.rollout.name=vllm_omni
reward.custom_reward_function.path=verl_omni/utils/reward_score/genrm_ocr.py
reward.custom_reward_function.name=compute_score_ocr
LoRA fine-tuning on stabilityai/stable-diffusion-3.5-medium
a single-node layout: 2 GPUs for actor + rollout, 1 GPU for the reward model in its own dedicated resource pool
image resolution 384×384, 10 inference steps per rollout sample

Run the training script:

bash examples/flowgrpo_trainer/sd35/run_sd35_medium_ocr_lora.sh

Optional KL loss tuning:

actor_rollout_ref.actor.use_kl_loss=True
actor_rollout_ref.actor.kl_loss_coef=0.001

The script uses $WORKSPACE (default: $HOME) as the base directory. Override any path via the environment variables described in Step 2, or set WORKSPACE to point to a volume with enough free space before launching.

You are expected to see training, validation, actor, critic, and reward metrics logged through the configured backends. By default, checkpoints are saved under:

checkpoints/${trainer.project_name}/${trainer.experiment_name}

FAQ: tuning OOM-related parameters

OOM location	First parameter to tune	What it changes
Rollout generation OOM	Increase `ROLLOUT_TP`	Sets `actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP` and reduces `actor_rollout_ref.rollout.agent.num_workers` to `NUM_GPUS / ROLLOUT_TP`. This shards the rollout model and lowers rollout request concurrency.
Reward-model OOM	Increase `REWARD_TP`	Sets `reward.reward_model.rollout.tensor_model_parallel_size=$REWARD_TP` and reduces `reward.num_workers` to `NUM_GPUS / REWARD_TP`. This shards the reward model and lowers reward request concurrency.
Actor loss forward/backward OOM	Decrease `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`	Splits each actor mini-batch into smaller per-GPU chunks and accumulates gradients across chunks. This lowers activation memory without changing the effective batch size, but can reduce throughput.
Old log-prob recomputation OOM	Decrease `actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu`	Splits actor-side log-prob inference into smaller per-GPU chunks during the `old_log_prob` step.
Reference log-prob OOM	Decrease `actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu`	Splits reference-policy log-prob inference into smaller per-GPU chunks. In LoRA runs where the reference is served by the actor with adapters disabled, the actor log-prob path is used instead.

If rollout OOM persists after increasing ROLLOUT_TP, reduce memory-heavy rollout settings such as actor_rollout_ref.rollout.n, image height / width, or actor_rollout_ref.rollout.pipeline.max_sequence_length. If reward-model OOM persists after increasing REWARD_TP, consider the async reward script, which places the reward model on its own resource pool via reward.reward_model.enable_resource_pool=True.

FAQ: common errors

Error	Fix	What it changes
`Attention backend mismatch`	Set `actor_rollout_ref.rollout.rollout_attn_backend=TORCH_SDPA`	When FA3 is unavailable, the trainer falls back `attn_backend` to `native` automatically, but `rollout_attn_backend` stays at `FLASH_ATTN` — raising a mismatch error. Set the rollout backend explicitly to `TORCH_SDPA`, or export `DIFFUSION_ATTENTION_BACKEND=TORCH_SDPA` before launching.

Wandb logging

The provided script already enables:

trainer.logger='["console", "wandb"]' \
trainer.project_name=flow_grpo \
trainer.experiment_name=sd35_medium_ocr_lora

Set your W&B credentials before launching if you want remote tracking:

export WANDB_API_KEY=<your_wandb_api_key>

You can also override trainer.project_name and trainer.experiment_name from the command line to organize runs under your own project names.