Quickstart: FlowGRPO training on OCR dataset
Last updated: 06/24/2026
Post-train a diffusion image generation model with FlowGRPO.
Introduction
In this example, we post-train a Stable Diffusion 3.5 Medium policy with FlowGRPO for OCR-style image generation tasks. The rollout uses vllm-omni for multimodal generation, and the reward is computed by a visual generative reward model (Qwen2.5-VL-3B-Instruct in this example) that compares OCR text extracted from generated images against the dataset ground truth.
Prerequisite
Install VeRL-Omni and its dependencies following the installation guide. Also install the FlowGRPO-specific reward dependency:
pip install Levenshtein
Use a machine with
3GPUs for the provided example script (2for actor + rollout,1for the reward model in its own resource pool).Run the commands below from the repository root.
Dataset Introduction
We use the OCR dataset from the original Flow-GRPO repository: dataset/ocr. Each sample asks the model to generate an image that contains specific text, and the reward model scores the generated image by reading the rendered text and comparing it with the reference OCR string.
The raw dataset is a plain-text file (train.txt / test.txt) where each line is one generation prompt. The OCR target — the text the model must render in the image — is enclosed in double quotes within the prompt. A few representative samples:
A close-up of a medicine bottle with a clear, red warning label that reads "Take With Food" prominently displayed, set against a neutral background.
A close-up of a robot's chest panel, with a digital display blinking "System Override Active" in red, set against a dimly lit industrial background.
A detailed textbook diagram labeled "Photosynthesis Process", viewed under a high-powered microscope, showcasing the intricate cellular structures and chemical reactions involved.
An ancient, leather-bound wizard's spellbook lies open, revealing a worn, yellowed page. A delicate bookmark rests precisely on "Page 666", casting a subtle glow that illuminates the arcane text.
An astronaut's boot print on the Martian surface, clearly reading "First Steps", surrounded by the red, dusty terrain under a pale, distant sky.
The preprocessing script converts the raw dataset into parquet files that contain:
the raw prompt text for image generation (SD3.5 uses its own CLIP-L/G + T5 text encoders, so no system prompt or chat template is applied),
OCR ground truth stored under
reward_model.ground_truth,auxiliary metadata such as split and sample index.
Step 1: Prepare the dataset
Set the WORKSPACE environment variable to any writable directory you prefer (defaults to $HOME if unset):
export WORKSPACE=${WORKSPACE:-$HOME}
Obtain the raw OCR dataset from the original Flow-GRPO repository and place it under $WORKSPACE/data/ocr. Then preprocess it into train.parquet and test.parquet:
python3 examples/flowgrpo_trainer/data_process/sd3_ocr.py \
--input_dir $WORKSPACE/data/ocr \
--output_dir $WORKSPACE/data/ocr/sd3
The command above writes:
$WORKSPACE/data/ocr/sd3/train.parquet$WORKSPACE/data/ocr/sd3/test.parquet
These parquet files are the inputs consumed by the FlowGRPO training script.
Preparing a custom dataset
To train on your own OCR-style data, create train.txt and test.txt following the same one-prompt-per-line convention. Each prompt must contain the target OCR string enclosed in double quotes — the preprocessing script extracts the text between the first pair of quotes as the ground truth. For example:
A vintage storefront sign above the door reads "Open 24 Hours" in bold neon letters.
A handwritten sticky note on a refrigerator says "Buy milk" in blue ink.
Place the files in $WORKSPACE/data/ocr/ (or any directory you prefer) and run the same preprocessing command, adjusting --input_dir and --output_dir as needed:
python3 examples/flowgrpo_trainer/data_process/sd3_ocr.py \
--input_dir $WORKSPACE/data/ocr \
--output_dir $WORKSPACE/data/ocr/sd3
For datasets with a different ground-truth extraction scheme (e.g. a CSV with an explicit label column), modify extract_solution and the process_fn function in examples/flowgrpo_trainer/data_process/sd3_ocr.py to match your format, then re-run the script to regenerate the parquet files.
Step 2: Obtain models for RL training
In this example, we train stabilityai/stable-diffusion-3.5-medium with LoRA and use Qwen/Qwen2.5-VL-3B-Instruct as the OCR reward model.
Policy model (SD3.5 Medium): the script uses the Hugging Face Hub ID stabilityai/stable-diffusion-3.5-medium directly — no manual download is required. Hugging Face will cache the weights automatically on first run. To use a local copy instead, edit the model_name variable in the script directly.
Reward model (Qwen2.5-VL-3B-Instruct): the script defaults to the Hugging Face Hub ID Qwen/Qwen2.5-VL-3B-Instruct, so no manual download is required — Hugging Face will cache it automatically on first run. To use a local copy instead, edit the reward_model_name variable in the script directly.
Custom chat template: Since SD3.5 runs its own CLIP-L/G + T5 text encoders on the raw prompt text, the script sets a minimal chat template that extracts only the user message content:
custom_chat_template='{% for message in messages %}{% if message['\''role'\''] == '\''user'\'' %}{{ message['\''content'\''] }}{% endif %}{% endfor %}'
The run script exposes the following environment variables:
WORKSPACE # base directory for data (default: $HOME)
Step 3: Perform FlowGRPO training
The provided example script launches python3 -m verl_omni.trainer.main_diffusion with the FlowGRPO-specific config needed for this OCR task:
algorithm.adv_estimator=flow_grpoactor_rollout_ref.rollout.name=vllm_omnireward.custom_reward_function.path=verl_omni/utils/reward_score/genrm_ocr.pyreward.custom_reward_function.name=compute_score_ocrLoRA fine-tuning on
stabilityai/stable-diffusion-3.5-mediuma single-node layout:
2GPUs for actor + rollout,1GPU for the reward model in its own dedicated resource poolimage resolution
384×384,10inference steps per rollout sample
Run the training script:
bash examples/flowgrpo_trainer/sd35/run_sd35_medium_ocr_lora.sh
Optional KL loss tuning:
actor_rollout_ref.actor.use_kl_loss=Trueactor_rollout_ref.actor.kl_loss_coef=0.001
The script uses $WORKSPACE (default: $HOME) as the base directory. Override any path via the environment variables described in Step 2, or set WORKSPACE to point to a volume with enough free space before launching.
You are expected to see training, validation, actor, critic, and reward metrics logged through the configured backends. By default, checkpoints are saved under:
checkpoints/${trainer.project_name}/${trainer.experiment_name}
FAQ: common errors
Error |
Fix |
What it changes |
|---|---|---|
|
Set |
When FA3 is unavailable, the trainer falls back |
Wandb logging
The provided script already enables:
trainer.logger='["console", "wandb"]' \
trainer.project_name=flow_grpo \
trainer.experiment_name=sd35_medium_ocr_lora
Set your W&B credentials before launching if you want remote tracking:
export WANDB_API_KEY=<your_wandb_api_key>
You can also override trainer.project_name and trainer.experiment_name from the command line to organize runs under your own project names.
Further reading
For the algorithm background, detailed configuration notes, async reward, and rule-based reward training (e.g. JPEG incompressibility), see:
To scale training across multiple nodes, follow the multi-node guide.