Performance Reference

Last updated: 06/17/2026

Below are reference benchmark results for VeRL-Omni training runs.

FlowGRPO: LoRA Training on Qwen-Image OCR

All experiments used NVIDIA H800 GPUs, LoRA rank 64, ppo_micro_batch_size_per_gpu 16, and the full 1k validation set. Training images per step = batch size × images per prompt = 32 × 16 = 512.

Experiment Settings and Throughput

Script	# GPUs	# GPUs for Actor	# GPUs for Rollout	# GPUs for Async Reward	Batch Size	Images per Prompt	LR	Throughput (images/GPU/s)	Time per Step (s)
`run_qwen_image_ocr_lora.sh`	4	4	4	0 (sync)	32	16	3e-4	0.305	420
`run_qwen_image_ocr_lora_async_reward.sh`	5	4	4	1	32	16	3e-4	0.280	360
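
The throughput column can be approximately reproduced from the other columns. The following is a minimal sketch, assuming throughput is images per step divided by (total GPUs × time per step); the source does not state how the figures were measured, and the function name `images_per_gpu_second` is hypothetical, not part of VeRL-Omni.

```python
def images_per_gpu_second(batch_size, images_per_prompt, num_gpus, step_seconds):
    """Images processed per GPU per second for one training step."""
    images_per_step = batch_size * images_per_prompt  # 32 * 16 = 512 in these runs
    return images_per_step / (num_gpus * step_seconds)


# Values taken from the table above.
sync_run = images_per_gpu_second(32, 16, num_gpus=4, step_seconds=420)
async_run = images_per_gpu_second(32, 16, num_gpus=5, step_seconds=360)

print(f"sync:  {sync_run:.3f} images/GPU/s")
print(f"async: {async_run:.3f} images/GPU/s")
```

Under this assumption the sync run comes out at about 0.305, matching the table, and the async run at about 0.284, slightly above the reported 0.280. The async run finishes each step faster (360 s vs 420 s), but its per-GPU throughput is lower because the dedicated reward GPU counts in the denominator.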

Training - Zero Standard Deviation Ratio and Reward Curve

qwen_image_ocr_lora: sync reward, 4 GPUs (run_qwen_image_ocr_lora.sh)
qwen_image_ocr_lora_async_reward: async reward on a dedicated 5th GPU (run_qwen_image_ocr_lora_async_reward.sh)
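
The zero standard deviation ratio plotted in this section is not defined in the source. The following is a minimal sketch, assuming it is the fraction of prompts whose sampled images all received the same reward, so that a group-normalized advantage carries no learning signal. The function `zero_std_ratio` and the sample rewards are illustrative, not VeRL-Omni code.

```python
from statistics import pstdev


def zero_std_ratio(rewards_per_prompt, eps=1e-8):
    """Fraction of prompt groups whose rewards have (near-)zero standard deviation."""
    if not rewards_per_prompt:
        return 0.0
    zero_groups = sum(1 for group in rewards_per_prompt if pstdev(group) <= eps)
    return zero_groups / len(rewards_per_prompt)


# Hypothetical rewards: 4 prompts, 4 images each (the runs above use 16 per prompt).
rewards = [
    [1.0, 1.0, 1.0, 1.0],    # every image scored the same
    [0.0, 0.5, 1.0, 0.25],
    [0.0, 0.0, 0.0, 0.0],    # every image scored the same
    [0.75, 1.0, 0.5, 1.0],
]
ratio = zero_std_ratio(rewards)
print(ratio)
```

In this toy batch two of the four groups are degenerate, so the ratio is 0.5.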

Validation Reward Curve

Evaluated with trainer.val_before_train=True:

qwen_image_ocr_lora: sync reward, 4 GPUs (run_qwen_image_ocr_lora.sh)
qwen_image_ocr_lora_async_reward: async reward on a dedicated 5th GPU (run_qwen_image_ocr_lora_async_reward.sh)

Note: Reward curves may differ from the references above, mainly because of rollout-side stochasticity. Diffusion rollouts sample random latents/noise, and the example scripts do not fix the data seed, so prompt ordering can vary between runs.
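
The two sources of run-to-run variation named in the note can be illustrated with a minimal sketch. It uses only the Python standard library as a stand-in for the real data loader and diffusion sampler; `rollout_plan` and its arguments are hypothetical and do not correspond to VeRL-Omni options.

```python
import random


def rollout_plan(prompts, data_seed=None, noise_seed=None, latent_dim=4):
    """Return (shuffled prompt order, one initial noise vector per prompt)."""
    data_rng = random.Random(data_seed)    # seed None: order differs between runs
    noise_rng = random.Random(noise_seed)  # seed None: initial noise differs between runs
    order = list(prompts)
    data_rng.shuffle(order)
    noise = [[noise_rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in order]
    return order, noise


prompts = [f"prompt-{i}" for i in range(8)]

# Fixed seeds: two runs see the same prompt order and the same starting noise.
run_a = rollout_plan(prompts, data_seed=0, noise_seed=0)
run_b = rollout_plan(prompts, data_seed=0, noise_seed=0)
print(run_a == run_b)

# Unseeded: prompt order and noise generally differ from run to run.
run_c = rollout_plan(prompts)
```

With both seeds fixed the two plans are identical. Leaving either seed unset reintroduces the corresponding source of variation.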

FlowGRPO: non-CFG Full Model Training on Qwen-Image OCR

Experiments used NVIDIA H200 GPUs with lr 3e-5, clip_ratio 1e-5, and optimizer state in fp32. All other parameters match the LoRA setting.

Note that the initial reward is expected to be low for non-CFG full model training.

Full-Model Experiment Settings and Throughput

Script	# GPUs	# GPUs for Actor	# GPUs for Rollout	# GPUs for Async Reward	Batch Size	Images per Prompt	LR	Throughput (images/GPU/s)	Time per Step (s)
`run_qwen_image_ocr.sh`	4	4	4	0 (sync)	32	16	3e-5	0.510	250

Reference wandb curve here.

Full-Model Training - Zero Standard Deviation Ratio and Reward Curve

Figure: Full-model FlowGRPO OCR training, zero standard deviation ratio and reward curve.

Training - Clip Fraction
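
The source does not define clip fraction. The following is a minimal sketch, assuming the usual PPO-style meaning: the share of samples whose new-policy to old-policy probability ratio falls outside [1 - clip_ratio, 1 + clip_ratio]. The function name and the example ratios are illustrative; the default clip_ratio of 1e-5 is the value quoted for these runs.

```python
def clip_fraction(ratios, clip_ratio=1e-5):
    """Share of probability ratios lying outside [1 - clip_ratio, 1 + clip_ratio]."""
    if not ratios:
        return 0.0
    clipped = sum(1 for r in ratios if abs(r - 1.0) > clip_ratio)
    return clipped / len(ratios)


# Hypothetical new-policy / old-policy probability ratios for five samples.
ratios = [1.0, 1.000002, 0.99995, 1.00004, 1.0]
frac = clip_fraction(ratios)
print(frac)
```

Here two of the five ratios deviate from 1 by more than 1e-5, so the clip fraction is 0.4.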

Full-Model Validation Reward Curve

Figure: Full-model FlowGRPO OCR validation reward curve.

FlowGRPO non-CFG Full Model: VeOmni vs FSDP1 Backend (same config)

Apples-to-apples comparison: the VeOmni and FSDP1 actor engines run the same FlowGRPO recipe (same algorithm, data, and hyper-parameters) on the same hardware (64 × NVIDIA H100) and differ only in the training engine. Settings: lr 3e-5, clip_ratio 1e-5, optimizer state in fp32; other parameters match the LoRA setting.

FSDP1 — run_qwen_image_ocr.sh
VeOmni — run_qwen_image_ocr_veomni.sh (see the install guide “Optional engine backends”)

Settings and Throughput

Backend	Script	GPU name	# GPUs	# GPUs for Actor	# GPUs for Rollout	# GPUs for Async Reward	Batch Size	Images per Prompt	LR	Throughput (images/GPU/s)	Time per Step (s)
VeOmni	`run_qwen_image_ocr_veomni.sh`	H100	64	64	64	0 (sync)	32	16	3e-5	0.079	100
FSDP1	`run_qwen_image_ocr.sh`	H100	64	64	64	0 (sync)	32	16	3e-5	0.077	105
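
The size of the gap between the two backends can be read off the table with a few lines of arithmetic. This is a minimal sketch using only the reported values; the variable names are illustrative.

```python
# Time per step (s) and throughput (images/GPU/s) from the table above.
veomni = {"step_seconds": 100, "throughput": 0.079}
fsdp1 = {"step_seconds": 105, "throughput": 0.077}

step_time_saving = 1 - veomni["step_seconds"] / fsdp1["step_seconds"]
throughput_gain = veomni["throughput"] / fsdp1["throughput"] - 1

print(f"VeOmni step time is {step_time_saving:.1%} shorter")
print(f"VeOmni throughput is {throughput_gain:.1%} higher")
```

On these figures VeOmni's step time is about 4.8% shorter and its throughput about 2.6% higher. The two percentages do not agree exactly, plausibly because the table values are rounded.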

Note: VeOmni and FSDP1 run with actor_rollout_ref.actor.veomni_config.param_offload=False, actor_rollout_ref.actor.veomni_config.optimizer_offload=True, and SP=1.

Full-Model Training - Zero Standard Deviation Ratio and Reward Curve

Training - Clip Fraction

Full-Model Validation Reward Curve

FlowDPPO: LoRA Training on Qwen-Image OCR

All experiments used NVIDIA H200 GPUs, LoRA rank 64, ppo_micro_batch_size_per_gpu 16, and the full 1k validation set. Training images per step = batch size × images per prompt = 32 × 16 = 512.

Script	# GPUs	# GPUs for Actor	# GPUs for Rollout	# GPUs for Async Reward	Batch Size	Images per Prompt	LR	Throughput (images/GPU/s)	Time per Step (s)
`run_qwen_image_ocr_lora.sh`	4	4	4	0 (sync)	32	16	3e-4	0.240	540

Figure: FlowDPPO LoRA OCR training, zero standard deviation ratio and reward curve.

LoRA Validation Reward Curve

Figure: FlowDPPO LoRA OCR training, validation curve.

DiffusionNFT: non-CFG LoRA Training on Qwen-Image OCR

All experiments used NVIDIA H200 GPUs, LoRA rank 64, ppo_micro_batch_size_per_gpu 16, and the full 1k validation set. Training images per step = batch size × images per prompt = 32 × 16 = 512.

Script	# GPUs	# GPUs for Actor	# GPUs for Rollout	# GPUs for Async Reward	Batch Size	Images per Prompt	LR	Throughput (images/GPU/s)	Time per Step (s)
`run_qwen_image_ocr_lora.sh`	4	4	4	0 (sync)	24	12	3e-4	0.175	550

Reference wandb curve here.

Figure: DiffusionNFT LoRA OCR training, zero standard deviation ratio and reward curve.

LoRA Validation Reward Curve

Figure: DiffusionNFT LoRA OCR training, validation curve.