Performance Reference
Last updated: 06/05/2026
Below are reference benchmark results for VeRL-Omni training runs.
FlowGRPO: LoRA Training on Qwen-Image OCR
All experiments used NVIDIA H800 GPUs, LoRA rank 64, ppo_micro_batch_size_per_gpu=16, and the full 1k validation set. Training images per step = batch size × images per prompt = 32 × 16 = 512.
Experiment Settings and Throughput
| Script | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|---|---|---|---|---|---|---|---|---|---|
| run_qwen_image_ocr_lora.sh | 4 | 4 | 4 | 0 (sync) | 32 | 16 | 3e-4 | 0.305 | 420 |
| run_qwen_image_ocr_lora_async_reward.sh | 5 | 4 | 4 | 1 | 32 | 16 | 3e-4 | 0.280 | 360 |
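
The throughput column is not defined explicitly on this page. The sketch below assumes it is images per step divided by (total GPUs × time per step); under that assumption the LoRA rows are reproduced to within rounding of the reported step times. The function names are hypothetical and are not part of VeRL-Omni.

```python
def images_per_step(batch_size: int, images_per_prompt: int) -> int:
    """Training images generated in one step (32 x 16 = 512 in the runs above)."""
    return batch_size * images_per_prompt


def throughput_per_gpu(batch_size: int, images_per_prompt: int,
                       num_gpus: int, step_time_s: float) -> float:
    """Assumed definition: images per step / (total GPUs x seconds per step)."""
    return images_per_step(batch_size, images_per_prompt) / (num_gpus * step_time_s)


# (label, total GPUs, time per step in s, reported images/GPU/s) from the table above
rows = [
    ("sync reward", 4, 420, 0.305),
    ("async reward", 5, 360, 0.280),
]

for label, gpus, step_time, reported in rows:
    estimate = throughput_per_gpu(32, 16, gpus, step_time)
    print(f"{label}: estimated {estimate:.3f}, reported {reported:.3f}")
```

Note that the async-reward run is normalized over 5 GPUs under this assumption, which is why its per-GPU throughput is lower even though its step time is shorter.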
Training - Zero Standard Deviation Ratio and Reward Curve
- qwen_image_ocr_lora: sync reward, 4 GPUs (run_qwen_image_ocr_lora.sh)
- qwen_image_ocr_lora_async_reward: async reward on a dedicated 5th GPU (run_qwen_image_ocr_lora_async_reward.sh)
Validation Reward Curve
Evaluated with trainer.val_before_train=True:
- qwen_image_ocr_lora: sync reward, 4 GPUs (run_qwen_image_ocr_lora.sh)
- qwen_image_ocr_lora_async_reward: async reward on a dedicated 5th GPU (run_qwen_image_ocr_lora_async_reward.sh)
Note: Reward curves may differ from the references above mainly due to rollout-side stochasticity: diffusion rollouts sample random latents/noise, and the example scripts do not fix the data seed, so prompt ordering can vary between runs.
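
To illustrate the prompt-ordering part of the note, here is a minimal standard-library sketch showing that a shuffle is only reproducible when its seed is fixed. It is a stand-in, not the VeRL-Omni data loader: `shuffled_prompts` and the sample prompts are hypothetical, and fixing a data seed would not remove the separate randomness from diffusion latents/noise.

```python
import random


def shuffled_prompts(prompts, seed=None):
    """Return the prompts in shuffled order.

    With seed=None the generator is seeded from OS entropy, so the order
    can differ between runs; with a fixed integer seed it is repeatable.
    """
    rng = random.Random(seed)
    order = list(prompts)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order


prompts = [f"prompt-{i}" for i in range(8)]  # placeholder data

# Same seed, same order: two runs would see prompts in the same sequence.
run_a = shuffled_prompts(prompts, seed=1234)
run_b = shuffled_prompts(prompts, seed=1234)
print(run_a == run_b)
```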
FlowGRPO: non-CFG Full Model Training on Qwen-Image OCR
Experiments used NVIDIA H200 GPUs, lr 3e-5, clip_ratio 1e-5, optimizer state fp32. The other parameters are consistent with the LoRA setting.
Note that the initial reward is expected to be low for non-CFG full model training.
Full-Model Experiment Settings and Throughput
| Script | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|---|---|---|---|---|---|---|---|---|---|
| | 4 | 4 | 4 | 0 (sync) | 32 | 16 | 3e-5 | 0.510 | 250 |
Full-Model Training - Zero Standard Deviation Ratio and Reward Curve
Training - Clip Fraction
Full-Model Validation Reward Curve
FlowGRPO non-CFG Full Model: VeOmni vs FSDP1 Backend (same config)
Apples-to-apples comparison: the VeOmni and FSDP1 actor engines run the same FlowGRPO recipe (same algorithm, data, and hyper-parameters) on the same hardware (64 × NVIDIA H100), differing only in the training engine. Both runs use lr 3e-5, clip_ratio 1e-5, and fp32 optimizer state; other parameters match the LoRA setting.
- FSDP1: run_qwen_image_ocr.sh
- VeOmni: run_qwen_image_ocr_veomni.sh (see the install guide, “Optional engine backends”)
Settings and Throughput
| Backend | Script | GPU name | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VeOmni | run_qwen_image_ocr_veomni.sh | H100 | 64 | 64 | 64 | 0 (sync) | 32 | 16 | 3e-5 | 0.079 | 100 |
| FSDP1 | run_qwen_image_ocr.sh | H100 | 64 | 64 | 64 | 0 (sync) | 32 | 16 | 3e-5 | 0.077 | 105 |
Note: VeOmni and FSDP1 run with actor_rollout_ref.actor.veomni_config.param_offload=False, actor_rollout_ref.actor.veomni_config.optimizer_offload=True, and SP=1.
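
To put the two backends' numbers side by side, the short sketch below computes the relative difference of VeOmni against FSDP1 using only the values reported in the table. The helper `relative_change` is hypothetical. Each backend is represented by a single reference run here, so a gap this small may fall within run-to-run variation.

```python
def relative_change(new: float, baseline: float) -> float:
    """Fractional change of `new` relative to `baseline` (0.05 means +5%)."""
    return (new - baseline) / baseline


# Values copied from the table above.
veomni = {"throughput": 0.079, "step_time_s": 100}
fsdp1 = {"throughput": 0.077, "step_time_s": 105}

throughput_gain = relative_change(veomni["throughput"], fsdp1["throughput"])
step_time_change = relative_change(veomni["step_time_s"], fsdp1["step_time_s"])

print(f"Throughput, VeOmni vs FSDP1: {throughput_gain:+.1%}")
print(f"Time per step, VeOmni vs FSDP1: {step_time_change:+.1%}")
```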