Performance Reference

Last updated: 06/05/2026

Below are reference benchmark results for VeRL-Omni training runs.

FlowGRPO: LoRA Training on Qwen-Image OCR

All experiments used NVIDIA H800 GPUs, LoRA rank 64, ppo_micro_batch_size_per_gpu 16, and the full 1k validation set. Training images per step = batch size × images per prompt = 32 × 16 = 512.

Experiment Settings and Throughput

| Script | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|---|---|---|---|---|---|---|---|---|---|
| run_qwen_image_ocr_lora.sh | 4 | 4 | 4 | 0 (sync) | 32 | 16 | 3e-4 | 0.305 | 420 |
| run_qwen_image_ocr_lora_async_reward.sh | 5 | 4 | 4 | 1 | 32 | 16 | 3e-4 | 0.280 | 360 |
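The throughput column can be sanity-checked from the other columns. The sketch below assumes throughput is defined as images per step divided by (total GPUs × time per step); that definition is inferred from the numbers, not stated in the source, and the function names are illustrative. The recomputed values land close to the reported ones, and small gaps are plausible if the step times in the table are rounded.

```python
def images_per_step(batch_size: int, images_per_prompt: int) -> int:
    """Images generated in one training step (batch size x images per prompt)."""
    return batch_size * images_per_prompt


def throughput_per_gpu(batch_size: int, images_per_prompt: int,
                       num_gpus: int, step_time_s: float) -> float:
    """Images per GPU per second.

    Assumption (not from the source): throughput = images / (GPUs * seconds),
    where GPUs counts every GPU in the job, including the async reward GPU.
    """
    return images_per_step(batch_size, images_per_prompt) / (num_gpus * step_time_s)


# Values taken from the LoRA settings table.
sync_tp = throughput_per_gpu(32, 16, num_gpus=4, step_time_s=420)
async_tp = throughput_per_gpu(32, 16, num_gpus=5, step_time_s=360)

print(f"images per step: {images_per_step(32, 16)}")   # 512
print(f"sync:  {sync_tp:.3f} images/GPU/s")            # 0.305
print(f"async: {async_tp:.3f} images/GPU/s")           # 0.284
```

Under this definition the async run finishes each step sooner (360 s vs 420 s) but reports lower per-GPU throughput, because the fifth GPU is counted in the denominator.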

Training - Zero Standard Deviation Ratio and Reward Curve

LoRA FlowGRPO OCR training zero standard deviation ratio and reward curve
  • qwen_image_ocr_lora: sync reward, 4 GPUs (run_qwen_image_ocr_lora.sh)

  • qwen_image_ocr_lora_async_reward: async reward on a dedicated 5th GPU (run_qwen_image_ocr_lora_async_reward.sh)
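For readers unfamiliar with the metric, the sketch below shows one common way to compute a zero standard deviation ratio in GRPO-style training: the fraction of prompt groups whose sampled images all received the same reward, so the group-normalized advantage carries no learning signal. This definition and the helper name `zero_std_ratio` are assumptions for illustration; the exact computation in VeRL-Omni may differ.

```python
from statistics import pstdev


def zero_std_ratio(group_rewards, eps=1e-8):
    """Fraction of prompt groups whose rewards have (near-)zero std.

    group_rewards: list of lists, one inner list of rewards per prompt.
    Assumed definition, not taken from the VeRL-Omni source.
    """
    if not group_rewards:
        return 0.0
    zero_groups = sum(1 for rewards in group_rewards if pstdev(rewards) <= eps)
    return zero_groups / len(group_rewards)


# Toy data: 4 prompts, 4 images per prompt (the real runs use 16).
rewards = [
    [1.0, 1.0, 1.0, 1.0],    # all images equally good: no signal
    [0.0, 0.5, 1.0, 0.25],   # mixed rewards: useful advantages
    [0.0, 0.0, 0.0, 0.0],    # all images equally bad: no signal
    [0.9, 1.0, 0.8, 1.0],    # mixed rewards: useful advantages
]
ratio = zero_std_ratio(rewards)
print(ratio)  # 0.5
```

A high ratio means many prompts contribute nothing to the gradient, which is why it is worth tracking next to the reward curve.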

Validation Reward Curve

Evaluated with trainer.val_before_train=True:

LoRA FlowGRPO OCR validation reward curve
  • qwen_image_ocr_lora: sync reward, 4 GPUs (run_qwen_image_ocr_lora.sh)

  • qwen_image_ocr_lora_async_reward: async reward on a dedicated 5th GPU (run_qwen_image_ocr_lora_async_reward.sh)

Note: Your reward curves may differ from the references above, mainly because of rollout-side stochasticity. Diffusion rollouts sample random latents and noise, and the example scripts do not fix the data seed, so prompt ordering can vary between runs.
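The effect of an unfixed seed can be illustrated with a toy sketch. It uses only Python's standard `random` module as a stand-in for the data loader shuffle and the latent sampler; `sample_run` and the prompt names are hypothetical and are not part of the example scripts.

```python
import random


def sample_run(prompts, seed=None):
    """Toy stand-in for one run: shuffle prompts, then draw some noise.

    seed=None means the generator is seeded from system entropy, which
    mimics a run where no seed is fixed. Hypothetical helper, for
    illustration only.
    """
    rng = random.Random(seed)
    order = list(prompts)
    rng.shuffle(order)                                # prompt ordering
    noise = [rng.gauss(0.0, 1.0) for _ in range(3)]   # "initial latents"
    return order, noise


prompts = ["p0", "p1", "p2", "p3", "p4"]

# Same seed: identical prompt order and identical noise.
run_a = sample_run(prompts, seed=42)
run_b = sample_run(prompts, seed=42)
print(run_a == run_b)  # True

# No seed: order and noise will generally differ from run to run.
run_c = sample_run(prompts)
print(run_c[0])
```

Even with every seed fixed, nondeterministic GPU kernels can still cause small differences, so matching a reference curve exactly is not expected.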

FlowGRPO: non-CFG Full Model Training on Qwen-Image OCR

Experiments used NVIDIA H200 GPUs, lr 3e-5, clip_ratio 1e-5, and fp32 optimizer state. All other parameters match the LoRA setting.

Note that the initial reward is expected to be low for non-CFG full model training.

Full-Model Experiment Settings and Throughput

| Script | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|---|---|---|---|---|---|---|---|---|---|
| run_qwen_image_ocr.sh | 4 | 4 | 4 | 0 (sync) | 32 | 16 | 3e-5 | 0.510 | 250 |

Full-Model Training - Zero Standard Deviation Ratio and Reward Curve

Full Model FlowGRPO OCR training zero standard deviation ratio and reward curve

Training - Clip Fraction

Full Model FlowGRPO OCR training Clip Fraction
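As a reminder of what this metric usually measures, the sketch below computes a PPO-style clip fraction: the share of samples whose probability ratio falls outside [1 − clip_ratio, 1 + clip_ratio]. This is the common textbook definition and an assumption here; VeRL-Omni's implementation may differ in details such as masking or separate lower and upper bounds. The log-probability values are made up for illustration.

```python
import math


def clip_fraction(new_logp, old_logp, clip_ratio):
    """Share of samples whose ratio exp(new - old) leaves [1 - c, 1 + c].

    Assumed (standard PPO) definition, not taken from the VeRL-Omni source.
    """
    ratios = [math.exp(n - o) for n, o in zip(new_logp, old_logp)]
    clipped = sum(1 for r in ratios if abs(r - 1.0) > clip_ratio)
    return clipped / len(ratios)


# Made-up log-probabilities; clip_ratio matches the full-model setting (1e-5).
old_logp = [-1.0, -1.0, -1.0, -1.0]
new_logp = [-1.0, -1.0 + 5e-6, -1.0 + 2e-5, -1.0 - 3e-5]

frac = clip_fraction(new_logp, old_logp, clip_ratio=1e-5)
print(frac)  # 0.5: the last two samples fall outside the clip range
```

With a clip_ratio as small as 1e-5, even tiny policy changes are clipped, so the clip fraction is a sensitive indicator of how far the policy moves within a step.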

Full-Model Validation Reward Curve

Full Model FlowGRPO OCR validation reward curve

FlowGRPO non-CFG Full Model: VeOmni vs FSDP1 Backend (same config)

Apples-to-apples comparison: the VeOmni and FSDP1 actor engines run the same FlowGRPO recipe (same algorithm, data, and hyperparameters) on the same hardware (64 × NVIDIA H100) and differ only in the training engine. Both use lr 3e-5, clip_ratio 1e-5, and fp32 optimizer state; other parameters match the LoRA setting.

  • FSDP1: run_qwen_image_ocr.sh

  • VeOmni: run_qwen_image_ocr_veomni.sh (see “Optional engine backends” in the install guide)

Settings and Throughput

| Backend | Script | GPU name | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VeOmni | run_qwen_image_ocr_veomni.sh | H100 | 64 | 64 | 64 | 0 (sync) | 32 | 16 | 3e-5 | 0.079 | 100 |
| FSDP1 | run_qwen_image_ocr.sh | H100 | 64 | 64 | 64 | 0 (sync) | 32 | 16 | 3e-5 | 0.077 | 105 |

Note: VeOmni and FSDP1 run with actor_rollout_ref.actor.veomni_config.param_offload=False, actor_rollout_ref.actor.veomni_config.optimizer_offload=True, and SP=1.
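To put the two rows in perspective, the sketch below computes the relative difference between the backends from the reported table values. The helper `relative_change` is illustrative, and because the table entries appear to be rounded, the result is best read as rough parity with a small edge for VeOmni rather than a precise speedup.

```python
def relative_change(new: float, baseline: float) -> float:
    """Relative difference of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# Values copied from the VeOmni vs FSDP1 table.
step_time_s = {"VeOmni": 100.0, "FSDP1": 105.0}
throughput = {"VeOmni": 0.079, "FSDP1": 0.077}

time_delta = relative_change(step_time_s["VeOmni"], step_time_s["FSDP1"])
tp_delta = relative_change(throughput["VeOmni"], throughput["FSDP1"])

print(f"time per step: {time_delta:+.1%}")  # -4.8%
print(f"throughput:    {tp_delta:+.1%}")    # +2.6%
```

The two percentages do not match exactly, which is consistent with rounding in the reported figures; both come from single runs and say nothing about run-to-run variance.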

Full-Model Training - Zero Standard Deviation Ratio and Reward Curve

VeOmni vs FSDP1 full model FlowGRPO OCR training zero standard deviation ratio and reward curve

Training - Clip Fraction

VeOmni vs FSDP1 full model FlowGRPO OCR training clip fraction

Full-Model Validation Reward Curve

VeOmni vs FSDP1 full model FlowGRPO OCR validation reward curve