(performance)=
# Performance Reference

Last updated: 06/05/2026

Below are reference benchmark results for VeRL-Omni training runs.

## FlowGRPO: LoRA Training on Qwen-Image OCR

> All experiments used NVIDIA H800 GPUs, LoRA rank 64, `ppo_micro_batch_size_per_gpu` 16, and the full 1k validation set. Training images per step = batch size × images per prompt = 32 × 16 = 512.

### Experiment Settings and Throughput

| Script | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|--------|--------|------------------|--------------------|-------------------------|------------|-------------------|----|-----------------------|-------------------|
| `run_qwen_image_ocr_lora.sh` | 4 | 4 | 4 | 0 (sync) | 32 | 16 | 3e-4 | 0.305 | 420 |
| `run_qwen_image_ocr_lora_async_reward.sh` | 5 | 4 | 4 | 1 | 32 | 16 | 3e-4 | 0.280 | 360 |

### Training - Zero Standard Deviation Ratio and Reward Curve

LoRA FlowGRPO OCR training zero standard deviation ratio and reward curve
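
This page does not define the zero standard deviation ratio. The sketch below assumes it is the fraction of prompt groups in a step whose sampled images all received the same reward, which leaves a group-relative method such as GRPO with no advantage signal for that prompt. The function name `zero_std_ratio`, the `eps` tolerance, and the toy rewards are hypothetical and are not taken from the VeRL-Omni code.

```python
from statistics import pstdev


def zero_std_ratio(rewards_per_prompt, eps=1e-8):
    """Fraction of prompt groups whose rewards have (near-)zero standard deviation.

    rewards_per_prompt: list of lists, one inner list of rewards per prompt.
    eps: tolerance below which a group's std is treated as zero (assumed value).
    """
    if not rewards_per_prompt:
        raise ValueError("need at least one prompt group")
    zero_groups = sum(1 for group in rewards_per_prompt if pstdev(group) <= eps)
    return zero_groups / len(rewards_per_prompt)


# Toy data: 4 prompts with 4 images each (the runs on this page use 32 x 16).
groups = [
    [1.0, 1.0, 1.0, 1.0],   # every image scored the same: zero std
    [0.0, 0.5, 1.0, 0.25],  # mixed rewards
    [0.0, 0.0, 0.0, 0.0],   # every image failed: zero std
    [0.2, 0.2, 0.9, 0.2],   # mixed rewards
]
ratio = zero_std_ratio(groups)
print(f"zero-std ratio: {ratio:.2f}")
```

Under this reading, a high ratio means many prompts contribute nothing to the update in that step, which is one reason to track it alongside the reward curve.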

LoRA FlowGRPO OCR validation reward curve

- `qwen_image_ocr_lora`: sync reward, 4 GPUs (`run_qwen_image_ocr_lora.sh`)
- `qwen_image_ocr_lora_async_reward`: async reward on a dedicated 5th GPU (`run_qwen_image_ocr_lora_async_reward.sh`)

> **Note:** Reward curves may differ from the references above mainly due to rollout-side stochasticity: diffusion rollouts sample random latents/noise, and the example scripts do not fix the data seed, so prompt ordering can vary between runs.

## FlowGRPO: non-CFG Full Model Training on Qwen-Image OCR

> Experiments used NVIDIA H200 GPUs, lr 3e-5, clip_ratio 1e-5, optimizer state fp32. The other parameters are consistent with the LoRA setting.
>
> Note that the initial reward is expected to be low for non-CFG full model training.

### Full-Model Experiment Settings and Throughput

| Script | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|--------|--------|------------------|--------------------|-------------------------|------------|-------------------|----|-----------------------|-------------------|
| `run_qwen_image_ocr.sh` | 4 | 4 | 4 | 0 (sync) | 32 | 16 | 3e-5 | 0.510 | 250 |

### Full-Model Training - Zero Standard Deviation Ratio and Reward Curve

Full Model FlowGRPO OCR training zero standard deviation ratio and reward curve

### Training - Clip Fraction

### Full-Model Validation Reward Curve

Full Model FlowGRPO OCR validation reward curve
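
The throughput column for this run appears to follow from the other columns as images per step divided by (number of GPUs × time per step). That relation is inferred from the reported numbers and is not stated by the source. The helper name below is hypothetical. A minimal sketch of the arithmetic for the `run_qwen_image_ocr.sh` row:

```python
def images_per_gpu_per_second(batch_size, images_per_prompt, num_gpus, step_seconds):
    """Estimate throughput in images/GPU/s from the table columns.

    Assumes throughput = (batch size * images per prompt) / (GPUs * time per step).
    """
    images_per_step = batch_size * images_per_prompt  # 32 * 16 = 512 in these runs
    return images_per_step / (num_gpus * step_seconds)


# Full-model row: 4 GPUs, 250 s per step, reported throughput 0.510.
estimate = images_per_gpu_per_second(
    batch_size=32, images_per_prompt=16, num_gpus=4, step_seconds=250
)
print(f"estimated throughput: {estimate:.3f} images/GPU/s")
```

The estimate lands close to the reported value. The small gap is presumably because the time per step in the table is rounded.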

## FlowGRPO non-CFG Full Model: VeOmni vs FSDP1 Backend (same config)

> Apples-to-apples comparison: the **VeOmni** and **FSDP1** actor engines run the *same* FlowGRPO recipe (same algorithm, data, and hyper-parameters) on the *same* hardware (64 × NVIDIA H100), differing only in the training engine. lr 3e-5, clip_ratio 1e-5, optimizer state fp32; other parameters match the LoRA setting.

- **FSDP1**: `run_qwen_image_ocr.sh`
- **VeOmni**: `run_qwen_image_ocr_veomni.sh` (see the [install guide](../start/install.md) "Optional engine backends")

### Settings and Throughput

| Backend | Script | GPU name | # GPUs | # GPUs for Actor | # GPUs for Rollout | # GPUs for Async Reward | Batch Size | Images per Prompt | LR | Throughput (images/GPU/s) | Time per Step (s) |
|---------|--------|--------|--------|------------------|--------------------|-------------------------|------------|-------------------|----|-----------------------|-------------------|
| VeOmni | `run_qwen_image_ocr_veomni.sh` | H100 | 64 | 64 | 64 | 0 (sync) | 32 | 16 | 3e-5 | 0.079 | 100 |
| FSDP1 | `run_qwen_image_ocr.sh` | H100 | 64 | 64 | 64 | 0 (sync) | 32 | 16 | 3e-5 | 0.077 | 105 |

> **Note:** VeOmni and FSDP1 run with `actor_rollout_ref.actor.veomni_config.param_offload=False`, `actor_rollout_ref.actor.veomni_config.optimizer_offload=True`, and `SP=1`.

### Full-Model Training - Zero Standard Deviation Ratio and Reward Curve

### Training - Clip Fraction

### Full-Model Validation Reward Curve