Supported Models

Last updated: 06/26/2026.

VeRL-Omni supports RL post-training for generative models across image, video, audio, and omni modalities. This page catalogues every model with a ready-to-run example, its architecture and pipeline details, supported trainers, and hardware requirements.

Diffusion Image Models

Qwen-Image

Property	Detail
Hugging Face ID	`Qwen/Qwen-Image`
Architecture	MM-DiT (Multi-Modal Diffusion Transformer) with joint image-text attention
Modality	Text → Image
Pipeline	Flow-matching with True CFG and distilled guidance embedding
Text encoder	Qwen2-style tokenizer + T5-style encoder
Resolution	Variable (512×512, 1024×1024)

Supported trainers:

Trainer	Example script	GPU config
Flow-GRPO (LoRA)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora.sh`	4×GPU
Flow-GRPO (full)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr.sh`	4×H200
Flow-GRPO (async)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_async_reward.sh`	5×GPU
Flow-GRPO (multi-node)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_multi_node.sh`	2×4 GPU
Flow-GRPO (SP=2)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_sp2.sh`	4×GPU
Flow-GRPO (rollout-corr)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_rollout_corr.sh`	4×GPU
Flow-GRPO (VeOmni)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_veomni.sh`	64×H100
Flow-GRPO (NPU)	`examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_npu.sh`	8×NPU
Flow-DPPO	`examples/flowdppo_trainer/qwen_image/run_qwen_image_ocr_lora.sh`	4×GPU
GRPO-Guard	`examples/grpoguard_trainer/qwen_image/run_qwen_image_ocr_lora.sh`	4×GPU
Mix-GRPO	`examples/mixgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_mixgrpo.sh`	4×GPU
Diffusion-DPO	`examples/dpo_trainer/qwen_image/run_qwen_image_online_dpo_lora.sh`	4×GPU
DiffusionNFT	`examples/diffusionnft_trainer/qwen_image/run_qwen_image_ocr_lora.sh`	4×GPU

Reward model: Qwen/Qwen3-VL-8B-Instruct (OCR VLM judge, TP=4 colocated).

Stable Diffusion 3.5 Medium

Property	Detail
Hugging Face ID	`stabilityai/stable-diffusion-3.5-medium`
Architecture	MM-DiT with dual CLIP + T5 text encoders
Modality	Text → Image
Pipeline	Flow-matching (distilled guidance only, no True CFG)
Text encoder	CLIP-L, CLIP-G, T5-XXL
Default resolution	384×384
Chat template	Custom — extracts raw user content only (no system prompt)

Supported trainers:

Trainer	Example script	GPU config
Flow-GRPO (LoRA)	`examples/flowgrpo_trainer/sd35/run_sd35_medium_ocr_lora.sh`	3×GPU (2 actor+rollout, 1 reward)
Diffusion-DPO (offline)	`examples/dpo_trainer/sd35/run_sd35_medium_offline_dpo_lora.sh`	3×GPU

Reward model: Qwen/Qwen2.5-VL-3B-Instruct (OCR VLM judge, TP=1, dedicated pool).

Diffusion Video Models

Wan2.2-TI2V-5B

Property	Detail
Hugging Face ID	`Wan-AI/Wan2.2-TI2V-5B-Diffusers`
Architecture	Wan-style DiT with separate self-attention and cross-attention
Modality	Text → Video
Pipeline	Flow-matching with spatiotemporal latents
Text encoder	T5
Latent stream	Spatiotemporal video latents
Prompt stream	Text-encoder tokens (cross-attention KV)
SDE variants	`dance_sde` (recommended, score-based), `sde` (FlowGRPO), `cps` (consistency-preserving)

Supported trainers:

Trainer	Example script	GPU config
DanceGRPO (HPSv3)	`examples/dancegrpo_trainer/wan22/run_wan22_5b_t2v_hpsv3_npu.sh`	8×NPU (Ascend 800T A2)

Reward model: HPSv3 (Human Preference Score v3) — local safetensors checkpoint placed at $WORKSPACE/CKPT/HPSv3/HPSv3.safetensors.

The HPSv3 reward is the only validated configuration. Other reward functions (e.g. OCR, aesthetic score) can be plugged in by changing reward.custom_reward_function.

Unified Multimodal Models

BAGEL

Property	Detail
Architecture	Unified multimodal understanding + generation
Modality	Text + Image (understand and generate)
Deploy config	`examples/flowgrpo_trainer/bagel/bagel_deploy_config.yaml`
Rollout	vLLM-Omni with per-stage YAML for engine memory/batching control

Supported trainers:

Trainer	Example script	GPU config
Flow-GRPO (LoRA, OCR)	`examples/flowgrpo_trainer/bagel/run_bagel_ocr_lora.sh`	4×GPU
Flow-GRPO (LoRA, PickScore)	`examples/flowgrpo_trainer/bagel/run_bagel_pickscore_lora.sh`	4×GPU

BAGEL uses a per-stage deploy YAML that overrides top-level vLLM engine arguments — tune gpu_memory_utilization and batch sizes directly in the stage config file.

Omni-Modality Models

Qwen3-Omni-30B-A3B Thinker

Property	Detail
Hugging Face ID	`Qwen/Qwen3-Omni-30B-A3B-Instruct`
Architecture	Omni-modality Thinker with Mixture-of-Experts (30B total, 3B active)
Modality	Text + Image + Audio + Video (understand and generate)
Trainer type	GSPO — Group Sampling Policy Optimization (verl-native PPO-style)
FSDP	Full FSDP with LoRA (rank 64), param and optimizer CPU offload
Rollout	vLLM-Omni TP=4 colocated on the same GPUs as the FSDP actor
Stage config	`examples/gspo_trainer/qwen3_omni/qwen3_omni_thinker_only.yaml` (`gpu_memory_utilization=0.4`)
External module	`verl_omni.models.transformers.qwen3_omni_thinker`

For version requirements and detailed setup instructions, see examples/gspo_trainer/README.md.

Supported trainers:

Trainer	Example script	GPU config
GSPO (math)	`examples/gspo_trainer/qwen3_omni/run_qwen3_omni_thinker_gspo_lora.sh`	4×H100/H200 80GB

The actor (FSDP, 30B + LoRA r=64 with offloading) and vLLM-Omni rollout (TP=4) colocate on the same 4 GPUs. gpu_memory_utilization is kept at 0.4 in the stage config to leave headroom for the FSDP actor.

Model Architecture Summary

Model	Architecture	Text encoder
Qwen-Image	MM-DiT	Qwen2 + T5
SD3.5 Medium	MM-DiT	CLIP-L + CLIP-G + T5
Wan2.2-TI2V-5B	Wan DiT	T5
BAGEL	Unified MM	—
Qwen3-Omni-30B	Omni MoE	Qwen3

Reward Models

Reward model	HF ID / Source	Modality	Used by	Deployment
Qwen3-VL-8B-Instruct	`Qwen/Qwen3-VL-8B-Instruct`	Vision-Language	Qwen-Image (all trainers)	vLLM, TP=4, colocated
Qwen2.5-VL-3B-Instruct	`Qwen/Qwen2.5-VL-3B-Instruct`	Vision-Language	SD3.5 (Flow-GRPO)	vLLM, TP=1, dedicated pool
HPSv3	Local `.safetensors`	Vision (aesthetic)	Wan2.2 (DanceGRPO)	Local safetensors load
HTTP scorer	External HTTP service	Any	Any model	Gunicorn/Flask, pickle protocol
JPEG incompressibility	Rule-based	Image stats	Any diffusion model	No model process needed

For end-to-end instructions on setting up each reward, see the respective trainer’s README in examples/.

Which Trainer for Which Model?

Algorithm	Qwen-Image	SD3.5	Wan2.2	BAGEL	Qwen3-Omni
Flow-GRPO	✅	✅	—	✅	—
Flow-DPPO	✅	—	—	—	—
GRPO-Guard	✅	—	—	—	—
Mix-GRPO	✅	—	—	—	—
DanceGRPO	—	—	✅	—	—
Diffusion-DPO	✅	✅	—	—	—
DiffusionNFT	✅	—	—	—	—
GSPO	—	—	—	—	✅