Supported Models
Last updated: 06/26/2026.
VeRL-Omni supports RL post-training for generative models across image, video, audio, and omni modalities. This page catalogues every model that has a ready-to-run example, along with its architecture and pipeline details, supported trainers, and hardware requirements.
Diffusion Image Models
Qwen-Image
| Property | Detail |
|---|---|
| Hugging Face ID | |
| Architecture | MM-DiT (Multi-Modal Diffusion Transformer) with joint image-text attention |
| Modality | Text → Image |
| Pipeline | Flow-matching with True CFG and distilled guidance embedding |
| Text encoder | Qwen2-style tokenizer + T5-style encoder |
| Resolution | Variable (512×512, 1024×1024) |
Supported trainers:
| Trainer | Example script | GPU config |
|---|---|---|
| Flow-GRPO (LoRA) | | 4×GPU |
| Flow-GRPO (full) | | 4×H200 |
| Flow-GRPO (async) | | 5×GPU |
| Flow-GRPO (multi-node) | | 2×4 GPU |
| Flow-GRPO (SP=2) | | 4×GPU |
| Flow-GRPO (rollout-corr) | | 4×GPU |
| Flow-GRPO (VeOmni) | | 64×H100 |
| Flow-GRPO (NPU) | | 8×NPU |
| Flow-DPPO | | 4×GPU |
| GRPO-Guard | | 4×GPU |
| Mix-GRPO | | 4×GPU |
| Diffusion-DPO | | 4×GPU |
| DiffusionNFT | | 4×GPU |
Reward model: Qwen/Qwen3-VL-8B-Instruct (OCR VLM judge, TP=4 colocated).
Stable Diffusion 3.5 Medium
| Property | Detail |
|---|---|
| Hugging Face ID | |
| Architecture | MM-DiT with dual CLIP + T5 text encoders |
| Modality | Text → Image |
| Pipeline | Flow-matching (distilled guidance only, no True CFG) |
| Text encoder | CLIP-L, CLIP-G, T5-XXL |
| Default resolution | 384×384 |
| Chat template | Custom: extracts raw user content only (no system prompt) |
Supported trainers:
| Trainer | Example script | GPU config |
|---|---|---|
| Flow-GRPO (LoRA) | | 3×GPU (2 actor+rollout, 1 reward) |
| Diffusion-DPO (offline) | | 3×GPU |
Reward model: Qwen/Qwen2.5-VL-3B-Instruct (OCR VLM judge, TP=1, dedicated pool).
Diffusion Video Models
Wan2.2-TI2V-5B
| Property | Detail |
|---|---|
| Hugging Face ID | |
| Architecture | Wan-style DiT with separate self-attention and cross-attention |
| Modality | Text → Video |
| Pipeline | Flow-matching with spatiotemporal latents |
| Text encoder | T5 |
| Latent stream | Spatiotemporal video latents |
| Prompt stream | Text-encoder tokens (cross-attention KV) |
| SDE variants | |
Supported trainers:
| Trainer | Example script | GPU config |
|---|---|---|
| DanceGRPO (HPSv3) | | 8×NPU (Ascend 800T A2) |
Reward model: HPSv3 (Human Preference Score v3), loaded from a local safetensors
checkpoint placed at $WORKSPACE/CKPT/HPSv3/HPSv3.safetensors.
HPSv3 is the only validated reward configuration for this model. Other reward
functions (e.g. OCR, aesthetic score) can be plugged in by changing
reward.custom_reward_function.
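As a rough illustration of what a replacement reward could compute, the sketch below scores OCR output against the target text using normalized edit distance. The function names `levenshtein` and `ocr_text_reward`, the signature, and the assumption that the recognized text is already available as a string are all hypothetical. The interface that reward.custom_reward_function actually expects is not described on this page, so check the trainer README before adapting it.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def ocr_text_reward(target_text: str, recognized_text: str) -> float:
    """Hypothetical reward in [0, 1]: 1.0 on exact match, lower as edits grow.

    Assumes an OCR step has already produced `recognized_text`.
    """
    target = target_text.strip().lower()
    recognized = recognized_text.strip().lower()
    if not target:
        return 1.0 if not recognized else 0.0
    distance = levenshtein(target, recognized)
    return max(0.0, 1.0 - distance / len(target))
```

Normalizing by the target length keeps the score comparable across prompts of different lengths, and clamping at zero prevents very long OCR output from producing a negative reward.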
Unified Multimodal Models
BAGEL
| Property | Detail |
|---|---|
| Architecture | Unified multimodal understanding + generation |
| Modality | Text + Image (understand and generate) |
| Deploy config | |
| Rollout | vLLM-Omni with per-stage YAML for engine memory/batching control |
Supported trainers:
| Trainer | Example script | GPU config |
|---|---|---|
| Flow-GRPO (LoRA, OCR) | | 4×GPU |
| Flow-GRPO (LoRA, PickScore) | | 4×GPU |
BAGEL uses a per-stage deploy YAML that overrides top-level vLLM engine arguments,
so tune gpu_memory_utilization and batch sizes directly in the stage config file.
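The override behaviour can be pictured as a dictionary merge in which stage-level keys win over top-level ones. This is a minimal sketch, not the vLLM-Omni config loader: the helper `resolve_engine_args` and the example values are hypothetical, and `max_num_seqs` is used only as a stand-in for a batch-size setting.

```python
def resolve_engine_args(top_level: dict, stage_overrides: dict) -> dict:
    """Return the effective arguments: stage-level keys replace top-level ones."""
    effective = dict(top_level)        # copy so the inputs are not mutated
    effective.update(stage_overrides)  # stage config takes precedence
    return effective


# Illustrative values only; the real keys live in the stage config file.
top_level_args = {"gpu_memory_utilization": 0.9, "max_num_seqs": 64}
stage_args = {"gpu_memory_utilization": 0.5}

effective_args = resolve_engine_args(top_level_args, stage_args)
```

Under this assumption, changing a top-level value has no effect once the stage file sets the same key, which is why tuning belongs in the stage config.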
Omni-Modality Models
Qwen3-Omni-30B-A3B Thinker
| Property | Detail |
|---|---|
| Hugging Face ID | |
| Architecture | Omni-modality Thinker with Mixture-of-Experts (30B total, 3B active) |
| Modality | Text + Image + Audio + Video (understand and generate) |
| Trainer type | GSPO: Group Sequence Policy Optimization (verl-native PPO-style) |
| FSDP | Full FSDP with LoRA (rank 64), param and optimizer CPU offload |
| Rollout | vLLM-Omni TP=4 colocated on the same GPUs as the FSDP actor |
| Stage config | |
| External module | |
For version requirements and detailed setup instructions, see
examples/gspo_trainer/README.md.
Supported trainers:
| Trainer | Example script | GPU config |
|---|---|---|
| GSPO (math) | | 4×H100/H200 80GB |
The actor (FSDP, 30B + LoRA r=64 with offloading) and vLLM-Omni rollout (TP=4)
colocate on the same 4 GPUs. gpu_memory_utilization is kept at 0.4 in the
stage config to leave headroom for the FSDP actor.
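To make the headroom concrete, the arithmetic can be sketched as follows. The helper `rollout_memory_budget` is hypothetical, and it assumes gpu_memory_utilization is the fraction of each GPU's total memory that the rollout engine may claim; the 80 GB figure is taken from the GPU config above.

```python
def rollout_memory_budget(gpu_total_gb: float,
                          gpu_memory_utilization: float) -> tuple[float, float]:
    """Split one GPU's memory into (rollout engine share, remaining headroom), in GB."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    engine_gb = gpu_total_gb * gpu_memory_utilization
    return engine_gb, gpu_total_gb - engine_gb


engine_gb, headroom_gb = rollout_memory_budget(80.0, 0.4)
print(f"rollout engine: ~{engine_gb:.0f} GB, remaining: ~{headroom_gb:.0f} GB")
```

With these inputs, roughly 32 GB per GPU goes to the rollout engine and roughly 48 GB remains for everything else on the device, including the FSDP actor.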
Model Architecture Summary
| Model | Architecture | Text encoder |
|---|---|---|
| Qwen-Image | MM-DiT | Qwen2 + T5 |
| SD3.5 Medium | MM-DiT | CLIP-L + CLIP-G + T5 |
| Wan2.2-TI2V-5B | Wan DiT | T5 |
| BAGEL | Unified MM | — |
| Qwen3-Omni-30B | Omni MoE | Qwen3 |
Reward Models
| Reward model | HF ID / Source | Modality | Used by | Deployment |
|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | | Vision-Language | Qwen-Image (all trainers) | vLLM, TP=4, colocated |
| Qwen2.5-VL-3B-Instruct | | Vision-Language | SD3.5 (Flow-GRPO) | vLLM, TP=1, dedicated pool |
| HPSv3 | Local | Vision (aesthetic) | Wan2.2 (DanceGRPO) | Local safetensors load |
| HTTP scorer | External HTTP service | Any | Any model | Gunicorn/Flask, pickle protocol |
| JPEG incompressibility | Rule-based | Image stats | Any diffusion model | No model process needed |
For end-to-end instructions on setting up each reward, see the respective
trainer’s README in examples/.
Which Trainer for Which Model?
| Algorithm | Qwen-Image | SD3.5 | Wan2.2 | BAGEL | Qwen3-Omni |
|---|---|---|---|---|---|
| Flow-GRPO | ✅ | ✅ | — | ✅ | — |
| Flow-DPPO | ✅ | — | — | — | — |
| GRPO-Guard | ✅ | — | — | — | — |
| Mix-GRPO | ✅ | — | — | — | — |
| DanceGRPO | — | — | ✅ | — | — |
| Diffusion-DPO | ✅ | ✅ | — | — | — |
| DiffusionNFT | ✅ | — | — | — | — |
| GSPO | — | — | — | — | ✅ |
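For scripting, the matrix above can be transcribed into a small lookup. This is only a convenience sketch: the names `MODELS`, `SUPPORT`, and `trainers_for` are hypothetical and are not part of VeRL-Omni, and the data must be kept in sync with the table by hand.

```python
MODELS = ["Qwen-Image", "SD3.5", "Wan2.2", "BAGEL", "Qwen3-Omni"]

# Transcribed from the table above: algorithm -> models marked as supported.
SUPPORT = {
    "Flow-GRPO": {"Qwen-Image", "SD3.5", "BAGEL"},
    "Flow-DPPO": {"Qwen-Image"},
    "GRPO-Guard": {"Qwen-Image"},
    "Mix-GRPO": {"Qwen-Image"},
    "DanceGRPO": {"Wan2.2"},
    "Diffusion-DPO": {"Qwen-Image", "SD3.5"},
    "DiffusionNFT": {"Qwen-Image"},
    "GSPO": {"Qwen3-Omni"},
}


def trainers_for(model: str) -> list[str]:
    """List the algorithms marked as supported for `model`, in table order."""
    if model not in MODELS:
        raise KeyError(f"unknown model: {model!r}")
    return [algo for algo, models in SUPPORT.items() if model in models]


print(trainers_for("SD3.5"))
```

For SD3.5 this prints `['Flow-GRPO', 'Diffusion-DPO']`, matching the two checkmarks in its column.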