# Supported Models Last updated: 06/26/2026. VeRL-Omni supports RL post-training for generative models across image, video, audio, and omni modalities. This page catalogues every model with a ready-to-run example, its architecture and pipeline details, supported trainers, and hardware requirements. --- ## Diffusion Image Models ### Qwen-Image | Property | Detail | |----------|--------| | **Hugging Face ID** | `Qwen/Qwen-Image` | | **Architecture** | MM-DiT (Multi-Modal Diffusion Transformer) with joint image-text attention | | **Modality** | Text → Image | | **Pipeline** | Flow-matching with True CFG and distilled guidance embedding | | **Text encoder** | Qwen2-style tokenizer + T5-style encoder | | **Resolution** | Variable (512×512, 1024×1024) | **Supported trainers:** | Trainer | Example script | GPU config | |---------|---------------|------------| | Flow-GRPO (LoRA) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU | | Flow-GRPO (full) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr.sh` | 4×H200 | | Flow-GRPO (async) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_async_reward.sh` | 5×GPU | | Flow-GRPO (multi-node) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_multi_node.sh` | 2×4 GPU | | Flow-GRPO (SP=2) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_sp2.sh` | 4×GPU | | Flow-GRPO (rollout-corr) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_rollout_corr.sh` | 4×GPU | | Flow-GRPO (VeOmni) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_veomni.sh` | 64×H100 | | Flow-GRPO (NPU) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_npu.sh` | 8×NPU | | Flow-DPPO | `examples/flowdppo_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU | | GRPO-Guard | `examples/grpoguard_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU | | Mix-GRPO | `examples/mixgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_mixgrpo.sh` | 4×GPU | | Diffusion-DPO | `examples/dpo_trainer/qwen_image/run_qwen_image_online_dpo_lora.sh` | 4×GPU | | DiffusionNFT | `examples/diffusionnft_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU | **Reward model:** `Qwen/Qwen3-VL-8B-Instruct` (OCR VLM judge, TP=4 colocated). ### Stable Diffusion 3.5 Medium | Property | Detail | |----------|--------| | **Hugging Face ID** | `stabilityai/stable-diffusion-3.5-medium` | | **Architecture** | MM-DiT with dual CLIP + T5 text encoders | | **Modality** | Text → Image | | **Pipeline** | Flow-matching (distilled guidance only, no True CFG) | | **Text encoder** | CLIP-L, CLIP-G, T5-XXL | | **Default resolution** | 384×384 | | **Chat template** | Custom — extracts raw user content only (no system prompt) | **Supported trainers:** | Trainer | Example script | GPU config | |---------|---------------|------------| | Flow-GRPO (LoRA) | `examples/flowgrpo_trainer/sd35/run_sd35_medium_ocr_lora.sh` | 3×GPU (2 actor+rollout, 1 reward) | | Diffusion-DPO (offline) | `examples/dpo_trainer/sd35/run_sd35_medium_offline_dpo_lora.sh` | 3×GPU | **Reward model:** `Qwen/Qwen2.5-VL-3B-Instruct` (OCR VLM judge, TP=1, dedicated pool). --- ## Diffusion Video Models ### Wan2.2-TI2V-5B | Property | Detail | |----------|--------| | **Hugging Face ID** | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | | **Architecture** | Wan-style DiT with separate self-attention and cross-attention | | **Modality** | Text → Video | | **Pipeline** | Flow-matching with spatiotemporal latents | | **Text encoder** | T5 | | **Latent stream** | Spatiotemporal video latents | | **Prompt stream** | Text-encoder tokens (cross-attention KV) | | **SDE variants** | `dance_sde` (recommended, score-based), `sde` (FlowGRPO), `cps` (consistency-preserving) | **Supported trainers:** | Trainer | Example script | GPU config | |---------|---------------|------------| | DanceGRPO (HPSv3) | `examples/dancegrpo_trainer/wan22/run_wan22_5b_t2v_hpsv3_npu.sh` | 8×NPU (Ascend 800T A2) | **Reward model:** HPSv3 (Human Preference Score v3) — local safetensors checkpoint placed at `$WORKSPACE/CKPT/HPSv3/HPSv3.safetensors`. The HPSv3 reward is the only validated configuration. Other reward functions (e.g. OCR, aesthetic score) can be plugged in by changing `reward.custom_reward_function`. --- ## Unified Multimodal Models ### BAGEL | Property | Detail | |----------|--------| | **Architecture** | Unified multimodal understanding + generation | | **Modality** | Text + Image (understand and generate) | | **Deploy config** | `examples/flowgrpo_trainer/bagel/bagel_deploy_config.yaml` | | **Rollout** | vLLM-Omni with per-stage YAML for engine memory/batching control | **Supported trainers:** | Trainer | Example script | GPU config | |---------|---------------|------------| | Flow-GRPO (LoRA, OCR) | `examples/flowgrpo_trainer/bagel/run_bagel_ocr_lora.sh` | 4×GPU | | Flow-GRPO (LoRA, PickScore) | `examples/flowgrpo_trainer/bagel/run_bagel_pickscore_lora.sh` | 4×GPU | BAGEL uses a per-stage deploy YAML that overrides top-level vLLM engine arguments — tune `gpu_memory_utilization` and batch sizes directly in the stage config file. --- ## Omni-Modality Models ### Qwen3-Omni-30B-A3B Thinker | Property | Detail | |----------|--------| | **Hugging Face ID** | `Qwen/Qwen3-Omni-30B-A3B-Instruct` | | **Architecture** | Omni-modality Thinker with Mixture-of-Experts (30B total, 3B active) | | **Modality** | Text + Image + Audio + Video (understand and generate) | | **Trainer type** | GSPO — Group Sampling Policy Optimization (verl-native PPO-style) | | **FSDP** | Full FSDP with LoRA (rank 64), param and optimizer CPU offload | | **Rollout** | vLLM-Omni TP=4 colocated on the same GPUs as the FSDP actor | | **Stage config** | `examples/gspo_trainer/qwen3_omni/qwen3_omni_thinker_only.yaml` (`gpu_memory_utilization=0.4`) | | **External module** | `verl_omni.models.transformers.qwen3_omni_thinker` | For version requirements and detailed setup instructions, see [`examples/gspo_trainer/README.md`](../../examples/gspo_trainer/README.md). **Supported trainers:** | Trainer | Example script | GPU config | |---------|---------------|------------| | GSPO (math) | `examples/gspo_trainer/qwen3_omni/run_qwen3_omni_thinker_gspo_lora.sh` | 4×H100/H200 80GB | The actor (FSDP, 30B + LoRA r=64 with offloading) and vLLM-Omni rollout (TP=4) colocate on the same 4 GPUs. `gpu_memory_utilization` is kept at `0.4` in the stage config to leave headroom for the FSDP actor. --- ## Model Architecture Summary | Model | Architecture | Text encoder | |-------|-------------|-------------| | Qwen-Image | MM-DiT | Qwen2 + T5 | | SD3.5 Medium | MM-DiT | CLIP-L + CLIP-G + T5 | | Wan2.2-TI2V-5B | Wan DiT | T5 | | BAGEL | Unified MM | — | | Qwen3-Omni-30B | Omni MoE | Qwen3 | --- ## Reward Models | Reward model | HF ID / Source | Modality | Used by | Deployment | |-------------|---------------|----------|---------|------------| | Qwen3-VL-8B-Instruct | `Qwen/Qwen3-VL-8B-Instruct` | Vision-Language | Qwen-Image (all trainers) | vLLM, TP=4, colocated | | Qwen2.5-VL-3B-Instruct | `Qwen/Qwen2.5-VL-3B-Instruct` | Vision-Language | SD3.5 (Flow-GRPO) | vLLM, TP=1, dedicated pool | | HPSv3 | Local `.safetensors` | Vision (aesthetic) | Wan2.2 (DanceGRPO) | Local safetensors load | | HTTP scorer | External HTTP service | Any | Any model | Gunicorn/Flask, pickle protocol | | JPEG incompressibility | Rule-based | Image stats | Any diffusion model | No model process needed | For end-to-end instructions on setting up each reward, see the respective trainer's README in `examples/`. --- ## Which Trainer for Which Model? | Algorithm | Qwen-Image | SD3.5 | Wan2.2 | BAGEL | Qwen3-Omni | |-----------|:---:|:---:|:---:|:---:|:---:| | Flow-GRPO | ✅ | ✅ | — | ✅ | — | | Flow-DPPO | ✅ | — | — | — | — | | GRPO-Guard | ✅ | — | — | — | — | | Mix-GRPO | ✅ | — | — | — | — | | DanceGRPO | — | — | ✅ | — | — | | Diffusion-DPO | ✅ | ✅ | — | — | — | | DiffusionNFT | ✅ | — | — | — | — | | GSPO | — | — | — | — | ✅ |