# Supported Models

Last updated: 06/26/2026.

VeRL-Omni supports RL post-training for generative models across image, video,
audio, and omni modalities. This page catalogues every model with a ready-to-run
example, its architecture and pipeline details, supported trainers, and hardware
requirements.

---

## Diffusion Image Models

### Qwen-Image

| Property | Detail |
|----------|--------|
| **Hugging Face ID** | `Qwen/Qwen-Image` |
| **Architecture** | MM-DiT (Multi-Modal Diffusion Transformer) with joint image-text attention |
| **Modality** | Text → Image |
| **Pipeline** | Flow-matching with True CFG and distilled guidance embedding |
| **Text encoder** | Qwen2-style tokenizer + T5-style encoder |
| **Resolution** | Variable (512×512, 1024×1024) |

**Supported trainers:**

| Trainer | Example script | GPU config |
|---------|---------------|------------|
| Flow-GRPO (LoRA) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU |
| Flow-GRPO (full) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr.sh` | 4×H200 |
| Flow-GRPO (async) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_async_reward.sh` | 5×GPU |
| Flow-GRPO (multi-node) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_multi_node.sh` | 2 nodes × 4 GPU |
| Flow-GRPO (SP=2) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_sp2.sh` | 4×GPU |
| Flow-GRPO (rollout-corr) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_rollout_corr.sh` | 4×GPU |
| Flow-GRPO (VeOmni) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_veomni.sh` | 64×H100 |
| Flow-GRPO (NPU) | `examples/flowgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_npu.sh` | 8×NPU |
| Flow-DPPO | `examples/flowdppo_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU |
| GRPO-Guard | `examples/grpoguard_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU |
| Mix-GRPO | `examples/mixgrpo_trainer/qwen_image/run_qwen_image_ocr_lora_mixgrpo.sh` | 4×GPU |
| Diffusion-DPO | `examples/dpo_trainer/qwen_image/run_qwen_image_online_dpo_lora.sh` | 4×GPU |
| DiffusionNFT | `examples/diffusionnft_trainer/qwen_image/run_qwen_image_ocr_lora.sh` | 4×GPU |

**Reward model:** `Qwen/Qwen3-VL-8B-Instruct` (OCR VLM judge, TP=4 colocated).

### Stable Diffusion 3.5 Medium

| Property | Detail |
|----------|--------|
| **Hugging Face ID** | `stabilityai/stable-diffusion-3.5-medium` |
| **Architecture** | MM-DiT with dual CLIP + T5 text encoders |
| **Modality** | Text → Image |
| **Pipeline** | Flow-matching (distilled guidance only, no True CFG) |
| **Text encoder** | CLIP-L, CLIP-G, T5-XXL |
| **Default resolution** | 384×384 |
| **Chat template** | Custom — extracts raw user content only (no system prompt) |

**Supported trainers:**

| Trainer | Example script | GPU config |
|---------|---------------|------------|
| Flow-GRPO (LoRA) | `examples/flowgrpo_trainer/sd35/run_sd35_medium_ocr_lora.sh` | 3×GPU (2 actor+rollout, 1 reward) |
| Diffusion-DPO (offline) | `examples/dpo_trainer/sd35/run_sd35_medium_offline_dpo_lora.sh` | 3×GPU |

**Reward model:** `Qwen/Qwen2.5-VL-3B-Instruct` (OCR VLM judge, TP=1, dedicated pool).

---

## Diffusion Video Models

### Wan2.2-TI2V-5B

| Property | Detail |
|----------|--------|
| **Hugging Face ID** | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` |
| **Architecture** | Wan-style DiT with separate self-attention and cross-attention |
| **Modality** | Text → Video |
| **Pipeline** | Flow-matching with spatiotemporal latents |
| **Text encoder** | T5 |
| **Latent stream** | Spatiotemporal video latents |
| **Prompt stream** | Text-encoder tokens (cross-attention KV) |
| **SDE variants** | `dance_sde` (recommended, score-based), `sde` (Flow-GRPO), `cps` (consistency-preserving) |

**Supported trainers:**

| Trainer | Example script | GPU config |
|---------|---------------|------------|
| DanceGRPO (HPSv3) | `examples/dancegrpo_trainer/wan22/run_wan22_5b_t2v_hpsv3_npu.sh` | 8×NPU (Ascend 800T A2) |

**Reward model:** HPSv3 (Human Preference Score v3), loaded from a local
safetensors checkpoint placed at `$WORKSPACE/CKPT/HPSv3/HPSv3.safetensors`.
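
A preflight check can catch a misplaced checkpoint before a long job is launched. The sketch below only resolves the path stated above and verifies that the file exists. The helper names are hypothetical and are not part of VeRL-Omni.

```python
import os
from pathlib import Path
from typing import Optional


def hpsv3_checkpoint_path(workspace: Optional[str] = None) -> Path:
    """Resolve $WORKSPACE/CKPT/HPSv3/HPSv3.safetensors (hypothetical helper)."""
    root = workspace if workspace is not None else os.environ.get("WORKSPACE", "")
    if not root:
        raise RuntimeError("WORKSPACE is not set; export it before launching.")
    return Path(root) / "CKPT" / "HPSv3" / "HPSv3.safetensors"


def check_hpsv3_checkpoint(workspace: Optional[str] = None) -> Path:
    """Return the checkpoint path, or raise if the file is missing."""
    path = hpsv3_checkpoint_path(workspace)
    if not path.is_file():
        raise FileNotFoundError(f"HPSv3 checkpoint not found at {path}")
    return path


if __name__ == "__main__":
    try:
        print(f"Found HPSv3 checkpoint: {check_hpsv3_checkpoint()}")
    except (RuntimeError, FileNotFoundError) as err:
        print(f"Preflight failed: {err}")
```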

HPSv3 is the only reward validated with this model. Other reward functions
(e.g. OCR, aesthetic score) can be plugged in by changing
`reward.custom_reward_function`, but they have not been validated here.
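
To illustrate the kind of scoring logic that could be plugged in, here is a minimal sketch of a text-match reward based on normalized edit distance. The function names, the `(recognized, target)` signature, and the whitespace/case normalization are assumptions made for illustration. The interface that `reward.custom_reward_function` actually expects should be taken from the trainer README, and the OCR step that produces `recognized` is assumed to happen elsewhere.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def _normalize(text: str) -> str:
    # Assumption: ignore case and all whitespace when comparing.
    return "".join(text.lower().split())


def ocr_text_reward(recognized: str, target: str) -> float:
    """Hypothetical reward in [0, 1]: 1 minus edit distance over target length."""
    recognized, target = _normalize(recognized), _normalize(target)
    if not target:
        return 0.0
    return max(0.0, 1.0 - levenshtein(recognized, target) / len(target))


if __name__ == "__main__":
    print(ocr_text_reward("Hello World", "hello world"))
    print(ocr_text_reward("helo", "hello"))
```

Clipping at zero keeps the reward bounded when the recognized text is much longer than the target.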

---

## Unified Multimodal Models

### BAGEL

| Property | Detail |
|----------|--------|
| **Architecture** | Unified multimodal understanding + generation |
| **Modality** | Text + Image (understand and generate) |
| **Deploy config** | `examples/flowgrpo_trainer/bagel/bagel_deploy_config.yaml` |
| **Rollout** | vLLM-Omni with per-stage YAML for engine memory/batching control |

**Supported trainers:**

| Trainer | Example script | GPU config |
|---------|---------------|------------|
| Flow-GRPO (LoRA, OCR) | `examples/flowgrpo_trainer/bagel/run_bagel_ocr_lora.sh` | 4×GPU |
| Flow-GRPO (LoRA, PickScore) | `examples/flowgrpo_trainer/bagel/run_bagel_pickscore_lora.sh` | 4×GPU |

BAGEL uses a per-stage deploy YAML whose values override the top-level vLLM
engine arguments, so tune `gpu_memory_utilization` and batch sizes directly in
the stage config file.
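
The precedence described above can be pictured as a dictionary merge in which stage values win. This is only a sketch of the semantics, not VeRL-Omni's loader: the function name is hypothetical, the numeric values are illustrative, and `max_num_seqs` is a vLLM engine argument that the BAGEL stage file is merely assumed to expose.

```python
from typing import Any, Dict


def resolve_stage_engine_args(top_level: Dict[str, Any],
                              stage_overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return engine args for one stage; stage values take precedence."""
    merged = dict(top_level)        # copy so the caller's dict is untouched
    merged.update(stage_overrides)  # per-stage values override top-level ones
    return merged


if __name__ == "__main__":
    top_level = {"gpu_memory_utilization": 0.9, "max_num_seqs": 256}
    stage = {"gpu_memory_utilization": 0.4}  # illustrative value only
    print(resolve_stage_engine_args(top_level, stage))
```

Under these semantics, a value set only at the top level has no effect on a stage that defines the same key, which is why tuning belongs in the stage file.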

---

## Omni-Modality Models

### Qwen3-Omni-30B-A3B Thinker

| Property | Detail |
|----------|--------|
| **Hugging Face ID** | `Qwen/Qwen3-Omni-30B-A3B-Instruct` |
| **Architecture** | Omni-modality Thinker with Mixture-of-Experts (30B total, 3B active) |
| **Modality** | Text + Image + Audio + Video (understand and generate) |
| **Trainer type** | GSPO (Group Sequence Policy Optimization), verl-native PPO-style |
| **FSDP** | Full FSDP with LoRA (rank 64), param and optimizer CPU offload |
| **Rollout** | vLLM-Omni TP=4 colocated on the same GPUs as the FSDP actor |
| **Stage config** | `examples/gspo_trainer/qwen3_omni/qwen3_omni_thinker_only.yaml` (`gpu_memory_utilization=0.4`) |
| **External module** | `verl_omni.models.transformers.qwen3_omni_thinker` |

For version requirements and detailed setup instructions, see
[`examples/gspo_trainer/README.md`](../../examples/gspo_trainer/README.md).

**Supported trainers:**

| Trainer | Example script | GPU config |
|---------|---------------|------------|
| GSPO (math) | `examples/gspo_trainer/qwen3_omni/run_qwen3_omni_thinker_gspo_lora.sh` | 4×H100 (80 GB) or 4×H200 |

The actor (FSDP, 30B + LoRA r=64 with offloading) and vLLM-Omni rollout (TP=4)
colocate on the same 4 GPUs. `gpu_memory_utilization` is kept at `0.4` in the
stage config to leave headroom for the FSDP actor.
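
As a rough worked example of that headroom, the sketch below splits a nominal 80 GB device according to `gpu_memory_utilization`. It assumes the setting is the fraction of total device memory that vLLM may claim; the function name is hypothetical, and the "remaining" figure is an upper bound that ignores CUDA context, activations, and fragmentation.

```python
from typing import Dict


def colocated_memory_budget(total_gb: float,
                            gpu_memory_utilization: float) -> Dict[str, float]:
    """Split one GPU's memory between the rollout engine and everything else."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    rollout_gb = total_gb * gpu_memory_utilization
    return {"rollout_gb": rollout_gb, "remaining_gb": total_gb - rollout_gb}


if __name__ == "__main__":
    budget = colocated_memory_budget(total_gb=80.0, gpu_memory_utilization=0.4)
    print(f"rollout: {budget['rollout_gb']:.1f} GB, "
          f"remaining: {budget['remaining_gb']:.1f} GB")
    # prints "rollout: 32.0 GB, remaining: 48.0 GB"
```

With these numbers, at most 48 GB per GPU is left for the FSDP actor's sharded parameters, LoRA state, and activations. The CPU offload of parameters and optimizer state noted above further reduces what the actor needs to hold on the device.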

---

## Model Architecture Summary

| Model | Architecture | Text encoder |
|-------|-------------|-------------|
| Qwen-Image | MM-DiT | Qwen2 + T5 |
| SD3.5 Medium | MM-DiT | CLIP-L + CLIP-G + T5 |
| Wan2.2-TI2V-5B | Wan DiT | T5 |
| BAGEL | Unified MM | — |
| Qwen3-Omni-30B | Omni MoE | Qwen3 |

---

## Reward Models

| Reward model | HF ID / Source | Modality | Used by | Deployment |
|-------------|---------------|----------|---------|------------|
| Qwen3-VL-8B-Instruct | `Qwen/Qwen3-VL-8B-Instruct` | Vision-Language | Qwen-Image (all trainers) | vLLM, TP=4, colocated |
| Qwen2.5-VL-3B-Instruct | `Qwen/Qwen2.5-VL-3B-Instruct` | Vision-Language | SD3.5 (Flow-GRPO) | vLLM, TP=1, dedicated pool |
| HPSv3 | Local `.safetensors` | Vision (aesthetic) | Wan2.2 (DanceGRPO) | Local safetensors load |
| HTTP scorer | External HTTP service | Any | Any model | Gunicorn/Flask, pickle protocol |
| JPEG incompressibility | Rule-based | Image stats | Any diffusion model | No model process needed |

For end-to-end instructions on setting up each reward, see the respective
trainer's README in `examples/`.
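
For the HTTP scorer row, the sketch below shows what a pickle-over-HTTP exchange could look like on the client side. The payload keys (`images`, `prompts`, `scores`), the URL, and the helper names are all assumptions for illustration; the real schema is defined by whichever scorer service is deployed. `fake_scorer` is an in-process stand-in so the example runs without a server.

```python
import pickle
import urllib.request
from typing import List


def encode_request(images: List[bytes], prompts: List[str]) -> bytes:
    """Pickle a batch of encoded images and prompts (hypothetical schema)."""
    if len(images) != len(prompts):
        raise ValueError("images and prompts must have the same length")
    return pickle.dumps({"images": images, "prompts": prompts})


def build_request(url: str, body: bytes) -> urllib.request.Request:
    """Wrap the pickled body in a POST request without sending it."""
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def decode_response(body: bytes) -> List[float]:
    """Unpickle the scorer's reply. Only do this for a service you control."""
    payload = pickle.loads(body)
    return [float(score) for score in payload["scores"]]


def fake_scorer(body: bytes) -> bytes:
    """Stand-in for the server: scores each image by its byte length."""
    payload = pickle.loads(body)
    return pickle.dumps({"scores": [len(img) for img in payload["images"]]})


if __name__ == "__main__":
    body = encode_request([b"\x00" * 3, b"\x00" * 5], ["a cat", "a dog"])
    request = build_request("http://localhost:8000/score", body)  # hypothetical URL
    # A real client would send it with urllib.request.urlopen(request).
    print(decode_response(fake_scorer(request.data)))
```

Because unpickling can execute arbitrary code, a pickle-based protocol is only appropriate between trusted processes on a private network.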

---

## Which Trainer for Which Model?

| Algorithm | Qwen-Image | SD3.5 | Wan2.2 | BAGEL | Qwen3-Omni |
|-----------|:---:|:---:|:---:|:---:|:---:|
| Flow-GRPO | ✅ | ✅ | — | ✅ | — |
| Flow-DPPO | ✅ | — | — | — | — |
| GRPO-Guard | ✅ | — | — | — | — |
| Mix-GRPO | ✅ | — | — | — | — |
| DanceGRPO | — | — | ✅ | — | — |
| Diffusion-DPO | ✅ | ✅ | — | — | — |
| DiffusionNFT | ✅ | — | — | — | — |
| GSPO | — | — | — | — | ✅ |
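
The same matrix can be kept as a small lookup when scripting over the examples. The sketch below transcribes the table into a dictionary; the variable and function names are hypothetical and nothing here is provided by VeRL-Omni itself.

```python
from typing import Dict, List

# Transcribed from the table above: algorithm -> models with an example script.
SUPPORT: Dict[str, List[str]] = {
    "Flow-GRPO": ["Qwen-Image", "SD3.5", "BAGEL"],
    "Flow-DPPO": ["Qwen-Image"],
    "GRPO-Guard": ["Qwen-Image"],
    "Mix-GRPO": ["Qwen-Image"],
    "DanceGRPO": ["Wan2.2"],
    "Diffusion-DPO": ["Qwen-Image", "SD3.5"],
    "DiffusionNFT": ["Qwen-Image"],
    "GSPO": ["Qwen3-Omni"],
}


def trainers_for(model: str) -> List[str]:
    """List the algorithms that have an example for the given model."""
    return [algo for algo, models in SUPPORT.items() if model in models]


if __name__ == "__main__":
    for name in ("Qwen-Image", "SD3.5", "Wan2.2", "BAGEL", "Qwen3-Omni"):
        print(f"{name}: {', '.join(trainers_for(name))}")
```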