How to Integrate a New Diffusion Model for FlowGRPO Training

Last updated: 06/02/2026.

This guide walks you through everything required to integrate a new diffusion model into VeRL-Omni so it can be trained end-to-end with the FlowGRPO algorithm. The contracts described below (registry hooks, adapter classmethods, scheduler choice, custom-output field names) are specific to the FlowGRPO trainer; other RL algorithms may impose different requirements. Use integrating_a_new_policy_gradient_algorithm_for_diffusion_model.md for PPO-like policy-gradient algorithms, and integrating_a_new_direct_preference_algorithm_for_diffusion_model.md for direct-preference algorithms.

We use the Qwen-Image integration (verl_omni/pipelines/qwen_image_flow_grpo/) as the worked example throughout. Read the source alongside this guide — the code is the canonical reference.


TL;DR

A new model needs three files in one new package plus two registry hooks. The same adapters work with both the default diffusers + FSDP2 backend and the optional VeOmni backend — backend selection is purely a configuration concern.

verl_omni/pipelines/<model>_flow_grpo/
├── __init__.py                       # re-exports both adapters
├── diffusers_training_adapter.py     # subclass of DiffusionModelBase
└── vllm_omni_rollout_adapter.py      # subclass of VllmOmniPipelineBase

Both adapters are picked up by string-based registries that dispatch on the pair (model_index.json::_class_name, algorithm). By default the algorithm is read from actor_rollout_ref.model.algorithm, which is the source-of-truth in the current trainer wiring. Register the package by importing it from verl_omni/pipelines/__init__.py, add an example launch script, and add a smoke test.


Mental Model

There are two execution contexts you must serve, and they share the same algorithm (FlowGRPO) but use different runtimes:

Context

Runtime

What you implement

Rollout (sampling trajectories)

vllm-omni

VllmOmniPipelineBase subclass — runs the SDE loop and returns latents, log-probs, and prompt embeddings.

Training (per-step forward + loss)

FSDP + diffusers

DiffusionModelBase subclass — re-runs one denoising step per micro-batch slot to compute fresh log-probs for the policy gradient.

The trainer runtime can also be VeOmni’s FSDP2-based DiT trainer; see § 5.3. The training-adapter contract (prepare_model_inputs / forward_and_sample_previous_step) is identical on both backends.

  ┌─────────────────────────┐                ┌──────────────────────────┐
  │ Rollout worker          │   trajectory   │ Trainer worker           │
  │ (vllm-omni)             │ ─────────────▶ │ (FSDP + diffusers)       │
  │                         │  latents,      │                          │
  │ VllmOmniPipelineBase    │  log_probs,    │ DiffusionModelBase       │
  │  └─ diffuse() + SDE     │  prompt embeds │  └─ prepare_model_inputs │
  │                         │                │  └─ forward_and_sample…  │
  └─────────────────────────┘                └──────────────────────────┘

The two adapters must agree on:

  • Architecture string (the first @register(...) argument). It must match model_index.json::_class_name exactly. For Qwen-Image this is "QwenImagePipeline".

  • Algorithm string (the algorithm= keyword on @register(...)). For this guide the value is always "flow_grpo". When integrating a different RL algorithm use the appropriate algorithm name and the matching algorithm-family guide.

  • Prompt-encoding format of the embeddings shipped through the agent loop. The rollout always returns padded (B, L, D) + (B, L) mask; the training adapter is free to convert to whatever the transformer needs.

  • Scheduler choice so log-probs computed on each side are comparable.


Prerequisites

Before you start, the new model must already be supported upstream by:

  • diffusers — provides the transformer (<Name>Transformer2DModel), scheduler config, and a reference inference pipeline.

  • vllm-omni — provides the rollout-side <Name>Pipeline. Your rollout adapter inherits from this class.

If either is missing, upstream the model first. Nothing below will work without them.


Step 1 — Read the Upstream Pipelines and Note the Differences

Open the upstream diffusers pipeline (__call__) and the vllm-omni rollout pipeline (forward). Answer these questions before writing any code — the answers determine every helper you need:

  1. Latent shape. Packed sequence (B, seq, 4·C) (Qwen-Image) or 4-D (B, C, H, W)?

  2. Text encoder output. Fixed (B, L, D) plus a mask, or a list of variable-length per-sample tensors?

  3. Transformer signature. What kwargs does it accept? Any extras (img_shapes, txt_seq_lens, guidance, …)?

  4. Timestep convention. t/1000? (1000 - t)/1000? Something else?

  5. Output sign. Is the predicted velocity / noise negated before being passed to the scheduler?

  6. CFG flavour. “True CFG” with renormalisation? Standard CFG with optional norm clipping? At what threshold is CFG active?

  7. VAE post-processing. latents / scaling_factor + shift_factor, latents / std + mean, or other?

  8. Prompt template. Does the upstream _encode_prompt prepend a hard-coded system prompt? Whatever it does, your data preprocessor must match exactly so training-time and inference-time tokenisation agree.

Anything model-specific belongs inside the model’s own package; anything reusable belongs in pipelines/utils.py or pipelines/model_base.py.


Step 2 — Scaffold the Package

Create the new package and start by copying verl_omni/pipelines/qwen_image_flow_grpo/ as a template:

verl_omni/pipelines/<model>_flow_grpo/
├── __init__.py
├── diffusers_training_adapter.py
└── vllm_omni_rollout_adapter.py

The __init__.py re-exports both adapters so the @register(...) decorators run on import — follow the existing Qwen-Image pattern:

from .diffusers_training_adapter import MyModel
from .vllm_omni_rollout_adapter import MyModelPipelineWithLogProb

__all__ = ["MyModel", "MyModelPipelineWithLogProb"]

Finally, register the package by adding a star-import to verl_omni/pipelines/__init__.py so both registries learn about your model when verl_omni.pipelines is imported:

from .qwen_image_flow_grpo import *  # noqa: F401, F403
from .my_model_flow_grpo import *    # noqa: F401, F403

__all__ = list(qwen_image_flow_grpo.__all__)
__all__ += my_model_flow_grpo.__all__

Note. vllm_omni is a hard dependency of verl-omni, so the rollout adapter import does not need to be guarded.


Step 3 — Write diffusers_training_adapter.py

Subclass DiffusionModelBase, decorate it with the architecture string, and implement the four classmethods:

@DiffusionModelBase.register("MyModelPipeline", algorithm="flow_grpo")
class MyModel(DiffusionModelBase):
    @classmethod
    def build_scheduler(cls, model_config): ...

    @classmethod
    def set_timesteps(cls, scheduler, model_config, device): ...

    @classmethod
    def prepare_model_inputs(cls, module, model_config, latents, timesteps,
                             prompt_embeds, prompt_embeds_mask,
                             negative_prompt_embeds, negative_prompt_embeds_mask,
                             micro_batch, step): ...

    @classmethod
    def forward_and_sample_previous_step(cls, module, scheduler, model_config,
                                         model_inputs, negative_model_inputs,
                                         scheduler_inputs, step): ...

3.1 build_scheduler and set_timesteps

Reuse FlowMatchSDEDiscreteScheduler unless you have a strong reason not to — FlowGRPO only requires a flow-matching scheduler that exposes sample_previous_step(...).

Compute image_seq_len and mu exactly as the upstream diffusers pipeline does. If they drift, the training-time noise schedule will not match deployment.

3.2 prepare_model_inputs

This method receives the full batched tensors for the entire denoising trajectory (latents of shape (B, T, ...), timesteps of shape (B, T)) together with the step index. Your implementation is responsible for slicing to the current step, e.g. latents[:, step] and timesteps[:, step], before building model inputs. The typical steps are:

  1. Slice latents[:, step] and timesteps[:, step] for the current denoising step.

  2. Apply per-model timestep rescaling.

  3. Convert padded prompt embeddings + mask to whatever format your transformer expects.

  4. Build the positive input dict and, if CFG is enabled, the negative input dict (same latent + timestep, negative text features).

The dict keys must match the kwargs of the diffusers transformer class verbatim — the FSDP engine calls module(**model_inputs).

3.3 forward_and_sample_previous_step

Call the transformer once for the positive prompt; if CFG is active, call it again for the negative prompt and combine them. Always finish with scheduler.sample_previous_step(...) and return the triple (log_prob, prev_sample_mean, std_dev_t) — that is what PPODiffusersFSDPEngine.prepare_model_outputs consumes.

Tip. If your transformer returns a list (one element per sample), wrap the call in a small helper that re-stacks to (B, C, H, W) so the rest of the pipeline keeps a single tensor convention.


Step 4 — Write vllm_omni_rollout_adapter.py

Subclass the upstream <Name>Pipeline from vllm_omni.diffusion.models and decorate with the same architecture/algorithm pair:

@VllmOmniPipelineBase.register("MyModelPipeline", algorithm="flow_grpo")
class MyModelPipelineWithLogProb(MyModelPipeline):
    ...

Your subclass must do four things:

  1. Replace the upstream scheduler (typically Euler-based) with FlowMatchSDEDiscreteScheduler.

  2. Override encode_prompt to accept pre-tokenised prompt_ids and the tokenizer attention mask (the agent loop ships these — never raw strings). Always return a padded (B, L, D) tensor and a (B, L) mask so the agent loop can ferry them as plain tensors.

  3. Implement diffuse(...) — the SDE loop that optionally applies CFG and collects all_latents, all_log_probs, and all_timesteps.

  4. Override forward(req, ...) so that:

    • Sampling parameters come from req.sampling_params (use extra_args for SDE-specific knobs).

    • prompt_embeds, prompt_embeds_mask, negative_prompt_embeds, and negative_prompt_embeds_mask are placed in the returned DiffusionOutput.custom_output. The diffusion agent loop (diffusion_agent_loop.py) reads these field names verbatim — do not rename them.


Step 5 — Configure the Pipeline

No code changes are required in the trainer launcher itself. At runtime:

  • DiffusionModelConfig.architecture is auto-detected from model_index.json.

  • DiffusionModelConfig.algorithm is set by actor_rollout_ref.model.algorithm (default flow_grpo in diffusion_model.yaml). algorithm.adv_estimator is templated to read from this same value.

  • DiffusionModelBase.get_class(model_config) resolves to the training adapter registered under (architecture, algorithm).

  • VllmOmniPipelineBase.get_class(architecture, algorithm) resolves to the rollout adapter and is consumed by the vllm-omni rollout worker.

5.1 Pipeline Config Knobs

Pipeline sampling parameters live under actor_rollout_ref.rollout.pipeline.* (mapped to DiffusionPipelineConfig) and are mirrored in actor_rollout_ref.model.pipeline.*.

Always copy the defaults from the upstream HuggingFace model card so RL exploration starts from a known-good operating point.

Knob

Notes

pipeline.height, pipeline.width

Must be a multiple of vae_scale_factor * 2.

pipeline.num_inference_steps

Steps used during training rollout. Default 10 — do not override unless you know why.

actor_rollout_ref.rollout.val_kwargs.pipeline.num_inference_steps

Full-quality steps for validation only (e.g. 50).

pipeline.true_cfg_scale

For Qwen-Image-style true CFG (e.g. 4.0). Default 1.0 (disabled).

pipeline.guidance_scale

For pipelines whose upstream uses guidance_scale. Default null defers to the pipeline.

pipeline.max_sequence_length

Must accommodate the templated prompt length your tokenizer produces.

Config hygiene. Any new field on DiffusionPipelineConfig must also be added to:

  • diffusion_rollout.yaml — both the top-level pipeline: section and val_kwargs.pipeline:.

  • diffusion_model.yaml — its pipeline: section, using ${oc.select:actor_rollout_ref.rollout.pipeline.<field>,<default>}.

5.2 Example Launch Script and Data Preprocessor

Ship a runnable example so users can launch training without trial and error. Use examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh and examples/flowgrpo_trainer/data_process/qwenimage_ocr.py as templates.

The data preprocessor’s tokenisation must match the upstream _encode_prompt exactly — same chat template, same special tokens, same enable_thinking flag, etc. Mismatches here cause silent reward collapse.

5.3 Use the VeOmni Backend

Backend selection is orthogonal to model integration: the adapters you wrote in Steps 3–4 work unchanged regardless of whether the actor runs on the default diffusers + FSDP2 engine or on VeOmni. Switching is a configuration concern handled by a few Hydra overrides at launch time.

What VeOmni reuses from your model adapter

  • DiffusionModelBase subclass (Step 3) — used verbatim. The VeOmni engine calls the same prepare_model_inputs / forward_and_sample_previous_step contract.

  • VllmOmniPipelineBase subclass (Step 4) — used verbatim. Rollout always runs in vllm-omni, independent of the actor backend.

  • FlowMatchSDEDiscreteScheduler (Step 3.1) — used verbatim.

What VeOmni requires that diffusers does not

  1. Upstream support in VeOmni. Just as diffusers must provide your <Name>Transformer2DModel, VeOmni must be able to load your model via its DiTTrainer path. If VeOmni does not yet support the architecture, upstream it there first (the diffusers prerequisite from Step 1 still applies for rollout — both upstreams are required).

  2. config_path / transformer_subfolder. The VeOmni engine loads the transformer from <local_path>/<transformer_subfolder> and the config from config_path (falling back to the weights path). These fields are already on DiffusionModelConfig and are shared with the diffusers backend, so no new model-specific fields are needed.

Launching with the VeOmni backend

diffusion/model_engine=veomni_diffusion switches the entire actor / reference Hydra schema; the other actor-engine fields then live under actor_rollout_ref.actor.veomni_config.* and actor_rollout_ref.ref.veomni_config.*:

python3 -m verl_omni.trainer.main_diffusion \
    diffusion/model_engine=veomni_diffusion \
    actor_rollout_ref.actor.strategy=veomni \
    actor_rollout_ref.actor.veomni_config.strategy=veomni \
    actor_rollout_ref.ref.veomni_config.strategy=veomni \
    ...  # everything else identical to your diffusers/FSDP2 recipe

See examples/flowgrpo_trainer/run_qwen_image_ocr_veomni.sh for a complete VeOmni recipe that mirrors run_qwen_image_ocr.sh line-for-line — the diff is only the engine-selection fields. Install instructions for VeOmni alongside vLLM 0.20.2 are in docs/start/install.md.

Mixing override schemas — don’t

diffusion/model_engine=veomni_diffusion selects the Hydra schema as a whole. Do not mix actor.fsdp_config.* and actor.veomni_config.* overrides in the same run — the fields for the other engine will be rejected as unknown keys at config-resolution time.


Step 6 — Add a Smoke Test

Add an end-to-end smoke test under tests/special_e2e/ modelled on tests/special_e2e/run_flowgrpo_qwen_image.sh. The script must exercise the full pipeline against a tiny-random/<ModelName> checkpoint:

  1. Generate dummy parquet data via tests/special_e2e/create_dummy_diffusion_data.py.

  2. Launch verl_omni.trainer.main_diffusion with model-specific knobs (architecture, prompt template, CFG parameters, sequence lengths).

  3. Assert exit code 0.

Then register the script in tests/gpu_smoke/run_gpu_smoke_tests.sh as a new numbered test entry. The runner already exports PYTHONUNBUFFERED=1 and RAY_DEDUP_LOGS=0 for readable logs — no need to set them in your script.


When to Refactor Instead of Duplicating

If you are copy-pasting more than a few lines from another model’s adapter, prefer one of:

  • Extending pipelines/utils.py with a generic helper.

  • Adding a method to DiffusionModelBase or VllmOmniPipelineBase so future models do not re-discover the contract.

  • Promoting a helper to a shared module once a second model needs it.

Refactor opportunistically: keep model-specific quirks local until a third model demands the same code, then unify.


Final Checklist

Before opening the PR, confirm every box:

  • [ ] verl_omni/pipelines/<model>_flow_grpo/ contains __init__.py, diffusers_training_adapter.py, and vllm_omni_rollout_adapter.py.

  • [ ] verl_omni/pipelines/__init__.py imports the new package.

  • [ ] Architecture string on both @register(...) decorators matches model_index.json::_class_name; the algorithm= keyword matches the algorithm you are integrating against (e.g. "flow_grpo" for FlowGRPO).

  • [ ] Scheduler returns latents in fp32 (no model_output.dtype cast in step()), diffuse() casts to model dtype before transformer forward and casts noise_pred to float32 before scheduler.step() — see Common Pitfalls.

  • [ ] Any new DiffusionPipelineConfig field is mirrored in both diffusion_rollout.yaml and diffusion_model.yaml.

  • [ ] Example launch script in examples/flowgrpo_trainer/ plus a matching data preprocessor under examples/flowgrpo_trainer/data_process/.

  • [ ] Smoke test tests/special_e2e/run_<algo>_<model>.sh exists and is wired into tests/gpu_smoke/run_gpu_smoke_tests.sh.

  • [ ] Docs updated (this guide if the contract changed; the relevant docs/algo/... page if you introduce algorithm-level concepts).